# FINE-TUNING & EVALUATION PIPELINE FOR BINARY “Answer vs. Non-Answer” DETECTION 🚀

This notebook is the most comprehensive one in the repo: it contains all 12 model fine-tunings and tests. The Mono-Criterion (single-metric) models are tuned and tested first, followed by the Dual-Criterion (loss + F₁) models, both across the three model regimes (Pretrained-ALL, PLDQA, and BERT-base).

Each section:
1. Checks GPU availability.
2. Loads pre-split train/val/test CSVs.
3. Tokenizes spans and builds Hugging Face datasets.
4. Defines metrics, W&B sweeps, and early stopping callbacks.
5. Runs hyperparameter searches or loads best runs from W&B.
6. Evaluates on the held-out test set (saving metrics, classification reports, confusion matrices).

**Note on redundancy**: The Colab runtime disconnected repeatedly, and after each disconnect the notebook had to reinstall dependencies and re-run from scratch. Each fine-tuning block (e.g., ALL, PLDQA, or BERT-base alone) is therefore repeated after a kernel restart so that all experiments remain in one document, which is why many install/import sections appear multiple times.


In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Mon Apr 21 10:56:47 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   58C    P8             11W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

# MONO-CRITERION MODELS (FINE-TUNING AND TESTING)

## A. Pretrained ALL

In [None]:
!pip install "numpy<2.0" # run this and restart the kernel if Colab throws a numpy vs. datasets compatibility error

In [None]:
!pip install transformers datasets evaluate wandb -q

import pandas as pd
import torch
import numpy as np
from datasets import Dataset
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments, EarlyStoppingCallback
)
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedShuffleSplit
import evaluate
import wandb

print("💻 Device in use:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")

# ─── 2) Load & preprocess ─────────────────────────

df = pd.read_csv("/content/df_coalesced_labels_feature_engineered_18_april.csv")
df['label_binary'] = (df['label'] == 'answer').astype(int)
df = df.dropna(subset=['span', 'full_text'])
df['prev_turn'] = df['prev_turn'].fillna('')
df['span']      = df['span'].fillna('')
df['next_turn'] = df['next_turn'].fillna('')
df['input_text'] = df['span']

# ─── 3) Build group_meta for stratification ──────

group_meta = (
    df.groupby('debate_unit_id')
      .agg({
        'is_government': lambda x: x.mode()[0],
        'span':         lambda x: np.mean(x.str.split().str.len())
      })
      .reset_index()
)
group_meta['length_bin'] = pd.qcut(group_meta['span'], q=3,
                                   labels=['short','medium','long'])
group_meta['stratify_key'] = (
    group_meta['is_government'].astype(str) + "_" +
    group_meta['length_bin'].astype(str)
)

# ─── 4) Triple split: outer (test) then inner (val) ────

# 4a) Outer: trainval vs test
outer = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
tv_idx, test_idx = next(outer.split(group_meta, group_meta['stratify_key']))
trainval_meta = group_meta.iloc[tv_idx].reset_index(drop=True)
test_meta     = group_meta.iloc[test_idx].reset_index(drop=True)

# 4b) Inner: train vs val (on trainval_meta)
inner = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, val_idx = next(inner.split(trainval_meta, trainval_meta['stratify_key']))
train_meta = trainval_meta.iloc[train_idx].reset_index(drop=True)
val_meta   = trainval_meta.iloc[val_idx].reset_index(drop=True)

# 4c) Map back to the main DataFrame
df_train = df[df.debate_unit_id.isin(train_meta.debate_unit_id)].copy()
df_val   = df[df.debate_unit_id.isin(val_meta  .debate_unit_id)].copy()
df_test  = df[df.debate_unit_id.isin(test_meta .debate_unit_id)].copy()

# 4d) DEBUG: print distributions before tokenization/training
print("Train label distribution:", df_train['label_binary'].value_counts(normalize=True))
print("Val   label distribution:", df_val  ['label_binary'].value_counts(normalize=True))
print("Test  label distribution:", df_test ['label_binary'].value_counts(normalize=True))

print("Train avg span length:", df_train['span'].str.split().str.len().mean())
print("Val   avg span length:", df_val  ['span'].str.split().str.len().mean())
print("Test  avg span length:", df_test ['span'].str.split().str.len().mean())

print("Train gov dist:", df_train['is_government'].value_counts(normalize=True))
print("Val   gov dist:", df_val  ['is_government'].value_counts(normalize=True))
print("Test  gov dist:", df_test ['is_government'].value_counts(normalize=True))

# Check for overlapping debate_unit_id across all splits
train_ids = set(df_train['debate_unit_id'])
val_ids   = set(df_val  ['debate_unit_id'])
test_ids  = set(df_test ['debate_unit_id'])

overlap_train_val  = train_ids & val_ids
overlap_train_test = train_ids & test_ids
overlap_val_test   = val_ids   & test_ids

print(f"Overlap train/val IDs:  {len(overlap_train_val)}")
print(f"Overlap train/test IDs: {len(overlap_train_test)}")
print(f"Overlap val/test IDs:   {len(overlap_val_test)}")

if overlap_train_val:
    print("Example train–val overlap:", list(overlap_train_val)[:10])
if overlap_train_test:
    print("Example train–test overlap:", list(overlap_train_test)[:10])
if overlap_val_test:
    print("Example val–test overlap:",   list(overlap_val_test)[:10])


Train label distribution: label_binary
0    0.543478
1    0.456522
Name: proportion, dtype: float64
Val   label distribution: label_binary
0    0.526882
1    0.473118
Name: proportion, dtype: float64
Test  label distribution: label_binary
0    0.583333
1    0.416667
Name: proportion, dtype: float64
Train avg span length: 37.72010869565217
Val   avg span length: 40.43010752688172
Test  avg span length: 39.57575757575758
Train gov dist: is_government
False    0.692935
True     0.307065
Name: proportion, dtype: float64
Val   gov dist: is_government
False    0.677419
True     0.322581
Name: proportion, dtype: float64
Test  gov dist: is_government
False    0.689394
True     0.310606
Name: proportion, dtype: float64
Overlap train/val IDs:  0
Overlap train/test IDs: 0
Overlap val/test IDs:   0
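The zero-overlap result above follows from splitting at the `debate_unit_id` level and only then mapping back to rows. The idea can be sketched without any dependencies as below. This is a simplified illustration on made-up unit IDs and keys, and it does not reproduce how `StratifiedShuffleSplit` allocates samples internally; the helper name `stratified_group_split` is hypothetical.

```python
import random
from collections import defaultdict

def stratified_group_split(group_keys, test_size=0.2, seed=42):
    """Split group IDs into (train_ids, test_ids), stratified by each group's key.

    group_keys: dict mapping a group ID to its stratification key
    (toy stand-in for group_meta['stratify_key']).
    """
    rng = random.Random(seed)
    by_key = defaultdict(list)
    for gid, key in sorted(group_keys.items()):
        by_key[key].append(gid)

    train_ids, test_ids = set(), set()
    for key in sorted(by_key):
        ids = by_key[key]
        rng.shuffle(ids)
        # Take roughly test_size of each stratum, but at least one group.
        n_test = max(1, round(len(ids) * test_size))
        test_ids.update(ids[:n_test])
        train_ids.update(ids[n_test:])
    return train_ids, test_ids

# Toy data: 20 debate units with keys such as "True_short" (values are made up).
bins = ("short", "medium", "long")
toy_keys = {f"unit_{i}": f"{i % 2 == 0}_{bins[i % 3]}" for i in range(20)}

train_ids, test_ids = stratified_group_split(toy_keys)
print("overlap:", len(train_ids & test_ids))
print("train units:", len(train_ids), "test units:", len(test_ids))
```

Because every row of a unit inherits that unit's split, no `debate_unit_id` can appear on both sides, which is what the overlap check confirms.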


In [None]:
# ─── 4e) Save splits to disk ────────────────
df_train.to_csv("/content/df_train.csv", index=False)
df_val  .to_csv("/content/df_val.csv",   index=False)
df_test .to_csv("/content/df_test.csv",  index=False)
print("Saved df_train.csv, df_val.csv, df_test.csv to /content/")

In [None]:
# ─── 5) Tokenizer & tokenize fn ───────────────────

model_checkpoint = "./danish-bert-adapted/danish-bert-adapted" # PRETRAINED ALL model uploaded to disk
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize(batch):
    return tokenizer(batch["input_text"],
                     truncation=True,
                     padding="max_length",
                     max_length=512)


all_cols = ['input_text','label_binary']
hf_train = Dataset.from_pandas(df_train[all_cols].reset_index(drop=True))
hf_val   = Dataset.from_pandas(df_val  [all_cols].reset_index(drop=True))
hf_test  = Dataset.from_pandas(df_test [all_cols].reset_index(drop=True))

# map, rename, and rebind each split
hf_train = (
    hf_train
    .map(tokenize, batched=True)
    .rename_column("label_binary", "labels")
)
hf_val = (
    hf_val
    .map(tokenize, batched=True)
    .rename_column("label_binary", "labels")
)
hf_test = (
    hf_test
    .map(tokenize, batched=True)
    .rename_column("label_binary", "labels")
)

# now explicitly set the columns wanted returned
hf_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
hf_val  .set_format("torch", columns=["input_ids", "attention_mask", "labels"])
hf_test .set_format("torch", columns=["input_ids", "attention_mask", "labels"])


# ─── 7) Metrics & callback ────────────────────────

accuracy = evaluate.load("accuracy")
f1       = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1":       f1.compute(predictions=preds, references=labels, average="macro")["f1"]
    }

from transformers import TrainerCallback
class TrainMetricsCallback(TrainerCallback):
    def __init__(self, trainer=None):
        self.trainer = trainer

    # We use predict() here on the *training* set purely for logging.
    # WARNING: Hugging Face’s WandB integration will log these under "test/…"
    # even though this is training-data performance, not true test-set metrics.
    def on_epoch_end(self, args, state, control, **kwargs):
        if not self.trainer: return
        pred = self.trainer.predict(self.trainer.train_dataset)
        p = np.argmax(pred.predictions, axis=-1)
        l = pred.label_ids
        wandb.log({
            "train/accuracy": accuracy.compute(predictions=p, references=l)["accuracy"],
            "train/f1":       f1.compute(predictions=p, references=l, average="macro")["f1"],
            "train/loss":     pred.metrics["test_loss"],
            "epoch":          state.epoch
        })
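The tokenizer above pads every example to `max_length=512`, while the printed average span length is roughly 38 to 40 words. The sketch below estimates how much of each batch would then be padding, compared with padding each batch only to its longest member (the behaviour `DataCollatorWithPadding` in `transformers` provides). The token lengths are hypothetical and the helper `padding_fraction` is not part of the notebook; it only illustrates the arithmetic.

```python
def padding_fraction(token_lengths, max_length=512, dynamic=False, batch_size=4):
    """Return the fraction of token positions that are padding.

    dynamic=False: every sequence is padded to max_length.
    dynamic=True:  each batch is padded to its own longest sequence.
    """
    clipped = [min(n, max_length) for n in token_lengths]  # truncation=True
    if dynamic:
        total = 0
        for start in range(0, len(clipped), batch_size):
            batch = clipped[start:start + batch_size]
            total += max(batch) * len(batch)
    else:
        total = max_length * len(clipped)
    return 1 - sum(clipped) / total

# Hypothetical token counts for eight spans (not taken from the dataset).
lengths = [30, 45, 60, 52, 48, 70, 38, 41]

fixed = padding_fraction(lengths)
dyn = padding_fraction(lengths, dynamic=True)
print(f"fixed padding:   {fixed:.1%} of positions are padding")
print(f"dynamic padding: {dyn:.1%} of positions are padding")
```

Padded positions are masked out by `attention_mask`, so the choice should mainly affect speed and memory rather than predictions; that expectation was not measured in this notebook.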

In [None]:
# ─── 8) W&B sweep config ──────────────────────────

sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'eval/f1', 'goal': 'maximize'},
    'parameters': {
        'per_device_train_batch_size': {'values': [2,4,6,8,16]},
        'learning_rate':              {'min': 5e-6, 'max': 3e-5},
        'num_train_epochs':           {'values':[3,4,5,6,7,8]},
        'weight_decay_hyperparam':    {'min':0.01, 'max':0.25},
        'warmup_ratio':               {'values':[0.0,0.06,0.1]},
        # Adding these for some extra checks
        #'dropout':                    {'values':[0.1,0.2,0.3]},
        #'lr_scheduler_type':          {'values':['linear','cosine']}


    }
}

sweep_id = wandb.sweep(sweep_config,
                      project="danish-bert-answer-all-pretrained-all-binary") # before danish-bert-answer-all-pretrained-all-binary

# ─── 9) Sweep training fn (uses hf_val for eval) ───

def train_sweep():
    wandb.init()
    config = AutoModelForSequenceClassification.from_pretrained(
        model_checkpoint
    ).config
    config.num_labels = 2
    #config.hidden_dropout_prob          = wandb.config.dropout
    #config.attention_probs_dropout_prob = wandb.config.dropout

    args = TrainingArguments(
        output_dir="./results",
        eval_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="epoch",
        learning_rate=wandb.config.learning_rate,
        per_device_train_batch_size=wandb.config.per_device_train_batch_size,
        per_device_eval_batch_size=wandb.config.per_device_train_batch_size,
        num_train_epochs=wandb.config.num_train_epochs,
        weight_decay=wandb.config.weight_decay_hyperparam,
        warmup_ratio=wandb.config.warmup_ratio,
        #lr_scheduler_type      =wandb.config.lr_scheduler_type,

        load_best_model_at_end=True,
        #metric_for_best_model="f1",
        metric_for_best_model="eval_f1",
        greater_is_better = True,
        report_to=["wandb"]#,
        #remove_unused_columns=False
    )

    model = AutoModelForSequenceClassification.from_pretrained(
        model_checkpoint, config=config
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=hf_train,
        eval_dataset =hf_val,                 # <<< validation split
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(
                        early_stopping_patience=2)] #3 before
    )
    trainer.add_callback(TrainMetricsCallback(trainer=trainer))
    trainer.train()
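    # NOTE: every sweep run writes to this same directory, so after the sweep it
    # holds the model from the most recent run, not necessarily the sweep-wide best.
    trainer.save_model("./best_sweep_model")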

# Launch one agent that executes up to 30 sweep runs
wandb.agent(sweep_id, train_sweep, count=30)
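Each sweep run uses `EarlyStoppingCallback(early_stopping_patience=2)` on `eval_f1`. The patience rule can be sketched as a small simulation, shown below. This is a simplified model of the rule (stop once the metric has failed to improve on its best value for `patience` consecutive evaluations), not the library's implementation, and the F1 values are invented.

```python
def simulate_early_stopping(metric_per_epoch, patience=2, threshold=0.0):
    """Return (epoch_at_which_training_stops, epoch_of_best_metric).

    Assumes one evaluation per epoch and that higher is better.
    """
    best, best_epoch, bad_evals = None, None, 0
    for epoch, value in enumerate(metric_per_epoch, start=1):
        if best is None or value - best > threshold:
            best, best_epoch, bad_evals = value, epoch, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return epoch, best_epoch
    return len(metric_per_epoch), best_epoch

# Invented eval F1 curve: peaks at epoch 2, then fails to improve twice.
stopped_at, best_epoch = simulate_early_stopping([0.70, 0.75, 0.74, 0.73, 0.78])
print(f"stopped after epoch {stopped_at}, best epoch was {best_epoch}")
```

In this toy curve the fifth epoch would have scored higher, but it is never reached. That is the trade-off a small patience makes, and with `load_best_model_at_end=True` the epoch-2 checkpoint is the one restored.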


### A.1 Pretrained ALL: Load-in testing

In [None]:
# --- 10) Load the best sweep model and evaluate on hf_test (only if it was saved before the Colab runtime expired):
# from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification
# import numpy as np
# from sklearn.metrics import classification_report, confusion_matrix

# # 1) Loading the model artifact that our sweep saved
# best_model = AutoModelForSequenceClassification.from_pretrained("./best_sweep_model")

# # Let's take a look
# best_model

# --- If the runtime expired, identify the best run (highest eval/f1 with this initial modelling) directly from W&B and re-run with its hyperparameters
# (or, if the model directory was saved to disk, upload it to Colab in a folder)
import wandb

# 1) Login & point at sweep
wandb.login()
api      = wandb.Api()
project  = "pernillebrams/danish-bert-answer-all-pretrained-all-binary"
sweep_id = "xrjzmicm"
sweep    = api.sweep(f"{project}/{sweep_id}")

# 2) Find the run with the highest eval/f1 (single criterion)
best_run = max(
    (run for run in sweep.runs if run.summary.get("eval/f1") is not None),
    key=lambda r: r.summary["eval/f1"]
)
best_f1 = best_run.summary["eval/f1"]
print(f"Selected run: {best_run.id} (name={best_run.name}), eval/f1 = {best_f1:.4f}")

# 3) Extract its hyperparameters
hp_keys = [
    "learning_rate",
    "per_device_train_batch_size",
    "num_train_epochs",
    "weight_decay_hyperparam",
    "warmup_ratio",
    # add "lr_scheduler_type" or "dropout" here if needed (for followups)
]
best_hp = {k: best_run.config[k] for k in hp_keys}
print("Best hyperparameters extracted:", best_hp)


Selected run: jxlomuhr (name=northern-sweep-24), eval/f1 = 0.7843
Best hyperparameters extracted: {'learning_rate': 2.4769523137950903e-05, 'per_device_train_batch_size': 4, 'num_train_epochs': 4, 'weight_decay_hyperparam': 0.017479036752271887, 'warmup_ratio': 0}
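The mono-criterion selection above reduces to "ignore runs without the metric, then take the maximum". A minimal sketch on plain dictionaries makes that behaviour easy to check offline; the run IDs and scores below are made up, and `pick_best_run` is a hypothetical helper rather than part of the W&B API.

```python
def pick_best_run(runs, metric="eval/f1"):
    """Return the run with the highest summary value for `metric`.

    Runs that never logged the metric (e.g. crashed runs) are skipped.
    """
    scored = [r for r in runs if r["summary"].get(metric) is not None]
    if not scored:
        raise ValueError(f"no run has logged {metric!r}")
    return max(scored, key=lambda r: r["summary"][metric])

# Made-up stand-ins for sweep.runs; only the fields used above are modelled.
runs = [
    {"id": "run_a", "summary": {"eval/f1": 0.71}},
    {"id": "run_b", "summary": {}},                  # crashed before evaluating
    {"id": "run_c", "summary": {"eval/f1": 0.78}},
]

best = pick_best_run(runs)
print(best["id"], best["summary"]["eval/f1"])
```

One caveat worth keeping in mind: as far as I know, a W&B run summary keeps the last logged value of a metric by default, so this ranks runs by their final logged `eval/f1` rather than by their per-run maximum.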


In [None]:
# Display the run object; the plots for this run are available on its W&B page
best_run

In [None]:
# ─── 8) Evaluate on hf_test ────────────────────────────────────────────────────
import numpy as np
import pandas as pd
import os
from sklearn.metrics import classification_report, confusion_matrix
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# 8a) Load the model
model = AutoModelForSequenceClassification.from_pretrained("./best_sweep_model") # uploaded from disk

# 8b) Prepare the Trainer for evaluation
test_args = TrainingArguments(
    output_dir                ="./results_test_pretrained_all_load_in",
    per_device_eval_batch_size=16,
    report_to=[]
)
test_trainer = Trainer(
    model           = model,
    args            = test_args,
    compute_metrics = compute_metrics  # as defined earlier
)

# 8c) Run prediction → returns PredictionOutput
test_output = test_trainer.predict(hf_test)
print("→ Final Test metrics:", test_output.metrics)

# ensure output directory exists
os.makedirs(test_args.output_dir, exist_ok=True)

# 8d) Save the JSON of all test metrics
test_trainer.save_metrics("test", test_output.metrics)
print(f"Saved all_results.json in {test_args.output_dir}")

# 8e) Save raw predictions
preds = test_output.predictions
np.save(os.path.join(test_args.output_dir, "test_predictions.npy"), preds)
print(f"Saved test_predictions.npy in {test_args.output_dir}")

# 8f) Classification report → CSV
y_true = df_test["label_binary"].to_numpy()
y_pred = np.argmax(preds, axis=-1)
clf_dict = classification_report(y_true, y_pred, output_dict=True)
print(classification_report(y_true, y_pred))

pd.DataFrame(clf_dict).T.to_csv(
    os.path.join(test_args.output_dir, "classification_report_pretrained_all_load_in.csv")
)
print("Saved classification_report_pretrained_all_load_in.csv")

# 8g) Confusion matrix → CSV
cm = confusion_matrix(y_true, y_pred)
print("Confusion matrix:\n", cm)
cm_df = pd.DataFrame(
    cm,
    index=["true_0", "true_1"],
    columns=["pred_0", "pred_1"]
)
cm_df.to_csv(
    os.path.join(test_args.output_dir, "confusion_matrix_all_load_in.csv")
)
print("Saved confusion_matrix_all_load_in.csv")


→ Final Test metrics: {'test_loss': 0.5250300765037537, 'test_model_preparation_time': 0.005, 'test_accuracy': 0.7424242424242424, 'test_f1': 0.728395061728395, 'test_runtime': 4.5192, 'test_samples_per_second': 29.209, 'test_steps_per_second': 1.992}
Saved all_results.json in ./results_test_pretrained_all_load_in
Saved test_predictions.npy in ./results_test_pretrained_all_load_in
              precision    recall  f1-score   support

           0       0.75      0.83      0.79        77
           1       0.72      0.62      0.67        55

    accuracy                           0.74       132
   macro avg       0.74      0.72      0.73       132
weighted avg       0.74      0.74      0.74       132

Saved classification_report_pretrained_all_load_in.csv
Confusion matrix:
 [[64 13]
 [21 34]]
Saved confusion_matrix_all_load_in.csv
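The reported `test_accuracy` and `test_f1` can be re-derived from the confusion matrix printed above, which is a useful sanity check that `test_f1` is indeed the macro average. The sketch below uses only the four counts from that matrix; the helper name `per_class_f1` is mine.

```python
def per_class_f1(cm):
    """F1 per class from a square confusion matrix (rows = true, cols = predicted)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # true k, predicted something else
        fp = sum(cm[i][k] for i in range(n)) - tp   # predicted k, true something else
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

cm = [[64, 13],
      [21, 34]]

f1s = per_class_f1(cm)
macro_f1 = sum(f1s) / len(f1s)
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

print("per-class F1:", [round(x, 4) for x in f1s])
print("macro F1:", round(macro_f1, 6))
print("accuracy:", round(accuracy, 6))
```

The per-class values round to the 0.79 and 0.67 shown in the classification report, and their unweighted mean matches the logged `test_f1`.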


### A.2 Pretrained ALL: Retraining

In [None]:
# Checking the checkpoint
model_checkpoint

'./danish-bert-adapted/danish-bert-adapted'

In [None]:
from transformers import (
    Trainer,
    TrainingArguments,
    TrainerCallback,
    AutoConfig,
    AutoModelForSequenceClassification
)
import numpy as np
import pandas as pd
from datasets import concatenate_datasets
from sklearn.metrics import accuracy_score, f1_score

# 1) Combine train + val
hf_trainval = concatenate_datasets([hf_train, hf_val])

# 2) Use the same best_hp from sweep‐selection step
# e.g. best_hp = { "learning_rate": ..., "per_device_train_batch_size": ..., ... }
print("Re-training with hyperparameters:", best_hp)

# 3) Initialize a fresh model from the original pretrained_ALL checkpoint (defined earlier)
config = AutoConfig.from_pretrained(model_checkpoint, num_labels=2)
model  = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, config=config)

# 4) Define a training‐only metrics function
def compute_metrics_train(p):
    preds = np.argmax(p.predictions, axis=1)
    return {
        "train_accuracy": accuracy_score(p.label_ids, preds),
        "train_f1":       f1_score(p.label_ids, preds, average="weighted"),
        # Note: train_f1 is weighted-averaged here, whereas compute_metrics uses macro F1.
        # train_loss is added by trainer.evaluate(..., metric_key_prefix="train").
    }

# 5) Callback to log per‐epoch training metrics
class TrainLoggingCallback(TrainerCallback):
    def __init__(self):
        self.history = []
    def on_epoch_end(self, args, state, control, **kwargs):
        m = self.trainer.evaluate(self.trainer.train_dataset, metric_key_prefix="train")
        self.history.append(dict(epoch=state.epoch, **m))
        print(
            f"Epoch {int(state.epoch)} | "
            f"loss {m['train_loss']:.4f} "
            f"acc  {m['train_accuracy']:.4f} "
            f"f1   {m['train_f1']:.4f}"
        )

cb = TrainLoggingCallback()

# 6) Set up and run Trainer without any eval_dataset
trainer = Trainer(
    model           = model,
    args            = TrainingArguments(
        output_dir                 = "./best_final_retrained_pretrained_all_model",
        learning_rate              = best_hp["learning_rate"],
        per_device_train_batch_size= best_hp["per_device_train_batch_size"],
        num_train_epochs           = best_hp["num_train_epochs"],
        weight_decay               = best_hp["weight_decay_hyperparam"],
        warmup_ratio               = best_hp["warmup_ratio"],
        #lr_scheduler_type          = best_hp.get("lr_scheduler_type","linear"), # not used atm

        eval_strategy              = "no",    # skip validation
        logging_strategy           = "no",
        save_strategy              = "epoch",
        report_to                  = []       # turn off W&B
    ),
    train_dataset   = hf_trainval,
    compute_metrics = compute_metrics_train,
    callbacks       = [cb]
)
cb.trainer = trainer

# Note: the live progress table shows a "Validation Loss" column; that is the
# Trainer's generic header. The values come from the callback's evaluation on
# the training set, so they are train loss.
trainer.train()
trainer.save_model("./best_final_retrained_pretrained_all_model")  # writes config.json, the model weights, etc.

# 7) Inspect & save the training history
train_df = pd.DataFrame(cb.history).set_index("epoch")
train_df.to_csv("train_history_pretrained_all_retrained.csv")
print(train_df)




Re-training with hyperparameters: {'learning_rate': 2.4769523137950903e-05, 'per_device_train_batch_size': 4, 'num_train_epochs': 4, 'weight_decay_hyperparam': 0.017479036752271887, 'warmup_ratio': 0}


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./danish-bert-adapted/danish-bert-adapted and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss,Validation Loss,Accuracy,F1
116,No log,0.355092,0.895879,0.8957
232,No log,0.185183,0.930586,0.930697
348,No log,0.059594,0.97397,0.973978
464,No log,0.033288,0.986985,0.98698


Epoch 1 | loss 0.3551 acc  0.8959 f1   0.8957
Epoch 2 | loss 0.1852 acc  0.9306 f1   0.9307
Epoch 3 | loss 0.0596 acc  0.9740 f1   0.9740
Epoch 4 | loss 0.0333 acc  0.9870 f1   0.9870
       train_accuracy  train_f1  train_loss
epoch                                      
1.0          0.895879  0.895700    0.355092
2.0          0.930586  0.930697    0.185183
3.0          0.973970  0.973978    0.059594
4.0          0.986985  0.986980    0.033288
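The step counts in the progress table (116, 232, 348, 464) follow from the dataset size and batch size. The sketch below reproduces them under two assumptions that the notebook does not state explicitly: a single device with no gradient accumulation, and a combined train + val size of 461 rows. The 461 is inferred from the outputs (for example, the epoch-1 accuracy 0.895879 equals 413/461), not printed anywhere. The helper `schedule` is hypothetical.

```python
import math

def schedule(n_examples, batch_size, epochs, warmup_ratio):
    """Return (steps_per_epoch, total_steps, warmup_steps).

    Assumes one device and gradient_accumulation_steps == 1.
    """
    steps_per_epoch = math.ceil(n_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return steps_per_epoch, total_steps, warmup_steps

# Best hyperparameters from the sweep: batch size 4, 4 epochs, warmup_ratio 0.
per_epoch, total, warmup = schedule(n_examples=461, batch_size=4, epochs=4, warmup_ratio=0)
print(per_epoch, total, warmup)

# The same setup with the other warmup ratios from the sweep space.
for ratio in (0.06, 0.1):
    print(ratio, schedule(461, 4, 4, ratio)[2])
```

With `warmup_ratio = 0` the learning rate starts at its peak value from the first step, so for this run the warmup setting has no effect.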


In [None]:
# ─── 8) Evaluate Retrained Model on hf_test ─────────────────────────
import os
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# 8a) Load the retrained checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    "./best_final_retrained_pretrained_all_model"
)

# 8b) Prepare the Trainer for evaluation
test_args = TrainingArguments(
    output_dir                ="./results_test_pretrained_all_retrained",
    per_device_eval_batch_size=16,
    report_to=[]
)
test_trainer = Trainer(
    model           = model,
    args            = test_args,
    compute_metrics = compute_metrics    # the original compute_metrics fn
)

# 8c) Run prediction → returns PredictionOutput
test_output = test_trainer.predict(hf_test)
print("→ Final Retrained Test metrics:", test_output.metrics)

# ensure output directory exists
os.makedirs(test_args.output_dir, exist_ok=True)

# 8d) Save the JSON of all test metrics
test_trainer.save_metrics("test", test_output.metrics)
print(f"Saved all_results.json in {test_args.output_dir}")

# 8e) Save raw predictions
preds = test_output.predictions
np.save(os.path.join(test_args.output_dir, "test_predictions.npy"), preds)
print(f"Saved test_predictions.npy in {test_args.output_dir}")

# 8f) Classification report → CSV
y_true = df_test["label_binary"].to_numpy()
y_pred = np.argmax(preds, axis=-1)
print(classification_report(y_true, y_pred))
clf_dict = classification_report(y_true, y_pred, output_dict=True)
pd.DataFrame(clf_dict).T.to_csv(
    os.path.join(test_args.output_dir, "classification_report_pretrained_all_retrained.csv")
)
print("Saved classification_report_pretrained_all_retrained.csv")

# 8g) Confusion matrix → CSV
cm = confusion_matrix(y_true, y_pred)
print("Confusion matrix:\n", cm)
cm_df = pd.DataFrame(
    cm,
    index  =["true_0", "true_1"],
    columns=["pred_0", "pred_1"]
)
cm_df.to_csv(
    os.path.join(test_args.output_dir, "confusion_matrix_pretrained_all_retrained_.csv")
)
print("Saved confusion_matrix_pretrained_all_retrained_.csv")


→ Final Retrained Test metrics: {'test_loss': 0.7074925303459167, 'test_model_preparation_time': 0.0029, 'test_accuracy': 0.7727272727272727, 'test_f1': 0.7673872180451128, 'test_runtime': 4.5043, 'test_samples_per_second': 29.305, 'test_steps_per_second': 1.998}
Saved all_results.json in ./results_test_pretrained_all_retrained
Saved test_predictions.npy in ./results_test_pretrained_all_retrained
              precision    recall  f1-score   support

           0       0.81      0.79      0.80        77
           1       0.72      0.75      0.73        55

    accuracy                           0.77       132
   macro avg       0.77      0.77      0.77       132
weighted avg       0.77      0.77      0.77       132

Saved classification_report_pretrained_all_retrained.csv
Confusion matrix:
 [[61 16]
 [14 41]]
Saved confusion_matrix_pretrained_all_retrained_.csv
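The retraining cell tracks a weighted-average F1 on the training data, while the test evaluation reports macro F1, so the two numbers are not directly comparable. The sketch below computes both averages from the retrained model's confusion matrix above to show how far apart they are on this test set. It is a minimal illustration for the two-class case; the function name `macro_and_weighted_f1` is mine.

```python
def macro_and_weighted_f1(cm):
    """Return (macro_f1, weighted_f1) for a 2x2 matrix (rows = true, cols = predicted)."""
    (tn, fp), (fn, tp) = cm
    f1_neg = 2 * tn / (2 * tn + fn + fp)   # class 0 treated as the "positive" class
    f1_pos = 2 * tp / (2 * tp + fp + fn)   # class 1
    support_neg, support_pos = tn + fp, fn + tp
    macro = (f1_neg + f1_pos) / 2
    weighted = (support_neg * f1_neg + support_pos * f1_pos) / (support_neg + support_pos)
    return macro, weighted

cm_retrained = [[61, 16],
                [14, 41]]

macro, weighted = macro_and_weighted_f1(cm_retrained)
print(f"macro F1:    {macro:.6f}")
print(f"weighted F1: {weighted:.6f}")
```

The weighted average leans toward the majority class (77 of 132 test rows are label 0), which is why it comes out slightly higher than the macro value here.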


## B. PLDQA

In [None]:
!pip install "numpy<2.0" # run this and restart the kernel if Colab throws a numpy vs. datasets compatibility error

In [None]:
!pip install transformers datasets evaluate wandb -q

import pandas as pd
import torch
import numpy as np
from datasets import Dataset
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments, EarlyStoppingCallback
)
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedShuffleSplit
import evaluate
import wandb

print("💻 Device in use:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")

# The splits were constructed and saved in section A, so here we just load them from disk and re-run the debug checks
df_train = pd.read_csv("/content/df_train.csv")
df_val   = pd.read_csv("/content/df_val.csv")
df_test  = pd.read_csv("/content/df_test.csv")

# 4d) DEBUG: print distributions before tokenization/training
print("Train label distribution:", df_train['label_binary'].value_counts(normalize=True))
print("Val   label distribution:", df_val  ['label_binary'].value_counts(normalize=True))
print("Test  label distribution:", df_test ['label_binary'].value_counts(normalize=True))

print("Train avg span length:", df_train['span'].str.split().str.len().mean())
print("Val   avg span length:", df_val  ['span'].str.split().str.len().mean())
print("Test  avg span length:", df_test ['span'].str.split().str.len().mean())

print("Train gov dist:", df_train['is_government'].value_counts(normalize=True))
print("Val   gov dist:", df_val  ['is_government'].value_counts(normalize=True))
print("Test  gov dist:", df_test ['is_government'].value_counts(normalize=True))

# Check for overlapping debate_unit_id across all splits
train_ids = set(df_train['debate_unit_id'])
val_ids   = set(df_val  ['debate_unit_id'])
test_ids  = set(df_test ['debate_unit_id'])

overlap_train_val  = train_ids & val_ids
overlap_train_test = train_ids & test_ids
overlap_val_test   = val_ids   & test_ids

print(f"Overlap train/val IDs:  {len(overlap_train_val)}")
print(f"Overlap train/test IDs: {len(overlap_train_test)}")
print(f"Overlap val/test IDs:   {len(overlap_val_test)}")

if overlap_train_val:
    print("Example train–val overlap:", list(overlap_train_val)[:10])
if overlap_train_test:
    print("Example train–test overlap:", list(overlap_train_test)[:10])
if overlap_val_test:
    print("Example val–test overlap:",   list(overlap_val_test)[:10])


Train label distribution: label_binary
0    0.543478
1    0.456522
Name: proportion, dtype: float64
Val   label distribution: label_binary
0    0.526882
1    0.473118
Name: proportion, dtype: float64
Test  label distribution: label_binary
0    0.583333
1    0.416667
Name: proportion, dtype: float64
Train avg span length: 37.72010869565217
Val   avg span length: 40.43010752688172
Test  avg span length: 39.57575757575758
Train gov dist: is_government
False    0.692935
True     0.307065
Name: proportion, dtype: float64
Val   gov dist: is_government
False    0.677419
True     0.322581
Name: proportion, dtype: float64
Test  gov dist: is_government
False    0.689394
True     0.310606
Name: proportion, dtype: float64
Overlap train/val IDs:  0
Overlap train/test IDs: 0
Overlap val/test IDs:   0


In [None]:
# ─── 5) Tokenizer & tokenize fn ───────────────────

model_checkpoint = "./danish-bert-adapted-pldqa/danish-bert-adapted-pldqa" # PRETRAINED PLDQA model uploaded to disk
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize(batch):
    return tokenizer(batch["input_text"],
                     truncation=True,
                     padding="max_length",
                     max_length=512)


all_cols = ['input_text','label_binary']
hf_train = Dataset.from_pandas(df_train[all_cols].reset_index(drop=True))
hf_val   = Dataset.from_pandas(df_val  [all_cols].reset_index(drop=True))
hf_test  = Dataset.from_pandas(df_test [all_cols].reset_index(drop=True))

# map, rename, and rebind each split
hf_train = (
    hf_train
    .map(tokenize, batched=True)
    .rename_column("label_binary", "labels")
)
hf_val = (
    hf_val
    .map(tokenize, batched=True)
    .rename_column("label_binary", "labels")
)
hf_test = (
    hf_test
    .map(tokenize, batched=True)
    .rename_column("label_binary", "labels")
)

# now explicitly set the columns wanted returned
hf_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
hf_val  .set_format("torch", columns=["input_ids", "attention_mask", "labels"])
hf_test .set_format("torch", columns=["input_ids", "attention_mask", "labels"])


# ─── 7) Metrics & callback ────────────────────────

accuracy = evaluate.load("accuracy")
f1       = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1":       f1.compute(predictions=preds, references=labels, average="macro")["f1"]
    }

from transformers import TrainerCallback
class TrainMetricsCallback(TrainerCallback):
    def __init__(self, trainer=None):
        self.trainer = trainer

    # We use predict() here on the *training* set purely for logging.
    # WARNING: Hugging Face’s WandB integration will log these under "test/…"
    # even though this is training-data performance, not true test-set metrics.
    def on_epoch_end(self, args, state, control, **kwargs):
        if not self.trainer: return
        pred = self.trainer.predict(self.trainer.train_dataset)
        p = np.argmax(pred.predictions, axis=-1)
        l = pred.label_ids
        wandb.log({
            "train/accuracy": accuracy.compute(predictions=p, references=l)["accuracy"],
            "train/f1":       f1.compute(predictions=p, references=l, average="macro")["f1"],
            "train/loss":     pred.metrics["test_loss"],
            "epoch":          state.epoch
        })

In [None]:
# ─── 8) W&B sweep config ──────────────────────────

sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'eval/f1', 'goal': 'maximize'},
    'parameters': {
        'per_device_train_batch_size': {'values': [2,4,6,8,16]},
        'learning_rate':              {'min': 5e-6, 'max': 3e-5},
        'num_train_epochs':           {'values':[3,4,5,6,7,8]},
        'weight_decay_hyperparam':    {'min':0.01, 'max':0.25},
        'warmup_ratio':               {'values':[0.0,0.06,0.1]},
        # Adding these for some extra checks
        #'dropout':                    {'values':[0.1,0.2,0.3]},
        #'lr_scheduler_type':          {'values':['linear','cosine']}


    }
}

sweep_id = wandb.sweep(sweep_config,
                      project="danish-bert-answer-pldqa-pretrained-pldqa-binary")

# ─── 9) Sweep training fn (uses hf_val for eval) ───
def train_sweep():
    wandb.init()
    config = AutoModelForSequenceClassification.from_pretrained(
        model_checkpoint
    ).config
    config.num_labels = 2
    #config.hidden_dropout_prob          = wandb.config.dropout
    #config.attention_probs_dropout_prob = wandb.config.dropout

    args = TrainingArguments(
        output_dir="./results",
        eval_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="epoch",
        learning_rate=wandb.config.learning_rate,
        per_device_train_batch_size=wandb.config.per_device_train_batch_size,
        per_device_eval_batch_size=wandb.config.per_device_train_batch_size,
        num_train_epochs=wandb.config.num_train_epochs,
        weight_decay=wandb.config.weight_decay_hyperparam,
        warmup_ratio=wandb.config.warmup_ratio,
        #lr_scheduler_type      =wandb.config.lr_scheduler_type,

        load_best_model_at_end=True,
        #metric_for_best_model="f1",
        metric_for_best_model="eval_f1",
        greater_is_better = True,
        report_to=["wandb"]#,
        #remove_unused_columns=False
    )

    model = AutoModelForSequenceClassification.from_pretrained(
        model_checkpoint, config=config
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=hf_train,
        eval_dataset =hf_val,                 # <<< validation split
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(
                        early_stopping_patience=2)] #3 before
    )
    trainer.add_callback(TrainMetricsCallback(trainer=trainer))
    trainer.train()
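    # NOTE: every trial writes to this same directory, so after the sweep it holds
    # the model from the last trial that ran, not necessarily the best one.
    trainer.save_model("./best_sweep_model")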

# Run 30 sweep trials with a single agent
wandb.agent(sweep_id, train_sweep, count=30)
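
Because every trial in `train_sweep` saves to the same `./best_sweep_model` path, that folder is overwritten on each trial. One possible way around this, sketched below, is to derive a separate directory per trial from the W&B run ID. The helper name `trial_output_dir`, the `./sweep_models` base folder, and the example run ID are illustrative assumptions, not part of the notebook.

```python
import os


def trial_output_dir(run_id: str, base_dir: str = "./sweep_models") -> str:
    """Return a directory path that is unique to one sweep trial.

    `run_id` is expected to be the W&B run ID (e.g. `wandb.run.id` inside
    `train_sweep`); `base_dir` is a hypothetical parent folder.
    """
    if not run_id:
        raise ValueError("run_id must be a non-empty string")
    return os.path.join(base_dir, f"run-{run_id}")


# Made-up run ID, only to show the shape of the resulting path
print(trial_output_dir("abc123xy"))
```

Inside `train_sweep`, a call such as `trainer.save_model(trial_output_dir(wandb.run.id))` would then keep one model folder per trial, at the cost of more disk usage on Colab.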


### B.1 PLDQA: Load-in testing

In [None]:
# --- 10) Load the best sweep model and evaluate on hf_test (only works if the model was saved before the Colab runtime ran out):
# from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification
# import numpy as np
# from sklearn.metrics import classification_report, confusion_matrix

# # 1) Load the model artifact that the sweep saved
# best_model = AutoModelForSequenceClassification.from_pretrained("./best_sweep_model")

# # Let's take a look
# best_model

# --- If the runtime ran out, fetch the best run (highest eval/f1 in this initial modelling round) directly from W&B and rerun it with the sweep code
# (or, if the model was saved to disk, upload its folder to Colab)
import wandb

# 1) Login & point at sweep
wandb.login()
api      = wandb.Api()
project  = "pernillebrams/danish-bert-answer-pldqa-pretrained-pldqa-binary"
sweep_id = "8ouy284v"
sweep    = api.sweep(f"{project}/{sweep_id}")

# 2) Find the run with the highest eval/f1 (single criterion)
best_run = max(
    (run for run in sweep.runs if run.summary.get("eval/f1") is not None),
    key=lambda r: r.summary["eval/f1"]
)
best_f1 = best_run.summary["eval/f1"]
print(f"Selected run: {best_run.id} (name={best_run.name}), eval/f1 = {best_f1:.4f}")

# 3) Extract its hyperparameters
hp_keys = [
    "learning_rate",
    "per_device_train_batch_size",
    "num_train_epochs",
    "weight_decay_hyperparam",
    "warmup_ratio",
    # add "lr_scheduler_type" or "dropout" here if needed (for followups)
]
best_hp = {k: best_run.config[k] for k in hp_keys}
print("Best hyperparameters extracted:", best_hp)


Selected run: pglmcxv7 (name=fallen-sweep-17), eval/f1 = 0.7953
Best hyperparameters extracted: {'learning_rate': 2.837463202056722e-05, 'per_device_train_batch_size': 6, 'num_train_epochs': 6, 'weight_decay_hyperparam': 0.15008810256010646, 'warmup_ratio': 0.06}
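
The sweep stores weight decay under the custom key `weight_decay_hyperparam`, whereas `TrainingArguments` expects `weight_decay`, and the training batch size is reused for evaluation. The mapping can be sketched as below. The helper `to_training_kwargs` is hypothetical, and the example values are rounded from the output above.

```python
def to_training_kwargs(best_hp: dict) -> dict:
    """Map sweep hyperparameter names onto TrainingArguments keyword names."""
    required = [
        "learning_rate",
        "per_device_train_batch_size",
        "num_train_epochs",
        "weight_decay_hyperparam",
        "warmup_ratio",
    ]
    missing = [k for k in required if k not in best_hp]
    if missing:
        raise KeyError(f"missing hyperparameters: {missing}")
    return {
        "learning_rate": best_hp["learning_rate"],
        "per_device_train_batch_size": best_hp["per_device_train_batch_size"],
        # The notebook uses the training batch size for evaluation as well.
        "per_device_eval_batch_size": best_hp["per_device_train_batch_size"],
        "num_train_epochs": best_hp["num_train_epochs"],
        # Sweep key -> TrainingArguments name
        "weight_decay": best_hp["weight_decay_hyperparam"],
        "warmup_ratio": best_hp["warmup_ratio"],
    }


# Rounded copy of the hyperparameters printed above (illustrative only)
example_hp = {
    "learning_rate": 2.84e-05,
    "per_device_train_batch_size": 6,
    "num_train_epochs": 6,
    "weight_decay_hyperparam": 0.15,
    "warmup_ratio": 0.06,
}
kwargs = to_training_kwargs(example_hp)
print(kwargs)
```

The result could be unpacked as `TrainingArguments(output_dir=..., **kwargs)`, which keeps the renaming in one place instead of repeating it in every rerun cell.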


In [None]:
# Plots for it here directly on the WandB page
best_run

In [None]:
# The model files for this run were not saved to disk properly, so the best trial is rerun with its exact hyperparameters to regenerate them

# Check the checkpoint
model_checkpoint

'./danish-bert-adapted-pldqa/danish-bert-adapted-pldqa'

In [None]:
# Preparing and rerunning the exact sweep trial
import wandb, numpy as np, evaluate
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    EarlyStoppingCallback
)

# ─── 1) Compute‐metrics fn ─────────────────────────────────────────────────
accuracy = evaluate.load("accuracy")
f1       = evaluate.load("f1")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
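    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        # The key already starts with "eval_", so Trainer leaves it as "eval_f1" during
        # evaluation; under predict() it shows up as "test_eval_f1".
        "eval_f1":  f1.compute(predictions=preds, references=labels, average="macro")["f1"]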
    }

# ─── 2) Start a new W&B run with the best_hp ───────────────────────────────
wandb.init(
    project="danish-bert-answer-pldqa-pretrained-pldqa-binary",
    entity="pernillebrams",
    name="manual-rerun-best-pretrained-pldqa",
    config=best_hp,
    reinit=True
)

# ─── 3) Build the model & args ──────────────────────────────────────────────
config = AutoModelForSequenceClassification.from_pretrained(model_checkpoint).config
config.num_labels = 2
model  = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, config=config)

args = TrainingArguments(
    output_dir                ="./best_rerun_model_pldqa",
    learning_rate             = best_hp["learning_rate"],
    per_device_train_batch_size=best_hp["per_device_train_batch_size"],
    per_device_eval_batch_size =best_hp["per_device_train_batch_size"],
    num_train_epochs          = best_hp["num_train_epochs"],
    weight_decay              = best_hp["weight_decay_hyperparam"],
    warmup_ratio              = best_hp["warmup_ratio"],
    #lr_scheduler_type         = best_hp.get("lr_scheduler_type","linear"),

    eval_strategy             ="epoch",
    save_strategy             ="epoch",
    logging_strategy          ="epoch",
    load_best_model_at_end    = True,
    metric_for_best_model     = "eval_f1",
    greater_is_better         = True,

    report_to                 = ["wandb"]
)

# ─── 4) Trainer with F₁‐only EarlyStopping ──────────────────────────────────
trainer = Trainer(
    model           = model,
    args            = args,
    train_dataset   = hf_train,
    eval_dataset    = hf_val,
    compute_metrics = compute_metrics,
    callbacks       = [EarlyStoppingCallback(early_stopping_patience=2)]
)

# ─── 5) Train & Save ────────────────────────────────────────────────────────
trainer.train()
trainer.save_model("./best_rerun_model_pldqa")

wandb.finish()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./danish-bert-adapted-pldqa/danish-bert-adapted-pldqa and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./danish-bert-adapted-pldqa/danish-bert-adapted-pldqa and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.6606,0.598657,0.660088,0.677419
2,0.4459,0.585015,0.698378,0.709677
3,0.2146,0.673058,0.763194,0.763441
4,0.0419,0.858233,0.762093,0.763441
5,0.0134,0.958464,0.741187,0.741935


Metric,History
eval/accuracy,▁▄██▆
eval/f1,▁▄██▇
eval/loss,▁▁▃▆█
eval/runtime,▁█▃▄▆
eval/samples_per_second,█▁▆▅▃
eval/steps_per_second,█▁▆▅▃
train/epoch,▁▁▃▃▅▅▆▆███
train/global_step,▁▁▃▃▅▅▆▆███
train/grad_norm,▃█▂▁▁
train/learning_rate,█▆▄▃▁

Metric,Final value
eval/accuracy,0.74194
eval/f1,0.74119
eval/loss,0.95846
eval/runtime,3.4611
eval/samples_per_second,26.87
eval/steps_per_second,4.623
total_flos,484124341862400.0
train/epoch,5.0
train/global_step,310.0
train/grad_norm,0.01151
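
The training log above ends after epoch 5 even though `num_train_epochs` is 6, which is consistent with `EarlyStoppingCallback(early_stopping_patience=2)`: validation F1 peaks at epoch 3 and then fails to improve twice. The following is a simplified pure-Python replay of the patience rule, not the Hugging Face implementation; the function name `simulate_early_stopping` is made up for illustration.

```python
def simulate_early_stopping(scores, patience, threshold=0.0):
    """Replay patience-based early stopping on per-epoch scores (higher is better).

    Returns (last_epoch_run, best_epoch), both 1-indexed.
    """
    best_score = None
    best_epoch = 0
    bad_epochs = 0
    for epoch, score in enumerate(scores, start=1):
        if best_score is None or score > best_score + threshold:
            # New best: remember it and reset the patience counter
            best_score, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
        if bad_epochs >= patience:
            return epoch, best_epoch
    return len(scores), best_epoch


# Validation F1 per epoch, copied from the training log above
f1_per_epoch = [0.660088, 0.698378, 0.763194, 0.762093, 0.741187]
stopped_at, best_epoch = simulate_early_stopping(f1_per_epoch, patience=2)
print(stopped_at, best_epoch)  # → 5 3
```

Note that the run summary above reports the last epoch's `eval/f1` (0.74119), while `load_best_model_at_end=True` restores the checkpoint from the best epoch (epoch 3, F1 0.763194).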


In [None]:
# ─── 8) Evaluate on hf_test ────────────────────────────────────────────────────
import numpy as np
import pandas as pd
import os
from sklearn.metrics import classification_report, confusion_matrix
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

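# 8a) Load the model. Note: this reads the ./best_sweep_model folder (uploaded from disk),
#     not the ./best_rerun_model_pldqa directory written by the rerun above.
model = AutoModelForSequenceClassification.from_pretrained("./best_sweep_model")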

# 8b) Prepare the Trainer for evaluation
test_args = TrainingArguments(
    output_dir                ="./results_test_pretrained_pldqa_load_in",
    per_device_eval_batch_size=16,
    report_to=[]
)
test_trainer = Trainer(
    model           = model,
    args            = test_args,
    compute_metrics = compute_metrics  # as defined earlier
)

# 8c) Run prediction → returns PredictionOutput
test_output = test_trainer.predict(hf_test)
print("→ Final Test metrics:", test_output.metrics)

# ensure output directory exists
os.makedirs(test_args.output_dir, exist_ok=True)

# 8d) Save the JSON of all test metrics
test_trainer.save_metrics("test", test_output.metrics)
print(f"Saved all_results.json in {test_args.output_dir}")

# 8e) Save raw predictions
preds = test_output.predictions
np.save(os.path.join(test_args.output_dir, "test_predictions.npy"), preds)
print(f"Saved test_predictions.npy in {test_args.output_dir}")

# 8f) Classification report → CSV
y_true = df_test["label_binary"].to_numpy()
y_pred = np.argmax(preds, axis=-1)
clf_dict = classification_report(y_true, y_pred, output_dict=True)
print(classification_report(y_true, y_pred))

pd.DataFrame(clf_dict).T.to_csv(
    os.path.join(test_args.output_dir, "classification_report_pretrained_pldqa_load_in.csv")
)
print("Saved classification_report_pretrained_pldqa_load_in.csv")

# 8g) Confusion matrix → CSV
cm = confusion_matrix(y_true, y_pred)
print("Confusion matrix:\n", cm)
cm_df = pd.DataFrame(
    cm,
    index=["true_0", "true_1"],
    columns=["pred_0", "pred_1"]
)
cm_df.to_csv(
    os.path.join(test_args.output_dir, "confusion_matrix_pldqa_load_in.csv")
)
print("Saved confusion_matrix_pldqa_load_in.csv")


→ Final Test metrics: {'test_loss': 0.7684226036071777, 'test_model_preparation_time': 0.0046, 'test_accuracy': 0.6742424242424242, 'test_eval_f1': 0.6733237410071943, 'test_runtime': 4.463, 'test_samples_per_second': 29.576, 'test_steps_per_second': 2.017}
Saved all_results.json in ./results_test_pretrained_pldqa_load_in
Saved test_predictions.npy in ./results_test_pretrained_pldqa_load_in
              precision    recall  f1-score   support

           0       0.85      0.53      0.66        77
           1       0.57      0.87      0.69        55

    accuracy                           0.67       132
   macro avg       0.71      0.70      0.67       132
weighted avg       0.74      0.67      0.67       132

Saved classification_report_pretrained_pldqa_load_in.csv
Confusion matrix:
 [[41 36]
 [ 7 48]]
Saved confusion_matrix_pldqa_load_in.csv
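
As a sanity check, the reported accuracy and macro F1 can be recomputed by hand from the confusion matrix printed above. The sketch below uses plain Python only; the function name `metrics_from_confusion` is an illustrative choice, and it assumes the scikit-learn layout (rows are true classes, columns are predicted classes).

```python
def metrics_from_confusion(cm):
    """Compute accuracy and macro F1 from a square confusion matrix.

    Rows are true classes and columns are predicted classes,
    matching the layout of sklearn's confusion_matrix.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    f1_scores = []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                       # true class i, predicted as another class
        fp = sum(cm[r][i] for r in range(n)) - tp  # predicted i, actually another class
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return correct / total, sum(f1_scores) / n


acc, macro_f1 = metrics_from_confusion([[41, 36], [7, 48]])
print(f"accuracy={acc:.4f}, macro F1={macro_f1:.4f}")  # → accuracy=0.6742, macro F1=0.6733
```

Both values agree with `test_accuracy` and `test_eval_f1` in the metrics dictionary above, which confirms that the saved CSVs and the Trainer metrics describe the same predictions.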


### B.2 PLDQA: Retrain

In [None]:
# Checking the checkpoint
model_checkpoint

'./danish-bert-adapted-pldqa/danish-bert-adapted-pldqa'

In [None]:
from transformers import (
    Trainer,
    TrainingArguments,
    TrainerCallback,
    AutoConfig,
    AutoModelForSequenceClassification
)
import numpy as np
import pandas as pd
from datasets import concatenate_datasets
from sklearn.metrics import accuracy_score, f1_score

# 1) Combine train + val
hf_trainval = concatenate_datasets([hf_train, hf_val])

# 2) Use the same best_hp from sweep‐selection step
# e.g. best_hp = { "learning_rate": ..., "per_device_train_batch_size": ..., ... }
print("Re-training with hyperparameters:", best_hp)

# 3) Initialize a fresh model from the original PLDQA-adapted checkpoint (defined earlier)
config = AutoConfig.from_pretrained(model_checkpoint, num_labels=2)
model  = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, config=config)

# 4) Define a training‐only metrics function
def compute_metrics_train(p):
    preds = np.argmax(p.predictions, axis=1)
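    return {
        "train_accuracy": accuracy_score(p.label_ids, preds),
        # Weighted F1 here, whereas the validation/test compute_metrics uses macro F1
        "train_f1":       f1_score(p.label_ids, preds, average="weighted"),
        # Note: train_loss is not computed here; evaluate(..., metric_key_prefix="train") adds it to the returned metrics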
    }

# 5) Callback to log per‐epoch training metrics
class TrainLoggingCallback(TrainerCallback):
    def __init__(self):
        self.history = []
        self.trainer = None  # assigned after the Trainer is constructed (cb.trainer = trainer)
    def on_epoch_end(self, args, state, control, **kwargs):
        m = self.trainer.evaluate(self.trainer.train_dataset, metric_key_prefix="train")
        self.history.append(dict(epoch=state.epoch, **m))
        print(
            f"Epoch {int(state.epoch)} | "
            f"loss {m['train_loss']:.4f} "
            f"acc  {m['train_accuracy']:.4f} "
            f"f1   {m['train_f1']:.4f}"
        )

cb = TrainLoggingCallback()

# 6) Set up and run Trainer without any eval_dataset
trainer = Trainer(
    model           = model,
    args            = TrainingArguments(
        output_dir                 = "./best_final_retrained_pldqa_model",
        learning_rate              = best_hp["learning_rate"],
        per_device_train_batch_size= best_hp["per_device_train_batch_size"],
        num_train_epochs           = best_hp["num_train_epochs"],
        weight_decay               = best_hp["weight_decay_hyperparam"],
        warmup_ratio               = best_hp["warmup_ratio"],
        #lr_scheduler_type          = best_hp.get("lr_scheduler_type","linear"), # not used atm

        eval_strategy              = "no",    # skip validation
        logging_strategy           = "no",
        save_strategy              = "epoch",
        report_to                  = []       # turn off W&B
    ),
    train_dataset   = hf_trainval,
    compute_metrics = compute_metrics_train,
    callbacks       = [cb]
)
cb.trainer = trainer

# Note: the progress table printed during training has a "Validation Loss" column. That is just
# Trainer's generic table layout; since evaluate() is called on the training set, it is training loss.
trainer.train()
trainer.save_model("./best_final_retrained_pldqa_model")  # writes config.json, the model weights, etc.

# 7) Inspect & save the training history
train_df = pd.DataFrame(cb.history).set_index("epoch")
train_df.to_csv("train_history_pldqa_retrained.csv")
print(train_df)




Re-training with hyperparameters: {'learning_rate': 2.837463202056722e-05, 'per_device_train_batch_size': 6, 'num_train_epochs': 6, 'weight_decay_hyperparam': 0.15008810256010646, 'warmup_ratio': 0.06}


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./danish-bert-adapted-pldqa/danish-bert-adapted-pldqa and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss,Validation Loss,Accuracy,F1
77,No log,0.455253,0.78308,0.769407
154,No log,0.145619,0.958785,0.958682
231,No log,0.066357,0.978308,0.978336
308,No log,0.008789,0.997831,0.99783
385,No log,0.004874,0.997831,0.99783
462,No log,0.002959,0.997831,0.99783


Epoch 1 | loss 0.4553 acc  0.7831 f1   0.7694
Epoch 2 | loss 0.1456 acc  0.9588 f1   0.9587
Epoch 3 | loss 0.0664 acc  0.9783 f1   0.9783
Epoch 4 | loss 0.0088 acc  0.9978 f1   0.9978
Epoch 5 | loss 0.0049 acc  0.9978 f1   0.9978
Epoch 6 | loss 0.0030 acc  0.9978 f1   0.9978
       train_accuracy  train_f1  train_loss
epoch                                      
1.0          0.783080  0.769407    0.455253
2.0          0.958785  0.958682    0.145619
3.0          0.978308  0.978336    0.066357
4.0          0.997831  0.997830    0.008789
5.0          0.997831  0.997830    0.004874
6.0          0.997831  0.997830    0.002959
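
The step counts in the table above (77, 154, ..., 462) follow from the batch size: each epoch takes one optimizer step per batch, with a final partial batch. A minimal sketch of that arithmetic is given below, assuming a single device and no gradient accumulation (the `Trainer` defaults). The dataset size of 461 is not stated in the notebook; it is inferred from the logged accuracies (for example 0.78308 ≈ 361/461).

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Optimizer steps per epoch on one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size)


# ASSUMPTION: 461 train+val examples, inferred from the logged accuracies
# (e.g. 0.78308 ≈ 361/461); the notebook does not state the size directly.
n_trainval = 461
batch_size = 6   # per_device_train_batch_size from best_hp
epochs = 6       # num_train_epochs from best_hp

per_epoch = steps_per_epoch(n_trainval, batch_size)
print(per_epoch, per_epoch * epochs)  # → 77 462
```

The result matches the first and last values in the `Step` column, so the retraining run completed all six epochs.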


In [None]:
# ─── 8) Evaluate Retrained Model on hf_test ─────────────────────────
import os
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# 8a) Load the retrained checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    "./best_final_retrained_pldqa_model"
)

# 8b) Prepare the Trainer for evaluation
test_args = TrainingArguments(
    output_dir                ="./results_test_pldqa_retrained",
    per_device_eval_batch_size=16,
    report_to=[]
)
test_trainer = Trainer(
    model           = model,
    args            = test_args,
    compute_metrics = compute_metrics    # the original compute_metrics fn
)

# 8c) Run prediction → returns PredictionOutput
test_output = test_trainer.predict(hf_test)
print("→ Final Retrained Test metrics:", test_output.metrics)

# ensure output directory exists
os.makedirs(test_args.output_dir, exist_ok=True)

# 8d) Save the JSON of all test metrics
test_trainer.save_metrics("test", test_output.metrics)
print(f"Saved all_results.json in {test_args.output_dir}")

# 8e) Save raw predictions
preds = test_output.predictions
np.save(os.path.join(test_args.output_dir, "test_predictions.npy"), preds)
print(f"Saved test_predictions.npy in {test_args.output_dir}")

# 8f) Classification report → CSV
y_true = df_test["label_binary"].to_numpy()
y_pred = np.argmax(preds, axis=-1)
print(classification_report(y_true, y_pred))
clf_dict = classification_report(y_true, y_pred, output_dict=True)
pd.DataFrame(clf_dict).T.to_csv(
    os.path.join(test_args.output_dir, "classification_report_pldqa_retrained.csv")
)
print("Saved classification_report_pldqa_retrained.csv")

# 8g) Confusion matrix → CSV
cm = confusion_matrix(y_true, y_pred)
print("Confusion matrix:\n", cm)
cm_df = pd.DataFrame(
    cm,
    index  =["true_0", "true_1"],
    columns=["pred_0", "pred_1"]
)
cm_df.to_csv(
    os.path.join(test_args.output_dir, "confusion_matrix_pldqa_retrained_.csv")
)
print("Saved confusion_matrix_pldqa_retrained_.csv")


→ Final Retrained Test metrics: {'test_loss': 0.8373656868934631, 'test_model_preparation_time': 0.0027, 'test_accuracy': 0.7727272727272727, 'test_eval_f1': 0.7684210526315789, 'test_runtime': 4.3613, 'test_samples_per_second': 30.266, 'test_steps_per_second': 2.064}
Saved all_results.json in ./results_test_pldqa_retrained
Saved test_predictions.npy in ./results_test_pldqa_retrained
              precision    recall  f1-score   support

           0       0.82      0.78      0.80        77
           1       0.71      0.76      0.74        55

    accuracy                           0.77       132
   macro avg       0.77      0.77      0.77       132
weighted avg       0.78      0.77      0.77       132

Saved classification_report_pldqa_retrained.csv
Confusion matrix:
 [[60 17]
 [13 42]]
Saved confusion_matrix_pldqa_retrained_.csv


## C. BERT BASE (baseline)

In [None]:
!pip install "numpy<2.0" # run this and restart the kernel if Colab raises a NumPy-related error with datasets

In [None]:
!pip install transformers datasets evaluate wandb -q

import pandas as pd
import torch
import numpy as np
from datasets import Dataset
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments, EarlyStoppingCallback
)
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedShuffleSplit
import evaluate
import wandb

print("💻 Device in use:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")

# These splits were built and saved in the previous section, so here they are just loaded from disk and sanity-checked
df_train = pd.read_csv("/content/df_train.csv")
df_val   = pd.read_csv("/content/df_val.csv")
df_test  = pd.read_csv("/content/df_test.csv")

# 4d) DEBUG: print distributions before tokenization/training
print("Train label distribution:", df_train['label_binary'].value_counts(normalize=True))
print("Val   label distribution:", df_val  ['label_binary'].value_counts(normalize=True))
print("Test  label distribution:", df_test ['label_binary'].value_counts(normalize=True))

print("Train avg span length:", df_train['span'].str.split().str.len().mean())
print("Val   avg span length:", df_val  ['span'].str.split().str.len().mean())
print("Test  avg span length:", df_test ['span'].str.split().str.len().mean())

print("Train gov dist:", df_train['is_government'].value_counts(normalize=True))
print("Val   gov dist:", df_val  ['is_government'].value_counts(normalize=True))
print("Test  gov dist:", df_test ['is_government'].value_counts(normalize=True))

# Check for overlapping debate_unit_id across all splits
train_ids = set(df_train['debate_unit_id'])
val_ids   = set(df_val  ['debate_unit_id'])
test_ids  = set(df_test ['debate_unit_id'])

overlap_train_val  = train_ids & val_ids
overlap_train_test = train_ids & test_ids
overlap_val_test   = val_ids   & test_ids

print(f"Overlap train/val IDs:  {len(overlap_train_val)}")
print(f"Overlap train/test IDs: {len(overlap_train_test)}")
print(f"Overlap val/test IDs:   {len(overlap_val_test)}")

if overlap_train_val:
    print("Example train–val overlap:", list(overlap_train_val)[:10])
if overlap_train_test:
    print("Example train–test overlap:", list(overlap_train_test)[:10])
if overlap_val_test:
    print("Example val–test overlap:",   list(overlap_val_test)[:10])


In [None]:
# ─── 5) Tokenizer & tokenize fn ───────────────────
model_checkpoint = "Maltehb/danish-bert-botxo" # Loading from huggingface directly
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize(batch):
    return tokenizer(batch["input_text"],
                     truncation=True,
                     padding="max_length",
                     max_length=512)


all_cols = ['input_text','label_binary']
hf_train = Dataset.from_pandas(df_train[all_cols].reset_index(drop=True))
hf_val   = Dataset.from_pandas(df_val  [all_cols].reset_index(drop=True))
hf_test  = Dataset.from_pandas(df_test [all_cols].reset_index(drop=True))

# map, rename, and rebind each split
hf_train = (
    hf_train
    .map(tokenize, batched=True)
    .rename_column("label_binary", "labels")
)
hf_val = (
    hf_val
    .map(tokenize, batched=True)
    .rename_column("label_binary", "labels")
)
hf_test = (
    hf_test
    .map(tokenize, batched=True)
    .rename_column("label_binary", "labels")
)

# now explicitly set the columns wanted returned
hf_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
hf_val  .set_format("torch", columns=["input_ids", "attention_mask", "labels"])
hf_test .set_format("torch", columns=["input_ids", "attention_mask", "labels"])


# ─── 7) Metrics & callback ────────────────────────
accuracy = evaluate.load("accuracy")
f1       = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1":       f1.compute(predictions=preds, references=labels, average="macro")["f1"]
    }

from transformers import TrainerCallback
class TrainMetricsCallback(TrainerCallback):
    def __init__(self, trainer=None):
        self.trainer = trainer

    def on_epoch_end(self, args, state, control, **kwargs):
        # predict() runs on the *training* set here, purely for logging.
        # WARNING: Hugging Face's W&B integration logs predict() metrics under "test/…",
        # even though this is training-set performance, not true test-set metrics.
        if not self.trainer:
            return
        pred = self.trainer.predict(self.trainer.train_dataset)
        p = np.argmax(pred.predictions, axis=-1)
        l = pred.label_ids
        wandb.log({
            "train/accuracy": accuracy.compute(predictions=p, references=l)["accuracy"],
            "train/f1":       f1.compute(predictions=p, references=l, average="macro")["f1"],
            "train/loss":     pred.metrics["test_loss"],
            "epoch":          state.epoch
        })

In [None]:
# ─── 8) W&B sweep config ──────────────────────────

sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'eval/f1', 'goal': 'maximize'},
    'parameters': {
        'per_device_train_batch_size': {'values': [2,4,6,8,16]},
        'learning_rate':              {'min': 5e-6, 'max': 3e-5},
        'num_train_epochs':           {'values':[3,4,5,6,7,8]},
        'weight_decay_hyperparam':    {'min':0.01, 'max':0.25},
        'warmup_ratio':               {'values':[0.0,0.06,0.1]},
        # Adding these for some extra checks
        #'dropout':                    {'values':[0.1,0.2,0.3]},
        #'lr_scheduler_type':          {'values':['linear','cosine']}


    }
}

sweep_id = wandb.sweep(sweep_config,
                      project="danish-bert-answer-base-danish-bert-binary")

# ─── 9) Sweep training fn (uses hf_val for eval) ───
def train_sweep():
    wandb.init()
    config = AutoModelForSequenceClassification.from_pretrained(
        model_checkpoint
    ).config
    config.num_labels = 2
    #config.hidden_dropout_prob          = wandb.config.dropout
    #config.attention_probs_dropout_prob = wandb.config.dropout

    args = TrainingArguments(
        output_dir="./results",
        eval_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="epoch",
        learning_rate=wandb.config.learning_rate,
        per_device_train_batch_size=wandb.config.per_device_train_batch_size,
        per_device_eval_batch_size=wandb.config.per_device_train_batch_size,
        num_train_epochs=wandb.config.num_train_epochs,
        weight_decay=wandb.config.weight_decay_hyperparam,
        warmup_ratio=wandb.config.warmup_ratio,
        #lr_scheduler_type      =wandb.config.lr_scheduler_type,

        load_best_model_at_end=True,
        #metric_for_best_model="f1",
        metric_for_best_model="eval_f1",
        greater_is_better = True,
        report_to=["wandb"]#,
        #remove_unused_columns=False
    )

    model = AutoModelForSequenceClassification.from_pretrained(
        model_checkpoint, config=config
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=hf_train,
        eval_dataset =hf_val,                 # <<< validation split
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(
                        early_stopping_patience=2)] #3 before
    )
    trainer.add_callback(TrainMetricsCallback(trainer=trainer))
    trainer.train()
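    # NOTE: every trial writes to this same directory, so after the sweep it holds
    # the model from the last trial that ran, not necessarily the best one.
    trainer.save_model("./best_sweep_model")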

# Run 30 sweep trials with a single agent
wandb.agent(sweep_id, train_sweep, count=30)
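
In the sweep config, `learning_rate` is given only a `min` and a `max`, which W&B samples as a uniform range as far as its documentation describes. Since learning rates are often explored on a log scale, a possible variant is sketched below; it switches that one parameter to the `log_uniform_values` distribution. This is an optional alternative, not what the notebook ran, and the helper name `with_log_uniform_lr` is hypothetical.

```python
import copy


def with_log_uniform_lr(sweep_config: dict) -> dict:
    """Return a copy of a sweep config whose learning rate is sampled log-uniformly."""
    new_config = copy.deepcopy(sweep_config)
    lr = new_config["parameters"]["learning_rate"]
    # 'log_uniform_values' samples between min and max on a log scale.
    lr["distribution"] = "log_uniform_values"
    return new_config


# Reduced copy of the notebook's sweep config, for illustration
base_config = {
    "method": "bayes",
    "metric": {"name": "eval/f1", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 5e-6, "max": 3e-5},
        "warmup_ratio": {"values": [0.0, 0.06, 0.1]},
    },
}
alt_config = with_log_uniform_lr(base_config)
print(alt_config["parameters"]["learning_rate"])
```

The original dictionary is left untouched, so both variants can be registered as separate sweeps and compared.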


### C.1 BERT-base (baseline): Load-in testing

In [None]:
# --- 10) Load the best sweep model and evaluate on hf_test (only works if the model was saved before the Colab runtime ran out):
# from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification
# import numpy as np
# from sklearn.metrics import classification_report, confusion_matrix

# # 1) Load the model artifact that the sweep saved
# best_model = AutoModelForSequenceClassification.from_pretrained("./best_sweep_model")

# # Let's take a look
# best_model

# --- If the runtime ran out, fetch the best run (highest eval/f1 in this initial modelling round) directly from W&B and rerun it with the sweep code
# (or, if the model was saved to disk, upload its folder to Colab)
import wandb

# 1) Login & point at sweep
wandb.login()
api      = wandb.Api()
project  = "pernillebrams/danish-bert-answer-base-danish-bert-binary"
sweep_id = "dci7w3rc"
sweep    = api.sweep(f"{project}/{sweep_id}")

# 2) Find the run with the highest eval/f1 (single criterion)
best_run = max(
    (run for run in sweep.runs if run.summary.get("eval/f1") is not None),
    key=lambda r: r.summary["eval/f1"]
)
best_f1 = best_run.summary["eval/f1"]
print(f"Selected run: {best_run.id} (name={best_run.name}), eval/f1 = {best_f1:.4f}")

# 3) Extract its hyperparameters
hp_keys = [
    "learning_rate",
    "per_device_train_batch_size",
    "num_train_epochs",
    "weight_decay_hyperparam",
    "warmup_ratio",
    # add "lr_scheduler_type" or "dropout" here if needed (for followups)
]
best_hp = {k: best_run.config[k] for k in hp_keys}
print("Best hyperparameters extracted:", best_hp)


Selected run: ytej22tl (name=classic-sweep-29), eval/f1 = 0.7956
Best hyperparameters extracted: {'learning_rate': 2.6217449527811408e-05, 'per_device_train_batch_size': 6, 'num_train_epochs': 7, 'weight_decay_hyperparam': 0.2024789742475012, 'warmup_ratio': 0.06}


In [None]:
# Plots for it here directly on the WandB page
best_run

In [None]:
# The model files for this run were not saved to disk properly, so the best trial is rerun with its exact hyperparameters to regenerate them

# Check the checkpoint
model_checkpoint

'Maltehb/danish-bert-botxo'

In [None]:
# Preparing and rerunning the exact sweep trial
import wandb, numpy as np, evaluate
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    EarlyStoppingCallback
)

# ─── 1) Compute‐metrics fn ─────────────────────────────────────────────────
accuracy = evaluate.load("accuracy")
f1       = evaluate.load("f1")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
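    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        # The key already starts with "eval_", so Trainer leaves it as "eval_f1" during
        # evaluation; under predict() it shows up as "test_eval_f1".
        "eval_f1":  f1.compute(predictions=preds, references=labels, average="macro")["f1"]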
    }

# ─── 2) Start a new W&B run with the best_hp ───────────────────────────────
wandb.init(
    project="danish-bert-answer-base-danish-bert-binary",
    entity="pernillebrams",
    name="manual-rerun-best-pretrained-bert-base",
    config=best_hp,
    reinit=True
)

# ─── 3) Build the model & args ──────────────────────────────────────────────
config = AutoModelForSequenceClassification.from_pretrained(model_checkpoint).config
config.num_labels = 2
model  = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, config=config)

args = TrainingArguments(
    output_dir                ="./best_rerun_model_bertbase",
    learning_rate             = best_hp["learning_rate"],
    per_device_train_batch_size=best_hp["per_device_train_batch_size"],
    per_device_eval_batch_size =best_hp["per_device_train_batch_size"],
    num_train_epochs          = best_hp["num_train_epochs"],
    weight_decay              = best_hp["weight_decay_hyperparam"],
    warmup_ratio              = best_hp["warmup_ratio"],
    #lr_scheduler_type         = best_hp.get("lr_scheduler_type","linear"),

    eval_strategy             ="epoch",
    save_strategy             ="epoch",
    logging_strategy          ="epoch",
    load_best_model_at_end    = True,
    metric_for_best_model     = "eval_f1",
    greater_is_better         = True,

    report_to                 = ["wandb"]
)

# ─── 4) Trainer with F₁‐only EarlyStopping ──────────────────────────────────
trainer = Trainer(
    model           = model,
    args            = args,
    train_dataset   = hf_train,
    eval_dataset    = hf_val,
    compute_metrics = compute_metrics,
    callbacks       = [EarlyStoppingCallback(early_stopping_patience=2)]
)

# ─── 5) Train & Save ────────────────────────────────────────────────────────
trainer.train()
trainer.save_model("./best_rerun_model_bertbase")

wandb.finish()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at Maltehb/danish-bert-botxo and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,F1,Accuracy
1,0.7177,0.67528,0.431998,0.55914
2,0.601,0.628658,0.610591,0.634409
3,0.4875,0.617318,0.687594,0.688172
4,0.276,0.643011,0.688517,0.698925
5,0.1261,0.727465,0.685847,0.688172
6,0.0522,0.71956,0.739496,0.741935
7,0.0335,0.807686,0.731058,0.731183


0,1
eval/accuracy,▁▄▆▆▆██
eval/f1,▁▅▇▇▇██
eval/loss,▃▁▁▂▅▅█
eval/runtime,▁█▂▅▄▆▃
eval/samples_per_second,█▁▇▄▅▃▅
eval/steps_per_second,█▁▇▄▅▃▅
train/epoch,▁▁▂▂▃▃▅▅▆▆▇▇███
train/global_step,▁▁▂▂▃▃▅▅▆▆▇▇███
train/grad_norm,▅█▆▃▁▁▁
train/learning_rate,█▇▆▄▃▂▁

0,1
eval/accuracy,0.73118
eval/f1,0.73106
eval/loss,0.80769
eval/runtime,3.354
eval/samples_per_second,27.728
eval/steps_per_second,4.77
total_flos,677774078607360.0
train/epoch,7.0
train/global_step,434.0
train/grad_norm,1.82838


In [None]:
# ─── 8) Evaluate on hf_test ────────────────────────────────────────────────────
import numpy as np
import pandas as pd
import os
from sklearn.metrics import classification_report, confusion_matrix
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

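# 8a) Load the model from "./best_sweep_model", which was uploaded to the runtime from disk
#     (note: this is not the "./best_rerun_model_bertbase" directory saved by the rerun above)
model = AutoModelForSequenceClassification.from_pretrained("./best_sweep_model")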

# 8b) Prepare the Trainer for evaluation
test_args = TrainingArguments(
    output_dir                ="./results_test_pretrained_bertbase_load_in",
    per_device_eval_batch_size=16,
    report_to=[]
)
test_trainer = Trainer(
    model           = model,
    args            = test_args,
    compute_metrics = compute_metrics  # as defined earlier
)

# 8c) Run prediction → returns PredictionOutput
test_output = test_trainer.predict(hf_test)
print("→ Final Test metrics:", test_output.metrics)

# ensure output directory exists
os.makedirs(test_args.output_dir, exist_ok=True)

# 8d) Save the JSON of all test metrics
test_trainer.save_metrics("test", test_output.metrics)
print(f"Saved all_results.json in {test_args.output_dir}")

# 8e) Save raw predictions
preds = test_output.predictions
np.save(os.path.join(test_args.output_dir, "test_predictions.npy"), preds)
print(f"Saved test_predictions.npy in {test_args.output_dir}")

# 8f) Classification report → CSV
y_true = df_test["label_binary"].to_numpy()
y_pred = np.argmax(preds, axis=-1)
clf_dict = classification_report(y_true, y_pred, output_dict=True)
print(classification_report(y_true, y_pred))

pd.DataFrame(clf_dict).T.to_csv(
    os.path.join(test_args.output_dir, "classification_report_pretrained_bertbase_load_in.csv")
)
print("Saved classification_report_pretrained_bertbase_load_in.csv")

# 8g) Confusion matrix → CSV
cm = confusion_matrix(y_true, y_pred)
print("Confusion matrix:\n", cm)
cm_df = pd.DataFrame(
    cm,
    index=["true_0", "true_1"],
    columns=["pred_0", "pred_1"]
)
cm_df.to_csv(
    os.path.join(test_args.output_dir, "confusion_matrix_bertbase_load_in.csv")
)
print("Saved confusion_matrix_bertbase_load_in.csv")


→ Final Test metrics: {'test_loss': 0.9354375004768372, 'test_model_preparation_time': 0.0028, 'test_accuracy': 0.6818181818181818, 'test_eval_f1': 0.674342105263158, 'test_runtime': 4.39, 'test_samples_per_second': 30.068, 'test_steps_per_second': 2.05}
Saved all_results.json in ./results_test_pretrained_bertbase_load_in
Saved test_predictions.npy in ./results_test_pretrained_bertbase_load_in
              precision    recall  f1-score   support

           0       0.73      0.71      0.72        77
           1       0.61      0.64      0.62        55

    accuracy                           0.68       132
   macro avg       0.67      0.68      0.67       132
weighted avg       0.68      0.68      0.68       132

Saved classification_report_pretrained_bertbase_load_in.csv
Confusion matrix:
 [[55 22]
 [20 35]]
Saved confusion_matrix_bertbase_load_in.csv
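
As a cross-check, the per-class figures in the report above can be recomputed from the confusion matrix alone. The sketch below is plain Python with no dependencies. The helper name `report_from_confusion` is hypothetical and not part of the notebook, and it assumes rows are true labels and columns are predicted labels, as in scikit-learn's `confusion_matrix`.

```python
def report_from_confusion(cm):
    """Per-class precision/recall/F1 plus accuracy from a square confusion matrix.

    Assumes cm[i][j] counts samples with true label i and predicted label j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    report = {}
    for k in range(n_classes):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n_classes))  # column sum
        actual_k = sum(cm[k])                                   # row sum (support)
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[k] = {"precision": precision, "recall": recall,
                     "f1": f1, "support": actual_k}
    report["accuracy"] = sum(cm[k][k] for k in range(n_classes)) / total
    return report


# Confusion matrix printed above for the loaded BERT-base model
cm_load_in = [[55, 22], [20, 35]]
rep = report_from_confusion(cm_load_in)
for label in (0, 1):
    r = rep[label]
    print(f"class {label}: precision={r['precision']:.3f} recall={r['recall']:.3f} "
          f"f1={r['f1']:.3f} support={r['support']}")
print(f"accuracy={rep['accuracy']:.3f}")
```

Up to rounding, the values should agree with the classification report, and the unweighted mean of the two per-class F1 scores should agree with `test_eval_f1`.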


### C.2 BERT-BASE (baseline): Retrain

In [None]:
# Check the checkpoint
model_checkpoint

'Maltehb/danish-bert-botxo'

In [None]:
from transformers import (
    Trainer,
    TrainingArguments,
    TrainerCallback,
    AutoConfig,
    AutoModelForSequenceClassification
)
import numpy as np
import pandas as pd
from datasets import concatenate_datasets
from sklearn.metrics import accuracy_score, f1_score

# 1) Combine train + val
hf_trainval = concatenate_datasets([hf_train, hf_val])

# 2) Use the same best_hp from sweep‐selection step
# e.g. best_hp = { "learning_rate": ..., "per_device_train_batch_size": ..., ... }
print("Re-training with hyperparameters:", best_hp)

# 3) Initialize a fresh model from the original BERT-base checkpoint (model_checkpoint, shown above)
config = AutoConfig.from_pretrained(model_checkpoint, num_labels=2)
model  = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, config=config)

# 4) Define a training‐only metrics function
def compute_metrics_train(p):
    preds = np.argmax(p.predictions, axis=1)
    return {
        "train_accuracy": accuracy_score(p.label_ids, preds),
        "train_f1":       f1_score(p.label_ids, preds, average="weighted"),
        # Note: train_loss is not computed here; Trainer.evaluate(metric_key_prefix="train") adds it to the returned metrics
    }

# 5) Callback to log per‐epoch training metrics
class TrainLoggingCallback(TrainerCallback):
    def __init__(self):
        self.history = []
    def on_epoch_end(self, args, state, control, **kwargs):
        m = self.trainer.evaluate(self.trainer.train_dataset, metric_key_prefix="train")
        self.history.append(dict(epoch=state.epoch, **m))
        print(
            f"Epoch {int(state.epoch)} | "
            f"loss {m['train_loss']:.4f} "
            f"acc  {m['train_accuracy']:.4f} "
            f"f1   {m['train_f1']:.4f}"
        )

cb = TrainLoggingCallback()

# 6) Set up and run Trainer without any eval_dataset
trainer = Trainer(
    model           = model,
    args            = TrainingArguments(
        output_dir                 = "./best_final_retrained_bertbase_model",
        learning_rate              = best_hp["learning_rate"],
        per_device_train_batch_size= best_hp["per_device_train_batch_size"],
        num_train_epochs           = best_hp["num_train_epochs"],
        weight_decay               = best_hp["weight_decay_hyperparam"],
        warmup_ratio               = best_hp["warmup_ratio"],
        #lr_scheduler_type          = best_hp.get("lr_scheduler_type","linear"), # not used atm

        eval_strategy              = "no",    # skip validation
        logging_strategy           = "no",
        save_strategy              = "epoch",
        report_to                  = []       # turn off W&B
    ),
    train_dataset   = hf_trainval,
    compute_metrics = compute_metrics_train,
    callbacks       = [cb]
)
cb.trainer = trainer

# Note: the progress table shown while this runs has a "Validation Loss" column. That is the
# Trainer's generic header; the callback evaluates on the training set, so the values are training loss.
trainer.train()
trainer.save_model("./best_final_retrained_bertbase_model")  # writes config.json and the model weights

# 7) Inspect & save the training history
train_df = pd.DataFrame(cb.history).set_index("epoch")
train_df.to_csv("train_history_bertbase_retrained.csv")
print(train_df)


  block_group = [InMemoryTable(cls._concat_blocks(list(block_group), axis=axis))]
  table = cls._concat_blocks(blocks, axis=0)


Re-training with hyperparameters: {'learning_rate': 2.6217449527811408e-05, 'per_device_train_batch_size': 6, 'num_train_epochs': 7, 'weight_decay_hyperparam': 0.2024789742475012, 'warmup_ratio': 0.06}


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at Maltehb/danish-bert-botxo and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss,Validation Loss,Accuracy,F1
77,No log,0.558217,0.733189,0.713968
154,No log,0.431699,0.81128,0.810399
231,No log,0.391525,0.813449,0.810179
308,No log,0.056042,0.986985,0.986975
385,No log,0.023354,0.993492,0.993491
462,No log,0.012218,0.997831,0.99783
539,No log,0.010334,0.997831,0.99783


Epoch 1 | loss 0.5582 acc  0.7332 f1   0.7140
Epoch 2 | loss 0.4317 acc  0.8113 f1   0.8104
Epoch 3 | loss 0.3915 acc  0.8134 f1   0.8102
Epoch 4 | loss 0.0560 acc  0.9870 f1   0.9870
Epoch 5 | loss 0.0234 acc  0.9935 f1   0.9935
Epoch 6 | loss 0.0122 acc  0.9978 f1   0.9978
Epoch 7 | loss 0.0103 acc  0.9978 f1   0.9978
       train_accuracy  train_f1  train_loss
epoch                                      
1.0          0.733189  0.713968    0.558217
2.0          0.811280  0.810399    0.431699
3.0          0.813449  0.810179    0.391525
4.0          0.986985  0.986975    0.056042
5.0          0.993492  0.993491    0.023354
6.0          0.997831  0.997830    0.012218
7.0          0.997831  0.997830    0.010334


In [None]:
# ─── 8) Evaluate Retrained Model on hf_test ─────────────────────────
import os
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# 8a) Load the retrained checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    "./best_final_retrained_bertbase_model"
)

# 8b) Prepare the Trainer for evaluation
test_args = TrainingArguments(
    output_dir                ="./results_test_bertbase_retrained",
    per_device_eval_batch_size=16,
    report_to=[]
)
test_trainer = Trainer(
    model           = model,
    args            = test_args,
    compute_metrics = compute_metrics    # the original compute_metrics fn
)

# 8c) Run prediction → returns PredictionOutput
test_output = test_trainer.predict(hf_test)
print("→ Final Retrained Test metrics:", test_output.metrics)

# ensure output directory exists
os.makedirs(test_args.output_dir, exist_ok=True)

# 8d) Save the JSON of all test metrics
test_trainer.save_metrics("test", test_output.metrics)
print(f"Saved all_results.json in {test_args.output_dir}")

# 8e) Save raw predictions
preds = test_output.predictions
np.save(os.path.join(test_args.output_dir, "test_predictions.npy"), preds)
print(f"Saved test_predictions.npy in {test_args.output_dir}")

# 8f) Classification report → CSV
y_true = df_test["label_binary"].to_numpy()
y_pred = np.argmax(preds, axis=-1)
print(classification_report(y_true, y_pred))
clf_dict = classification_report(y_true, y_pred, output_dict=True)
pd.DataFrame(clf_dict).T.to_csv(
    os.path.join(test_args.output_dir, "classification_report_bertbase_retrained.csv")
)
print("Saved classification_report_bertbase_retrained.csv")

# 8g) Confusion matrix → CSV
cm = confusion_matrix(y_true, y_pred)
print("Confusion matrix:\n", cm)
cm_df = pd.DataFrame(
    cm,
    index  =["true_0", "true_1"],
    columns=["pred_0", "pred_1"]
)
cm_df.to_csv(
    os.path.join(test_args.output_dir, "confusion_matrix_bertbase_retrained_.csv")
)
print("Saved confusion_matrix_bertbase_retrained_.csv")


→ Final Retrained Test metrics: {'test_loss': 1.225040078163147, 'test_model_preparation_time': 0.0036, 'test_accuracy': 0.6818181818181818, 'test_eval_f1': 0.6770736253494875, 'test_runtime': 4.1367, 'test_samples_per_second': 31.909, 'test_steps_per_second': 2.176}
Saved all_results.json in ./results_test_bertbase_retrained
Saved test_predictions.npy in ./results_test_bertbase_retrained
              precision    recall  f1-score   support

           0       0.75      0.69      0.72        77
           1       0.61      0.67      0.64        55

    accuracy                           0.68       132
   macro avg       0.68      0.68      0.68       132
weighted avg       0.69      0.68      0.68       132

Saved classification_report_bertbase_retrained.csv
Confusion matrix:
 [[53 24]
 [18 37]]
Saved confusion_matrix_bertbase_retrained_.csv
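
The `test_eval_f1` value above matches a macro average (both classes count equally), whereas `compute_metrics_train` in the retraining cell uses `average="weighted"` (classes count in proportion to their support), so the training and test F1 figures are not directly comparable. A minimal dependency-free sketch of the difference, using the confusion matrix printed above; the function name `f1_averages` is illustrative only:

```python
def f1_averages(cm):
    """Return (macro_f1, weighted_f1) for a confusion matrix with rows = true labels."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    f1s, supports = [], []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, but true label differs
        fn = sum(cm[k]) - tp                        # true k, but predicted otherwise
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
        supports.append(sum(cm[k]))
    macro = sum(f1s) / n
    weighted = sum(f * s for f, s in zip(f1s, supports)) / total
    return macro, weighted


macro_f1, weighted_f1 = f1_averages([[53, 24], [18, 37]])
print(f"macro F1    = {macro_f1:.4f}")     # → 0.6771
print(f"weighted F1 = {weighted_f1:.4f}")  # → 0.6836
```

With 77 vs. 55 test examples the gap is small here, but it grows as the class balance becomes more skewed.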


# DUAL-CRITERION MODELS (FINE-TUNING AND TESTING)

In [None]:
# Setup after a kernel restart, part 1
!pip install "numpy<2.0" # run this cell, then restart the kernel


Collecting numpy<2.0
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.0.2
    Uninstalling numpy-2.0.2:
      Successfully uninstalled numpy-2.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you h

In [None]:
# Setup after a kernel restart, part 2
# Install dependencies and load the data files (split and saved earlier)
!pip install transformers datasets evaluate wandb -q

import pandas as pd
import torch
import numpy as np
from datasets import Dataset
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    Trainer, TrainingArguments, EarlyStoppingCallback
)
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedShuffleSplit
import evaluate
import wandb

print("💻 Device in use:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")

df_train = pd.read_csv("/content/df_train.csv")
df_val   = pd.read_csv("/content/df_val.csv")
df_test  = pd.read_csv("/content/df_test.csv")


💻 Device in use: Tesla T4


### 1 PRETRAINED ALL

In [None]:
# Load the tokenizer, tokenize the splits, and define the metrics and callback
model_checkpoint = "./danish-bert-adapted/danish-bert-adapted"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize(batch):
    return tokenizer(batch["input_text"],
                     truncation=True,
                     padding="max_length",
                     max_length=512)


all_cols = ['input_text','label_binary']
hf_train = Dataset.from_pandas(df_train[all_cols].reset_index(drop=True))
hf_val   = Dataset.from_pandas(df_val  [all_cols].reset_index(drop=True))
hf_test  = Dataset.from_pandas(df_test [all_cols].reset_index(drop=True))

# map, rename, and rebind each split
hf_train = (
    hf_train
    .map(tokenize, batched=True)
    .rename_column("label_binary", "labels")
)
hf_val = (
    hf_val
    .map(tokenize, batched=True)
    .rename_column("label_binary", "labels")
)
hf_test = (
    hf_test
    .map(tokenize, batched=True)
    .rename_column("label_binary", "labels")
)

# explicitly set which columns are returned (as torch tensors)
hf_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
hf_val  .set_format("torch", columns=["input_ids", "attention_mask", "labels"])
hf_test .set_format("torch", columns=["input_ids", "attention_mask", "labels"])


# ─── 7) Metrics & callback ────────────────────────

accuracy = evaluate.load("accuracy")
f1       = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1":       f1.compute(predictions=preds, references=labels, average="macro")["f1"]
    }

from transformers import TrainerCallback
class TrainMetricsCallback(TrainerCallback):
    def __init__(self, trainer=None):
        self.trainer = trainer

    # We use predict() here on the *training* set purely for logging.
    # WARNING: Hugging Face's WandB integration will log these under "test/…"
    # even though this is training-data performance, not true test-set metrics.
    def on_epoch_end(self, args, state, control, **kwargs):
        if not self.trainer:
            return
        pred = self.trainer.predict(self.trainer.train_dataset)
        p = np.argmax(pred.predictions, axis=-1)
        l = pred.label_ids
        wandb.log({
            "train/accuracy": accuracy.compute(predictions=p, references=l)["accuracy"],
            "train/f1":       f1.compute(predictions=p, references=l, average="macro")["f1"],
            "train/loss":     pred.metrics["test_loss"],
            "epoch":          state.epoch
        })

Map:   0%|          | 0/368 [00:00<?, ? examples/s]

Map:   0%|          | 0/93 [00:00<?, ? examples/s]

Map:   0%|          | 0/132 [00:00<?, ? examples/s]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



#### 1.1 PRETRAINED ALL TESTING: LOAD IN

In [None]:
# ----- Identifying the best dual-criterion run
import wandb
import numpy as np

# 1) Login & point at sweep
wandb.login()
api      = wandb.Api()
project  = "pernillebrams/danish-bert-answer-all-pretrained-all-binary"
sweep_id = "xrjzmicm"
sweep    = api.sweep(f"{project}/{sweep_id}")

candidates = []
for run in sweep.runs:
    summary = run.summary
    # Use a distinct name so the global `f1` metric object from `evaluate` is not overwritten
    run_f1 = summary.get("eval/f1", None)
    if run_f1 is None:
        continue  # no F1 logged

    # 2) Pull down the eval/loss history
    hist = run.history(keys=["eval/loss"], pandas=False)  # a list of dicts
    losses = [row["eval/loss"] for row in hist if row.get("eval/loss") is not None]

    if len(losses) < 2:
        continue  # not enough points to establish a trend

    # 3) Check whether the loss is trending downward
    #    (here we simply compare the first vs. last logged loss)
    if losses[-1] < losses[0]:
        candidates.append((run, run_f1, losses[0], losses[-1]))

# 4) Pick the winner with the highest eval/f1
if not candidates:
    # Fail early: the hyperparameter extraction below needs a selected run
    raise RuntimeError("No runs found with decreasing eval/loss.")

best_run, best_f1, start_loss, end_loss = max(candidates, key=lambda x: x[1])
print(f"Selected run: {best_run.id}, sweep name: {sweep.name}, run name: {best_run.name}")
print(f"  eval/f1 = {best_f1:.4f}")
print(f"  eval/loss: {start_loss:.4f} → {end_loss:.4f}  (downward trend)")

print("\nHyperparameters:")
for hp in ["learning_rate","per_device_train_batch_size","num_train_epochs",
           "weight_decay_hyperparam","warmup_ratio"]:
    print(f"  {hp}: {best_run.config.get(hp)}")

# 5) Extract the hyperparameters
hp_keys = [
    "learning_rate",
    "per_device_train_batch_size",
    "num_train_epochs",
    "weight_decay_hyperparam",
    "warmup_ratio",
    #"lr_scheduler_type",
    #"dropout"
]

best_hp = {k: best_run.config[k] for k in hp_keys}
print("Best hyperparameters extracted:", best_hp)

Selected run: wliw6wj9, sweep name: xrjzmicm, run name: ethereal-sweep-17
  eval/f1 = 0.7732
  eval/loss: 0.6645 → 0.5549  (downward trend)

Hyperparameters:
  learning_rate: 1.0390934869072987e-05
  per_device_train_batch_size: 6
  num_train_epochs: 8
  weight_decay_hyperparam: 0.07816342480365049
  warmup_ratio: 0.1
Best hyperparameters extracted: {'learning_rate': 1.0390934869072987e-05, 'per_device_train_batch_size': 6, 'num_train_epochs': 8, 'weight_decay_hyperparam': 0.07816342480365049, 'warmup_ratio': 0.1}
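
The selection rule above can be isolated from the W&B API so it is easy to test. The sketch below applies the same two criteria (the last logged `eval/loss` must be lower than the first, then take the highest `eval/f1`) to plain dictionaries. The record layout and the function name `select_dual_criterion` are assumptions made for illustration, and the toy runs are invented, not taken from the sweep.

```python
from typing import Optional


def select_dual_criterion(runs: list) -> Optional[dict]:
    """Pick the run with the highest F1 among runs whose eval loss went down.

    Each run is a dict with hypothetical keys:
      "id": str, "f1": float or None, "losses": list of logged eval losses.
    """
    candidates = []
    for run in runs:
        if run.get("f1") is None:
            continue  # no F1 logged
        losses = [x for x in run.get("losses", []) if x is not None]
        if len(losses) < 2:
            continue  # not enough points to establish a trend
        if losses[-1] < losses[0]:  # coarse trend test: first vs. last loss
            candidates.append(run)
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["f1"])


toy_runs = [
    {"id": "a", "f1": 0.80, "losses": [0.60, 0.75]},  # best F1, but loss went up
    {"id": "b", "f1": 0.77, "losses": [0.66, 0.55]},  # loss went down
    {"id": "c", "f1": 0.70, "losses": [0.69, 0.50]},  # loss went down, lower F1
    {"id": "d", "f1": None, "losses": [0.70, 0.60]},  # no F1 logged
]
best = select_dual_criterion(toy_runs)
print(best["id"])  # → b
```

Comparing only the first and last loss is a coarse trend test: a run whose loss dips and then climbs back to just under its starting value still passes.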


In [None]:
# -- Rerunning the exact sweep run, but with dual-monitoring early stopping
# ─── 2) Metrics fn ─────────────────────────────────────────────────────────────
accuracy = evaluate.load("accuracy")
f1       = evaluate.load("f1")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1":       f1.compute(predictions=preds, references=labels, average="macro")["f1"]
    }

# ─── 3) Best hyperparameters ───────────────────────────────────────────────────
# Reuses `best_hp`, extracted dynamically in the cell above

# ─── 4) WandB init (for logging) ──────────────────────────────────────
wandb.init(
    project="danish-bert-answer-all-pretrained-all-binary",
    entity="pernillebrams",
    name="manual-replay-best-pretrained-all-dual",
    config=best_hp,
    reinit=True
)

# ─── 5) Build model & TrainingArguments ────────────────────────────────────────
# Load only the config; this avoids instantiating the full model twice
from transformers import AutoConfig
config = AutoConfig.from_pretrained(model_checkpoint, num_labels=2)
#config.hidden_dropout_prob          = best_hp["dropout"]
#config.attention_probs_dropout_prob = best_hp["dropout"]

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, config=config)

from transformers import TrainerCallback

# Dual-criterion early stopping: monitor validation loss and F₁ with separate patience counters
class LossAndF1EarlyStop(TrainerCallback):
    """Stop once eval loss or eval F1 has failed to improve for more than its patience."""
    def __init__(self, loss_patience=1, f1_patience=2):
        self.loss_patience, self.f1_patience = loss_patience, f1_patience
        self.best_loss, self.loss_wait = float('inf'), 0
        self.best_f1, self.f1_wait   = 0, 0

    def on_evaluate(self, args, state, control, metrics=None, **_):
        # Skip evaluations that do not report both monitored metrics
        if not metrics or "eval_loss" not in metrics or "eval_f1" not in metrics:
            return control
        loss = metrics["eval_loss"]
        f1   = metrics["eval_f1"]
        # track loss (lower is better)
        if loss < self.best_loss:
            self.best_loss, self.loss_wait = loss, 0
        else:
            self.loss_wait += 1
        # track f1 (higher is better)
        if f1 > self.best_f1:
            self.best_f1, self.f1_wait = f1, 0
        else:
            self.f1_wait += 1
        # stop if either has stalled for more than its patience
        if self.loss_wait > self.loss_patience or self.f1_wait > self.f1_patience:
            control.should_training_stop = True
        return control


args = TrainingArguments(
    output_dir                ="./best_replay_model_dual",
    learning_rate             =best_hp["learning_rate"],
    per_device_train_batch_size=best_hp["per_device_train_batch_size"],
    per_device_eval_batch_size =best_hp["per_device_train_batch_size"],
    num_train_epochs          =best_hp["num_train_epochs"],
    weight_decay              =best_hp["weight_decay_hyperparam"],
    warmup_ratio              =best_hp["warmup_ratio"],
    #lr_scheduler_type         =best_hp["lr_scheduler_type"], # not used atm

    eval_strategy             ="epoch",
    save_strategy             ="epoch",
    logging_strategy          ="epoch",
    load_best_model_at_end    =True,
    metric_for_best_model     ="eval_f1",
    greater_is_better         =True,

    report_to=["wandb"]
)

# ─── 6) Trainer with dual-criterion early stopping ─────────────────────────────
trainer = Trainer(
    model           =model,
    args            =args,
    train_dataset   =hf_train,
    eval_dataset    =hf_val,
    compute_metrics =compute_metrics,
    #callbacks       =[EarlyStoppingCallback(early_stopping_patience=2)]
    # Halt as soon as either eval loss or eval F₁ stalls, rather than only watching F₁
    callbacks        =[LossAndF1EarlyStop(loss_patience=1, f1_patience=2)]
)

# ─── 7) Run & save ─────────────────────────────────────────────────────────────
trainer.train()
trainer.save_model("./best_replay_model_dual")  # checkpoint files go here

wandb.finish()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./danish-bert-adapted/danish-bert-adapted and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.6862,0.658643,0.569892,0.478114
2,0.5971,0.60763,0.666667,0.633439
3,0.4837,0.569128,0.72043,0.714858
4,0.367,0.562416,0.741935,0.739496
5,0.2871,0.592117,0.731183,0.729179
6,0.2087,0.642314,0.741935,0.736792


0,1
eval/accuracy,▁▅▇███
eval/f1,▁▅▇███
eval/loss,█▄▁▁▃▇
eval/runtime,▂▁▆▆▇█
eval/samples_per_second,▇█▃▃▂▁
eval/steps_per_second,▇█▃▃▂▁
train/epoch,▁▁▂▂▄▄▅▅▇▇███
train/global_step,▁▁▂▂▄▄▅▅▇▇███
train/grad_norm,▂▅▃▁▁█
train/learning_rate,█▇▅▄▂▁

0,1
eval/accuracy,0.74194
eval/f1,0.73679
eval/loss,0.64231
eval/runtime,3.4192
eval/samples_per_second,27.2
eval/steps_per_second,4.679
total_flos,580949210234880.0
train/epoch,6.0
train/global_step,372.0
train/grad_norm,35.12283
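
To see why this run halted after epoch 6 of the 8 configured, the stopping rule of `LossAndF1EarlyStop` can be replayed offline on the validation numbers in the table above. The sketch below mirrors the callback's counters in plain Python; `replay_dual_early_stop` is a hypothetical helper, not part of the notebook or of `transformers`.

```python
def replay_dual_early_stop(losses, f1s, loss_patience=1, f1_patience=2):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best_loss, loss_wait = float("inf"), 0
    best_f1, f1_wait = 0.0, 0
    for epoch, (loss, f1) in enumerate(zip(losses, f1s), start=1):
        # Loss counter: reset on improvement, otherwise count a stalled epoch
        if loss < best_loss:
            best_loss, loss_wait = loss, 0
        else:
            loss_wait += 1
        # F1 counter: same logic, higher is better
        if f1 > best_f1:
            best_f1, f1_wait = f1, 0
        else:
            f1_wait += 1
        if loss_wait > loss_patience or f1_wait > f1_patience:
            return epoch
    return None


# Validation loss and F1 per epoch, copied from the training table above
val_loss = [0.658643, 0.607630, 0.569128, 0.562416, 0.592117, 0.642314]
val_f1   = [0.478114, 0.633439, 0.714858, 0.739496, 0.729179, 0.736792]

stop_epoch = replay_dual_early_stop(val_loss, val_f1)
best_f1_epoch = max(range(len(val_f1)), key=val_f1.__getitem__) + 1
print(stop_epoch, best_f1_epoch)  # → 6 4
```

Validation loss last improved at epoch 4, so the loss counter exceeds its patience of 1 at epoch 6. Because `load_best_model_at_end=True` is combined with `metric_for_best_model="eval_f1"`, the model that is kept should be the epoch-4 checkpoint, which is also where validation loss was lowest.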


In [None]:
# ─── 8) Evaluate on hf_test ────────────────────────────────────────────────────
import numpy as np
import pandas as pd
import os
from sklearn.metrics import classification_report, confusion_matrix
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# 8a) Load the dual-stopped model
model = AutoModelForSequenceClassification.from_pretrained("./best_replay_model_dual")

# 8b) Prepare the Trainer for evaluation
test_args = TrainingArguments(
    output_dir                ="./results_test_replay_dual",
    per_device_eval_batch_size=16,
    report_to=[]
)
test_trainer = Trainer(
    model           = model,
    args            = test_args,
    compute_metrics = compute_metrics  # as defined earlier
)

# 8c) Run prediction → returns PredictionOutput
test_output = test_trainer.predict(hf_test)
print("→ Final Test metrics:", test_output.metrics)

# ensure output directory exists
os.makedirs(test_args.output_dir, exist_ok=True)

# 8d) Save the JSON of all test metrics
test_trainer.save_metrics("test", test_output.metrics)
print(f"Saved all_results.json in {test_args.output_dir}")

# 8e) Save raw predictions
preds = test_output.predictions
np.save(os.path.join(test_args.output_dir, "test_predictions.npy"), preds)
print(f"Saved test_predictions.npy in {test_args.output_dir}")

# 8f) Classification report → CSV
y_true = df_test["label_binary"].to_numpy()
y_pred = np.argmax(preds, axis=-1)
clf_dict = classification_report(y_true, y_pred, output_dict=True)
print(classification_report(y_true, y_pred))

pd.DataFrame(clf_dict).T.to_csv(
    os.path.join(test_args.output_dir, "classification_report_replay_dual.csv")
)
print("Saved classification_report_replay_dual.csv")

# 8g) Confusion matrix → CSV
cm = confusion_matrix(y_true, y_pred)
print("Confusion matrix:\n", cm)
cm_df = pd.DataFrame(
    cm,
    index=["true_0", "true_1"],
    columns=["pred_0", "pred_1"]
)
cm_df.to_csv(
    os.path.join(test_args.output_dir, "confusion_matrix_replay_dual.csv")
)
print("Saved confusion_matrix_replay_dual.csv")


→ Final Test metrics: {'test_loss': 0.5701903700828552, 'test_model_preparation_time': 0.0055, 'test_accuracy': 0.6742424242424242, 'test_f1': 0.6699808128379556, 'test_runtime': 4.5924, 'test_samples_per_second': 28.743, 'test_steps_per_second': 1.96}
Saved all_results.json in ./results_test_replay_dual
Saved test_predictions.npy in ./results_test_replay_dual
              precision    recall  f1-score   support

           0       0.74      0.68      0.71        77
           1       0.60      0.67      0.63        55

    accuracy                           0.67       132
   macro avg       0.67      0.67      0.67       132
weighted avg       0.68      0.67      0.68       132

Saved classification_report_replay_dual.csv
Confusion matrix:
 [[52 25]
 [18 37]]
Saved confusion_matrix_replay_dual.csv


#### 1.2 PRETRAINED ALL TESTING: RETRAIN

In [None]:
# Checking the checkpoint
model_checkpoint

'./danish-bert-adapted/danish-bert-adapted'

In [None]:
from transformers import (
    Trainer,
    TrainingArguments,
    TrainerCallback,
    AutoConfig,
    AutoModelForSequenceClassification
)
import numpy as np
import pandas as pd
from datasets import concatenate_datasets
from sklearn.metrics import accuracy_score, f1_score

# 1) Combine train + val
hf_trainval = concatenate_datasets([hf_train, hf_val])

# 2) Use the same best_hp from sweep‐selection step
print("Re-training with hyperparameters:", best_hp)

# 3) Initialize a fresh model from the original pretrained_ALL checkpoint (defined earlier)
config = AutoConfig.from_pretrained(model_checkpoint, num_labels=2)
model  = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, config=config)

# 4) Define a training‐only metrics function
def compute_metrics_train(p):
    preds = np.argmax(p.predictions, axis=1)
    return {
        "train_accuracy": accuracy_score(p.label_ids, preds),
        "train_f1":       f1_score(p.label_ids, preds, average="weighted"),
        # Note: train_loss is not computed here; Trainer.evaluate(metric_key_prefix="train") adds it to the returned metrics
    }

# 5) callback to log per‐epoch training metrics
class TrainLoggingCallback(TrainerCallback):
    def __init__(self):
        self.history = []
    def on_epoch_end(self, args, state, control, **kwargs):
        m = self.trainer.evaluate(self.trainer.train_dataset, metric_key_prefix="train")
        self.history.append(dict(epoch=state.epoch, **m))
        print(
            f"Epoch {int(state.epoch)} | "
            f"loss {m['train_loss']:.4f} "
            f"acc  {m['train_accuracy']:.4f} "
            f"f1   {m['train_f1']:.4f}"
        )

cb = TrainLoggingCallback()

# 6) Set up and run Trainer without any eval_dataset
trainer = Trainer(
    model           = model,
    args            = TrainingArguments(
        output_dir                 = "./best_final_dual_retrained_model",
        learning_rate              = best_hp["learning_rate"],
        per_device_train_batch_size= best_hp["per_device_train_batch_size"],
        num_train_epochs           = best_hp["num_train_epochs"],
        weight_decay               = best_hp["weight_decay_hyperparam"],
        warmup_ratio               = best_hp["warmup_ratio"],
        #lr_scheduler_type          = best_hp.get("lr_scheduler_type","linear"), # not used atm

        eval_strategy              = "no",    # skip validation
        logging_strategy           = "no",
        save_strategy              = "epoch",
        report_to                  = []       # turn off W&B
    ),
    train_dataset   = hf_trainval,
    compute_metrics = compute_metrics_train,
    callbacks       = [cb]
)
cb.trainer = trainer

# Note: the progress table shown while this runs has a "Validation Loss" column. That is the
# Trainer's generic header; the callback evaluates on the training set, so the values are training loss.
trainer.train()
trainer.save_model("./best_final_dual_retrained_model")  # writes config.json and the model weights

# 7) Inspect & save the training history
train_df = pd.DataFrame(cb.history).set_index("epoch")
train_df.to_csv("train_history_dual_retrained.csv")
print(train_df)


  block_group = [InMemoryTable(cls._concat_blocks(list(block_group), axis=axis))]
  table = cls._concat_blocks(blocks, axis=0)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./danish-bert-adapted/danish-bert-adapted and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Re-training with hyperparameters: {'learning_rate': 1.0390934869072987e-05, 'per_device_train_batch_size': 6, 'num_train_epochs': 8, 'weight_decay_hyperparam': 0.07816342480365049, 'warmup_ratio': 0.1}


Step,Training Loss,Validation Loss,Accuracy,F1
77,No log,0.588822,0.70282,0.673504
154,No log,0.439967,0.856833,0.856974
231,No log,0.318923,0.882863,0.882979
308,No log,0.195731,0.94577,0.945796
385,No log,0.120005,0.9718,0.971795
462,No log,0.081491,0.986985,0.986989
539,No log,0.056443,0.991323,0.991323
616,No log,0.053936,0.989154,0.989156


Epoch 1 | loss 0.5888 acc  0.7028 f1   0.6735
Epoch 2 | loss 0.4400 acc  0.8568 f1   0.8570
Epoch 3 | loss 0.3189 acc  0.8829 f1   0.8830
Epoch 4 | loss 0.1957 acc  0.9458 f1   0.9458
Epoch 5 | loss 0.1200 acc  0.9718 f1   0.9718
Epoch 6 | loss 0.0815 acc  0.9870 f1   0.9870
Epoch 7 | loss 0.0564 acc  0.9913 f1   0.9913
Epoch 8 | loss 0.0539 acc  0.9892 f1   0.9892
       train_accuracy  train_f1  train_loss
epoch                                      
1.0          0.702820  0.673504    0.588822
2.0          0.856833  0.856974    0.439967
3.0          0.882863  0.882979    0.318923
4.0          0.945770  0.945796    0.195731
5.0          0.971800  0.971795    0.120005
6.0          0.986985  0.986989    0.081491
7.0          0.991323  0.991323    0.056443
8.0          0.989154  0.989156    0.053936


In [None]:
# ─── 8) Evaluate Retrained Dual-Stopped Model on hf_test ─────────────────────────
import os
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# 8a) Load the retrained checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    "./best_final_dual_retrained_model"
)

# 8b) Prepare the Trainer for evaluation
test_args = TrainingArguments(
    output_dir                ="./results_test_retrained_dual",
    per_device_eval_batch_size=16,
    report_to=[]
)
test_trainer = Trainer(
    model           = model,
    args            = test_args,
    compute_metrics = compute_metrics    # original compute_metrics fn
)

# 8c) Run prediction → returns PredictionOutput
test_output = test_trainer.predict(hf_test)
print("→ Final Retrained Test metrics:", test_output.metrics)

# ensure output directory exists
os.makedirs(test_args.output_dir, exist_ok=True)

# 8d) Save the JSON of all test metrics
test_trainer.save_metrics("test", test_output.metrics)
print(f"Saved all_results.json in {test_args.output_dir}")

# 8e) Save raw predictions
preds = test_output.predictions
np.save(os.path.join(test_args.output_dir, "test_predictions.npy"), preds)
print(f"Saved test_predictions.npy in {test_args.output_dir}")

# 8f) Classification report → CSV
y_true = df_test["label_binary"].to_numpy()
y_pred = np.argmax(preds, axis=-1)
print(classification_report(y_true, y_pred))
clf_dict = classification_report(y_true, y_pred, output_dict=True)
pd.DataFrame(clf_dict).T.to_csv(
    os.path.join(test_args.output_dir, "classification_report_retrained_dual.csv")
)
print("Saved classification_report_retrained_dual.csv")

# 8g) Confusion matrix → CSV
cm = confusion_matrix(y_true, y_pred)
print("Confusion matrix:\n", cm)
cm_df = pd.DataFrame(
    cm,
    index  =["true_0", "true_1"],
    columns=["pred_0", "pred_1"]
)
cm_df.to_csv(
    os.path.join(test_args.output_dir, "confusion_matrix_retrained_dual.csv")
)
print("Saved confusion_matrix_retrained_dual.csv")


→ Final Retrained Test metrics: {'test_loss': 0.5554364323616028, 'test_model_preparation_time': 0.0029, 'test_accuracy': 0.7424242424242424, 'test_f1': 0.738583410997204, 'test_runtime': 4.4478, 'test_samples_per_second': 29.677, 'test_steps_per_second': 2.023}
              precision    recall  f1-score   support

           0       0.80      0.74      0.77        77
           1       0.67      0.75      0.71        55

    accuracy                           0.74       132
   macro avg       0.74      0.74      0.74       132
weighted avg       0.75      0.74      0.74       132

Saved classification_report_retrained_dual.csv
Confusion matrix:
 [[57 20]
 [14 41]]
Saved confusion_matrix_retrained_dual.csv
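
The reported `test_accuracy` and macro `test_f1` can be re-derived by hand from the confusion matrix printed above. The sketch below does this in plain Python as a sanity check; the helper name `per_class_scores` is illustrative and not part of the notebook.

```python
# Confusion matrix copied from the output above: rows = true class, columns = predicted class.
cm = [[57, 20],
      [14, 41]]

def per_class_scores(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    n = len(cm)
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(n)) - tp   # predicted k, but true class differs
    fn = sum(cm[k]) - tp                        # true class k, but predicted otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / total
scores = [per_class_scores(cm, k) for k in range(len(cm))]
macro_f1 = sum(s[2] for s in scores) / len(scores)

print(f"accuracy = {accuracy:.4f}")
print(f"macro F1 = {macro_f1:.4f}")
```

Both values should agree with `test_accuracy` and `test_f1` in the metrics dictionary above, which is expected because `compute_metrics` uses `average="macro"`.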


### 2 PLDQA

In [None]:
# Load the tokenizer for the PLDQA-adapted checkpoint
model_checkpoint = "./danish-bert-adapted-pldqa/danish-bert-adapted-pldqa"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize(batch):
    return tokenizer(batch["input_text"],
                     truncation=True,
                     padding="max_length",
                     max_length=512)


all_cols = ['input_text','label_binary']
hf_train = Dataset.from_pandas(df_train[all_cols].reset_index(drop=True))
hf_val   = Dataset.from_pandas(df_val  [all_cols].reset_index(drop=True))
hf_test  = Dataset.from_pandas(df_test [all_cols].reset_index(drop=True))

# map, rename, and rebind each split
hf_train = (
    hf_train
    .map(tokenize, batched=True)
    .rename_column("label_binary", "labels")
)
hf_val = (
    hf_val
    .map(tokenize, batched=True)
    .rename_column("label_binary", "labels")
)
hf_test = (
    hf_test
    .map(tokenize, batched=True)
    .rename_column("label_binary", "labels")
)

# now explicitly set the columns we want returned
hf_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
hf_val  .set_format("torch", columns=["input_ids", "attention_mask", "labels"])
hf_test .set_format("torch", columns=["input_ids", "attention_mask", "labels"])


# ─── 7) Metrics & callback ────────────────────────
accuracy = evaluate.load("accuracy")
f1       = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1":       f1.compute(predictions=preds, references=labels, average="macro")["f1"]
    }

from transformers import TrainerCallback
class TrainMetricsCallback(TrainerCallback):
    def __init__(self, trainer=None):
        self.trainer = trainer

    def on_epoch_end(self, args, state, control, **kwargs):
        # We use predict() here on the *training* set purely for logging.
        # WARNING: predict() prefixes its metrics with "test_" (hence "test_loss" below), and
        # Hugging Face's WandB integration will log these under "test/…" even though this is
        # training-data performance, not true test-set metrics.
        if not self.trainer:
            return
        pred = self.trainer.predict(self.trainer.train_dataset)
        p = np.argmax(pred.predictions, axis=-1)
        l = pred.label_ids
        wandb.log({
            "train/accuracy": accuracy.compute(predictions=p, references=l)["accuracy"],
            "train/f1":       f1.compute(predictions=p, references=l, average="macro")["f1"],
            "train/loss":     pred.metrics["test_loss"],
            "epoch":          state.epoch
        })

Map:   0%|          | 0/368 [00:00<?, ? examples/s]

Map:   0%|          | 0/93 [00:00<?, ? examples/s]

Map:   0%|          | 0/132 [00:00<?, ? examples/s]

#### 2.1 PLDQA TESTING: LOAD IN

In [None]:
# ----- Identifying the best dual-criterion run
import wandb
import numpy as np

# 1) Login & point at sweep
wandb.login()
api      = wandb.Api()
project  = "pernillebrams/danish-bert-answer-pldqa-pretrained-pldqa-binary"
sweep_id = "8ouy284v"
sweep    = api.sweep(f"{project}/{sweep_id}")

candidates = []
for run in sweep.runs:
    summary = run.summary
    # named run_f1 so the module-level `f1` metric object is not overwritten
    run_f1 = summary.get("eval/f1", None)
    if run_f1 is None:
        continue  # no F1 logged

    # 2) pull down the eval/loss history
    hist = run.history(keys=["eval/loss"], pandas=False)  # a list of dicts
    losses = [row["eval/loss"] for row in hist if row.get("eval/loss") is not None]

    if len(losses) < 2:
        continue  # not enough points to establish a trend

    # 3) checking if loss is trending downward
    #    here we simply compare the first vs last logged loss
    if losses[-1] < losses[0]:
        candidates.append((run, run_f1, losses[0], losses[-1]))

# 4) picking the winner with the highest eval/f1
if not candidates:
    # fail loudly: otherwise a stale best_run/best_hp left over from an earlier cell could be reused
    raise RuntimeError("No runs found with decreasing eval/loss.")

best_run, best_f1, start_loss, end_loss = max(candidates, key=lambda x: x[1])
print(f"Selected run: {best_run.id}, sweep name: {sweep.name}, run name: {best_run.name}")
print(f"  eval/f1 = {best_f1:.4f}")
print(f"  eval/loss: {start_loss:.4f} → {end_loss:.4f}  (downward trend)")

# 5) extracting the hyperparameters
hp_keys = [
    "learning_rate",
    "per_device_train_batch_size",
    "num_train_epochs",
    "weight_decay_hyperparam",
    "warmup_ratio",
    #"lr_scheduler_type",
    #"dropout"
]

print("\nHyperparameters:")
for hp in hp_keys:
    print(f"  {hp}: {best_run.config.get(hp)}")

best_hp = {k: best_run.config[k] for k in hp_keys}
print("Best hyperparameters extracted:", best_hp)

Selected run: prw6he15, sweep name: 8ouy284v, run name: glorious-sweep-3
  eval/f1 = 0.7843
  eval/loss: 0.6391 → 0.5396  (downward trend)

Hyperparameters:
  learning_rate: 1.2489078867398649e-05
  per_device_train_batch_size: 6
  num_train_epochs: 7
  weight_decay_hyperparam: 0.14657288285157408
  warmup_ratio: 0
Best hyperparameters extracted: {'learning_rate': 1.2489078867398649e-05, 'per_device_train_batch_size': 6, 'num_train_epochs': 7, 'weight_decay_hyperparam': 0.14657288285157408, 'warmup_ratio': 0}
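
The selection rule used above (keep only runs whose last logged `eval/loss` is below the first, then take the highest `eval/f1`) can be checked offline without the W&B API. The following is a minimal sketch under that reading; the function `pick_best_run`, the dictionary keys, and the toy values are all made up for illustration and are not taken from the sweep.

```python
def pick_best_run(runs):
    """Return the run with the highest F1 among runs whose loss decreased.

    `runs` is a list of dicts with hypothetical keys "id", "f1" and "losses"
    (the logged eval/loss values in order). Runs without an F1 value or with
    fewer than two loss points are skipped, mirroring the filters in the cell above.
    """
    candidates = [
        r for r in runs
        if r.get("f1") is not None
        and len(r["losses"]) >= 2
        and r["losses"][-1] < r["losses"][0]
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["f1"])

# Made-up runs for illustration only.
toy_runs = [
    {"id": "a", "f1": 0.80, "losses": [0.60, 0.65]},  # best F1, but loss went up
    {"id": "b", "f1": 0.78, "losses": [0.64, 0.54]},  # loss went down
    {"id": "c", "f1": 0.70, "losses": [0.69, 0.50]},  # loss went down, lower F1
    {"id": "d", "f1": None, "losses": [0.70, 0.40]},  # no F1 logged
    {"id": "e", "f1": 0.90, "losses": [0.50]},        # too few loss points
]

best = pick_best_run(toy_runs)
print(best["id"])
```

Run "a" has the best F1 but is rejected because its loss increased, so the rule deliberately trades some F1 for a run whose validation loss did not get worse.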


In [None]:
# -- Rerunning the exact sweep run, but with dual-monitoring early stopping
# ─── 2) Metrics fn ─────────────────────────────────────────────────────────────
accuracy = evaluate.load("accuracy")
f1       = evaluate.load("f1")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1":       f1.compute(predictions=preds, references=labels, average="macro")["f1"]
    }

# ─── 3) Best hyperparameters (retrieved from above dynamically) ────────────────────────────────────────────────────
# In this: best_hp

# ─── 4) WandB init (for logging) ──────────────────────────────────────
wandb.init(
    project="danish-bert-answer-pldqa-pretrained-pldqa-binary",
    entity="pernillebrams",
    name="manual-replay-best-pretrained-pldqa-dual",
    config=best_hp,
    reinit=True
)

# ─── 5) Build model & TrainingArguments ────────────────────────────────────────
# Load only the config (no need to instantiate the full model just to read it)
from transformers import AutoConfig
config = AutoConfig.from_pretrained(model_checkpoint, num_labels=2)
#config.hidden_dropout_prob          = best_hp["dropout"]
#config.attention_probs_dropout_prob = best_hp["dropout"]

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, config=config)

from transformers import TrainerCallback

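# Dual-criterion early stopping: monitor validation loss and validation F1 together
class LossAndF1EarlyStop(TrainerCallback):
    """Request a stop once eval loss has failed to improve for more than `loss_patience`
    consecutive evaluations, or eval F1 for more than `f1_patience` evaluations."""
    def __init__(self, loss_patience=1, f1_patience=2):
        self.loss_patience, self.f1_patience = loss_patience, f1_patience
        self.best_loss, self.loss_wait = float('inf'), 0
        self.best_f1, self.f1_wait   = 0, 0

    def on_evaluate(self, args, state, control, metrics=None, **_):
        if not metrics or "eval_loss" not in metrics or "eval_f1" not in metrics:
            return control  # nothing to track for this evaluation call
        loss = metrics["eval_loss"]
        f1   = metrics["eval_f1"]
        # track loss (lower is better)
        if loss < self.best_loss:
            self.best_loss, self.loss_wait = loss, 0
        else:
            self.loss_wait += 1
        # track f1 (higher is better)
        if f1 > self.best_f1:
            self.best_f1, self.f1_wait = f1, 0
        else:
            self.f1_wait += 1
        # stop if either has stalled for longer than its patience
        if self.loss_wait > self.loss_patience or self.f1_wait > self.f1_patience:
            control.should_training_stop = True
        return control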


args = TrainingArguments(
    output_dir                ="./best_replay_model_pldqa_dual",
    learning_rate             =best_hp["learning_rate"],
    per_device_train_batch_size=best_hp["per_device_train_batch_size"],
    per_device_eval_batch_size =best_hp["per_device_train_batch_size"],
    num_train_epochs          =best_hp["num_train_epochs"],
    weight_decay              =best_hp["weight_decay_hyperparam"],
    warmup_ratio              =best_hp["warmup_ratio"],
    #lr_scheduler_type         =best_hp["lr_scheduler_type"], # not used atm

    eval_strategy             ="epoch",
    save_strategy             ="epoch",
    logging_strategy          ="epoch",
    load_best_model_at_end    =True,
    metric_for_best_model     ="eval_f1",
    greater_is_better         =True,

    report_to=["wandb"]
)

# ─── 6) Trainer & EarlyStopping (exactly like sweep) ───────────────────────────
trainer = Trainer(
    model           =model,
    args            =args,
    train_dataset   =hf_train,
    eval_dataset    =hf_val,
    compute_metrics =compute_metrics,
    #callbacks       =[EarlyStoppingCallback(early_stopping_patience=2)]
    # halt as soon as either metric stalls, instead of watching F1 only, so training stops before clear overfitting
    callbacks       =[LossAndF1EarlyStop(loss_patience=1, f1_patience=2)]
)

# ─── 7) Run & save ─────────────────────────────────────────────────────────────
trainer.train()
trainer.save_model("./best_replay_model_pldqa_dual")  # checkpoint files go here

wandb.finish()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./danish-bert-adapted-pldqa/danish-bert-adapted-pldqa and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.677,0.639316,0.634409,0.581746
2,0.5505,0.613845,0.677419,0.652293
3,0.4301,0.567406,0.698925,0.692925
4,0.3076,0.573999,0.709677,0.706281
5,0.2276,0.604547,0.72043,0.717787


Run history:
eval/accuracy,▁▅▆▇█
eval/f1,▁▅▇▇█
eval/loss,█▆▁▂▅
eval/runtime,▁█▄▄▄
eval/samples_per_second,█▁▄▅▅
eval/steps_per_second,█▁▄▅▅
train/epoch,▁▁▃▃▅▅▆▆███
train/global_step,▁▁▃▃▅▅▆▆███
train/grad_norm,▄█▄▂▁
train/learning_rate,█▆▄▃▁

Run summary:
eval/accuracy,0.72043
eval/f1,0.71779
eval/loss,0.60455
eval/runtime,3.3621
eval/samples_per_second,27.661
eval/steps_per_second,4.759
total_flos,484124341862400.0
train/epoch,5.0
train/global_step,310.0
train/grad_norm,1.63552
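
The run above ended after epoch 5 although `num_train_epochs` was 7. Replaying the patience logic of `LossAndF1EarlyStop` on the validation values from the epoch table shows why. This is a standalone sketch of the same counting rule, not the callback itself; `first_stop_epoch` is an illustrative name.

```python
def first_stop_epoch(losses, f1s, loss_patience=1, f1_patience=2):
    """Return the 1-based epoch at which the dual rule requests a stop, or None."""
    best_loss, loss_wait = float("inf"), 0
    best_f1, f1_wait = 0.0, 0
    for epoch, (loss, f1) in enumerate(zip(losses, f1s), start=1):
        # loss counter resets on any new minimum
        if loss < best_loss:
            best_loss, loss_wait = loss, 0
        else:
            loss_wait += 1
        # F1 counter resets on any new maximum
        if f1 > best_f1:
            best_f1, f1_wait = f1, 0
        else:
            f1_wait += 1
        if loss_wait > loss_patience or f1_wait > f1_patience:
            return epoch
    return None

# Validation loss and F1 per epoch, copied from the table above.
val_loss = [0.639316, 0.613845, 0.567406, 0.573999, 0.604547]
val_f1   = [0.581746, 0.652293, 0.692925, 0.706281, 0.717787]

print(first_stop_epoch(val_loss, val_f1))
```

Validation loss reaches its minimum at epoch 3 and is worse at epochs 4 and 5, so the loss counter exceeds `loss_patience=1` at epoch 5 even though F1 is still rising. Because `metric_for_best_model` is `eval_f1`, the checkpoint restored at the end is the one with the highest validation F1, not the one with the lowest loss.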


In [None]:
# ─── 8) Evaluate on hf_test ────────────────────────────────────────────────────
import numpy as np
import pandas as pd
import os
from sklearn.metrics import classification_report, confusion_matrix
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# 8a) Load the dual-stopped model
model = AutoModelForSequenceClassification.from_pretrained("./best_replay_model_pldqa_dual")

# 8b) Prepare the Trainer for evaluation
test_args = TrainingArguments(
    output_dir                ="./results_test_replay_pldqa_dual",
    per_device_eval_batch_size=16,
    report_to=[]
)
test_trainer = Trainer(
    model           = model,
    args            = test_args,
    compute_metrics = compute_metrics  # as defined earlier
)

# 8c) Run prediction → returns PredictionOutput
test_output = test_trainer.predict(hf_test)
print("→ Final Test metrics:", test_output.metrics)

# ensure output directory exists
os.makedirs(test_args.output_dir, exist_ok=True)

# 8d) Save the JSON of all test metrics
test_trainer.save_metrics("test", test_output.metrics)
print(f"Saved all_results.json in {test_args.output_dir}")

# 8e) Save raw predictions
preds = test_output.predictions
np.save(os.path.join(test_args.output_dir, "test_predictions.npy"), preds)
print(f"Saved test_predictions.npy in {test_args.output_dir}")

# 8f) Classification report → CSV
y_true = df_test["label_binary"].to_numpy()
y_pred = np.argmax(preds, axis=-1)
clf_dict = classification_report(y_true, y_pred, output_dict=True)
print(classification_report(y_true, y_pred))

pd.DataFrame(clf_dict).T.to_csv(
    os.path.join(test_args.output_dir, "classification_report_replay_dual.csv")
)
print("Saved classification_report_replay_dual.csv")

# 8g) Confusion matrix → CSV
cm = confusion_matrix(y_true, y_pred)
print("Confusion matrix:\n", cm)
cm_df = pd.DataFrame(
    cm,
    index=["true_0", "true_1"],
    columns=["pred_0", "pred_1"]
)
cm_df.to_csv(
    os.path.join(test_args.output_dir, "confusion_matrix_replay_dual.csv")
)
print("Saved confusion_matrix_replay_dual.csv")


→ Final Test metrics: {'test_loss': 0.6224080920219421, 'test_model_preparation_time': 0.0046, 'test_accuracy': 0.6818181818181818, 'test_f1': 0.6799815285153543, 'test_runtime': 4.1765, 'test_samples_per_second': 31.605, 'test_steps_per_second': 2.155}
Saved all_results.json in ./results_test_replay_pldqa_dual
Saved test_predictions.npy in ./results_test_replay_pldqa_dual
              precision    recall  f1-score   support

           0       0.77      0.65      0.70        77
           1       0.60      0.73      0.66        55

    accuracy                           0.68       132
   macro avg       0.68      0.69      0.68       132
weighted avg       0.70      0.68      0.68       132

Saved classification_report_replay_dual.csv
Confusion matrix:
 [[50 27]
 [15 40]]
Saved confusion_matrix_replay_dual.csv


#### 2.2 PLDQA TESTING: RETRAIN

In [None]:
# Checking the checkpoint
model_checkpoint

'./danish-bert-adapted-pldqa/danish-bert-adapted-pldqa'

In [None]:
from transformers import (
    Trainer,
    TrainingArguments,
    TrainerCallback,
    AutoConfig,
    AutoModelForSequenceClassification
)
import numpy as np
import pandas as pd
from datasets import concatenate_datasets
from sklearn.metrics import accuracy_score, f1_score

# 1) Combine train + val
hf_trainval = concatenate_datasets([hf_train, hf_val])

# 2) Use the same best_hp from sweep‐selection step
# e.g. best_hp = { "learning_rate": ..., "per_device_train_batch_size": ..., ... }
print("Re-training with hyperparameters:", best_hp)

# 3) Initialize a fresh model from the original PLDQA-adapted checkpoint (model_checkpoint, defined earlier)
config = AutoConfig.from_pretrained(model_checkpoint, num_labels=2)
model  = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, config=config)

# 4) Define a training‐only metrics function
#    (note: F1 is support-weighted here, whereas compute_metrics uses macro-F1)
def compute_metrics_train(p):
    preds = np.argmax(p.predictions, axis=1)
    return {
        "train_accuracy": accuracy_score(p.label_ids, preds),
        "train_f1":       f1_score(p.label_ids, preds, average="weighted"),
        # Note: the loss is added by trainer.evaluate() as "train_loss" (metric_key_prefix="train")
    }

# 5) Callback to log per‐epoch training metrics
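class TrainLoggingCallback(TrainerCallback):
    def __init__(self):
        self.history = []
        self.trainer = None  # attached after the Trainer is built (cb.trainer = trainer)
    def on_epoch_end(self, args, state, control, **kwargs):
        # evaluate on the training set itself; metric keys come back prefixed with "train_"
        m = self.trainer.evaluate(self.trainer.train_dataset, metric_key_prefix="train")
        self.history.append(dict(epoch=state.epoch, **m))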
        print(
            f"Epoch {int(state.epoch)} | "
            f"loss {m['train_loss']:.4f} "
            f"acc  {m['train_accuracy']:.4f} "
            f"f1   {m['train_f1']:.4f}"
        )

cb = TrainLoggingCallback()

# 6) Set up and run Trainer without any eval_dataset
trainer = Trainer(
    model           = model,
    args            = TrainingArguments(
        output_dir                 = "./best_final_dual_retrained_pldqa_model",
        learning_rate              = best_hp["learning_rate"],
        per_device_train_batch_size= best_hp["per_device_train_batch_size"],
        num_train_epochs           = best_hp["num_train_epochs"],
        weight_decay               = best_hp["weight_decay_hyperparam"],
        warmup_ratio               = best_hp["warmup_ratio"],
        #lr_scheduler_type          = best_hp.get("lr_scheduler_type","linear"), # not used atm

        eval_strategy              = "no",    # skip validation
        logging_strategy           = "no",
        save_strategy              = "epoch",
        report_to                  = []       # turn off W&B
    ),
    train_dataset   = hf_trainval,
    compute_metrics = compute_metrics_train,
    callbacks       = [cb]
)
cb.trainer = trainer

# Note: the progress table printed during training has a "Validation Loss" column.
# That header is just the Trainer's generic table layout; the values shown here are
# computed on the training data (hf_trainval), not on a held-out set.
trainer.train()
trainer.save_model("./best_final_dual_retrained_pldqa_model")  # writes config.json, pytorch_model.bin, etc.

# 7) Inspect & save the training history
train_df = pd.DataFrame(cb.history).set_index("epoch")
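# PLDQA-specific filename so the history saved for the Pretrained-ALL model is not overwritten
train_df.to_csv("train_history_dual_retrained_pldqa.csv")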
print(train_df)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./danish-bert-adapted-pldqa/danish-bert-adapted-pldqa and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Re-training with hyperparameters: {'learning_rate': 1.2489078867398649e-05, 'per_device_train_batch_size': 6, 'num_train_epochs': 7, 'weight_decay_hyperparam': 0.14657288285157408, 'warmup_ratio': 0}


Step,Training Loss,Validation Loss,Accuracy,F1
77,No log,0.532581,0.739696,0.715267
154,No log,0.376841,0.876356,0.876396
231,No log,0.230975,0.941432,0.941511
308,No log,0.136874,0.965293,0.965315
385,No log,0.074422,0.986985,0.98698
462,No log,0.051385,0.991323,0.991323
539,No log,0.041566,0.991323,0.991323


Epoch 1 | loss 0.5326 acc  0.7397 f1   0.7153
Epoch 2 | loss 0.3768 acc  0.8764 f1   0.8764
Epoch 3 | loss 0.2310 acc  0.9414 f1   0.9415
Epoch 4 | loss 0.1369 acc  0.9653 f1   0.9653
Epoch 5 | loss 0.0744 acc  0.9870 f1   0.9870
Epoch 6 | loss 0.0514 acc  0.9913 f1   0.9913
Epoch 7 | loss 0.0416 acc  0.9913 f1   0.9913
       train_accuracy  train_f1  train_loss
epoch                                      
1.0          0.739696  0.715267    0.532581
2.0          0.876356  0.876396    0.376841
3.0          0.941432  0.941511    0.230975
4.0          0.965293  0.965315    0.136874
5.0          0.986985  0.986980    0.074422
6.0          0.991323  0.991323    0.051385
7.0          0.991323  0.991323    0.041566
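
The training-history F1 above comes from `compute_metrics_train`, which uses `average="weighted"`, while validation and test F1 come from `compute_metrics`, which uses `average="macro"`. The two are close but not identical when classes are imbalanced. The sketch below illustrates the difference on the PLDQA replay test confusion matrix from section 2.1; `f1_averages` is an illustrative helper, not part of the notebook.

```python
def f1_averages(cm):
    """Return (macro_f1, weighted_f1) for a square confusion matrix (rows = true class)."""
    n = len(cm)
    f1s, supports = [], []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp   # predicted k, true class differs
        fn = sum(cm[k]) - tp                        # true class k, predicted otherwise
        f1s.append(2 * tp / (2 * tp + fp + fn))
        supports.append(sum(cm[k]))                 # number of true examples of class k
    macro = sum(f1s) / n                            # every class counts equally
    weighted = sum(f * s for f, s in zip(f1s, supports)) / sum(supports)
    return macro, weighted

# Confusion matrix of the dual-stopped PLDQA replay model on the test set.
macro, weighted = f1_averages([[50, 27], [15, 40]])
print(f"macro F1    = {macro:.4f}")
print(f"weighted F1 = {weighted:.4f}")
```

The macro value matches the reported `test_f1` for that model, while the weighted value is slightly higher because the larger class 0 also has the higher per-class F1. Training-history F1 and test F1 should therefore be compared with this difference in mind.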


In [None]:
# ─── 8) Evaluate Retrained Dual-Stopped Model on hf_test ─────────────────────────
import os
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# 8a) Load the retrained checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    "./best_final_dual_retrained_pldqa_model"
)

# 8b) Prepare the Trainer for evaluation
test_args = TrainingArguments(
    output_dir                ="./results_test_retrained_dual_pldqa",
    per_device_eval_batch_size=16,
    report_to=[]
)
test_trainer = Trainer(
    model           = model,
    args            = test_args,
    compute_metrics = compute_metrics    # the original compute_metrics fn
)

# 8c) Run prediction → returns PredictionOutput
test_output = test_trainer.predict(hf_test)
print("→ Final Retrained Test metrics:", test_output.metrics)

# ensure output directory exists
os.makedirs(test_args.output_dir, exist_ok=True)

# 8d) Save the JSON of all test metrics
test_trainer.save_metrics("test", test_output.metrics)
print(f"Saved all_results.json in {test_args.output_dir}")

# 8e) Save raw predictions
preds = test_output.predictions
np.save(os.path.join(test_args.output_dir, "test_predictions.npy"), preds)
print(f"Saved test_predictions.npy in {test_args.output_dir}")

# 8f) Classification report → CSV
y_true = df_test["label_binary"].to_numpy()
y_pred = np.argmax(preds, axis=-1)
print(classification_report(y_true, y_pred))
clf_dict = classification_report(y_true, y_pred, output_dict=True)
pd.DataFrame(clf_dict).T.to_csv(
    os.path.join(test_args.output_dir, "classification_report_retrained_dual.csv")
)
print("Saved classification_report_retrained_dual.csv")

# 8g) Confusion matrix → CSV
cm = confusion_matrix(y_true, y_pred)
print("Confusion matrix:\n", cm)
cm_df = pd.DataFrame(
    cm,
    index  =["true_0", "true_1"],
    columns=["pred_0", "pred_1"]
)
cm_df.to_csv(
    os.path.join(test_args.output_dir, "confusion_matrix_retrained_dual.csv")
)
print("Saved confusion_matrix_retrained_dual.csv")


→ Final Retrained Test metrics: {'test_loss': 0.5614817142486572, 'test_model_preparation_time': 0.005, 'test_accuracy': 0.75, 'test_f1': 0.7406988511220907, 'test_runtime': 4.1556, 'test_samples_per_second': 31.764, 'test_steps_per_second': 2.166}
Saved all_results.json in ./results_test_retrained_dual_pldqa
Saved test_predictions.npy in ./results_test_retrained_dual_pldqa
              precision    recall  f1-score   support

           0       0.78      0.81      0.79        77
           1       0.71      0.67      0.69        55

    accuracy                           0.75       132
   macro avg       0.74      0.74      0.74       132
weighted avg       0.75      0.75      0.75       132

Saved classification_report_retrained_dual.csv
Confusion matrix:
 [[62 15]
 [18 37]]
Saved confusion_matrix_retrained_dual.csv


### 3 BERT-base (baseline)

In [None]:
# Load the tokenizer for the baseline: Danish BERT taken directly from the Hugging Face Hub
model_checkpoint = "Maltehb/danish-bert-botxo"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize(batch):
    return tokenizer(batch["input_text"],
                     truncation=True,
                     padding="max_length",
                     max_length=512)


all_cols = ['input_text','label_binary']
hf_train = Dataset.from_pandas(df_train[all_cols].reset_index(drop=True))
hf_val   = Dataset.from_pandas(df_val  [all_cols].reset_index(drop=True))
hf_test  = Dataset.from_pandas(df_test [all_cols].reset_index(drop=True))

# map, rename, and rebind each split
hf_train = (
    hf_train
    .map(tokenize, batched=True)
    .rename_column("label_binary", "labels")
)
hf_val = (
    hf_val
    .map(tokenize, batched=True)
    .rename_column("label_binary", "labels")
)
hf_test = (
    hf_test
    .map(tokenize, batched=True)
    .rename_column("label_binary", "labels")
)

# now explicitly set the columns we want returned
hf_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
hf_val  .set_format("torch", columns=["input_ids", "attention_mask", "labels"])
hf_test .set_format("torch", columns=["input_ids", "attention_mask", "labels"])


# ─── 7) Metrics & callback ────────────────────────

accuracy = evaluate.load("accuracy")
f1       = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1":       f1.compute(predictions=preds, references=labels, average="macro")["f1"]
    }

from transformers import TrainerCallback
class TrainMetricsCallback(TrainerCallback):
    def __init__(self, trainer=None):
        self.trainer = trainer

    def on_epoch_end(self, args, state, control, **kwargs):
        # We use predict() here on the *training* set purely for logging.
        # WARNING: predict() prefixes its metrics with "test_" (hence "test_loss" below), and
        # Hugging Face's WandB integration will log these under "test/…" even though this is
        # training-data performance, not true test-set metrics.
        if not self.trainer:
            return
        pred = self.trainer.predict(self.trainer.train_dataset)
        p = np.argmax(pred.predictions, axis=-1)
        l = pred.label_ids
        wandb.log({
            "train/accuracy": accuracy.compute(predictions=p, references=l)["accuracy"],
            "train/f1":       f1.compute(predictions=p, references=l, average="macro")["f1"],
            "train/loss":     pred.metrics["test_loss"],
            "epoch":          state.epoch
        })

Map:   0%|          | 0/368 [00:00<?, ? examples/s]

Map:   0%|          | 0/93 [00:00<?, ? examples/s]

Map:   0%|          | 0/132 [00:00<?, ? examples/s]

#### 3.1 BERT-base (baseline) TESTING: LOAD IN

In [None]:
# ----- Identifying the best dual-criterion run
import wandb
import numpy as np

# 1) Login & point at sweep
wandb.login()
api      = wandb.Api()
project  = "pernillebrams/danish-bert-answer-base-danish-bert-binary"
sweep_id = "dci7w3rc"
sweep    = api.sweep(f"{project}/{sweep_id}")

candidates = []
for run in sweep.runs:
    summary = run.summary
    # named run_f1 so the module-level `f1` metric object is not overwritten
    run_f1 = summary.get("eval/f1", None)
    if run_f1 is None:
        continue  # no F1 logged

    # 2) pull down the eval/loss history
    hist = run.history(keys=["eval/loss"], pandas=False)  # a list of dicts
    losses = [row["eval/loss"] for row in hist if row.get("eval/loss") is not None]

    if len(losses) < 2:
        continue  # not enough points to establish a trend

    # 3) checking if loss is trending downward
    #    here we simply compare the first vs last logged loss
    if losses[-1] < losses[0]:
        candidates.append((run, run_f1, losses[0], losses[-1]))

# 4) picking the winner with the highest eval/f1
if not candidates:
    # fail loudly: otherwise a stale best_run/best_hp left over from an earlier cell could be reused
    raise RuntimeError("No runs found with decreasing eval/loss.")

best_run, best_f1, start_loss, end_loss = max(candidates, key=lambda x: x[1])
print(f"Selected run: {best_run.id}, sweep name: {sweep.name}, run name: {best_run.name}")
print(f"  eval/f1 = {best_f1:.4f}")
print(f"  eval/loss: {start_loss:.4f} → {end_loss:.4f}  (downward trend)")

# 5) extracting the hyperparameters
hp_keys = [
    "learning_rate",
    "per_device_train_batch_size",
    "num_train_epochs",
    "weight_decay_hyperparam",
    "warmup_ratio",
    #"lr_scheduler_type",
    #"dropout"
]

print("\nHyperparameters:")
for hp in hp_keys:
    print(f"  {hp}: {best_run.config.get(hp)}")

best_hp = {k: best_run.config[k] for k in hp_keys}
print("Best hyperparameters extracted:", best_hp)

Selected run: cli4lawu, sweep name: dci7w3rc, run name: devout-sweep-12
  eval/f1 = 0.7738
  eval/loss: 0.6998 → 0.6932  (downward trend)

Hyperparameters:
  learning_rate: 2.795757662192855e-05
  per_device_train_batch_size: 6
  num_train_epochs: 6
  weight_decay_hyperparam: 0.16160686772002836
  warmup_ratio: 0.06
Best hyperparameters extracted: {'learning_rate': 2.795757662192855e-05, 'per_device_train_batch_size': 6, 'num_train_epochs': 6, 'weight_decay_hyperparam': 0.16160686772002836, 'warmup_ratio': 0.06}
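
The trend check above compares only the first and the last logged `eval/loss`, so even a change as small as 0.6998 → 0.6932 counts as a downward trend. One possible stricter variant, shown here purely as a sketch and not as what the notebook does, would require a minimum relative drop. The function name `loss_decreased` and the parameter `min_rel_drop` are hypothetical.

```python
def loss_decreased(losses, min_rel_drop=0.0):
    """True if the last loss is below the first by more than `min_rel_drop` (relative).

    With min_rel_drop=0.0 this reduces to the first-vs-last comparison used in
    the selection cell. `min_rel_drop` is a hypothetical knob, not a notebook setting.
    """
    if len(losses) < 2:
        return False
    first, last = losses[0], losses[-1]
    return (first - last) / first > min_rel_drop

# First and last eval/loss of the selected runs, as printed in the outputs above.
bert_base = [0.6998, 0.6932]
pldqa     = [0.6391, 0.5396]

print(loss_decreased(bert_base), loss_decreased(bert_base, min_rel_drop=0.05))
print(loss_decreased(pldqa), loss_decreased(pldqa, min_rel_drop=0.05))
```

With an assumed threshold of 5%, the selected PLDQA run would still qualify, while the selected BERT-base run would not. Whether such a threshold is appropriate is a design choice; the value here is only for illustration.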


In [None]:
# -- Rerunning the exact sweep run, but with dual-monitoring early stopping
# ─── 2) Metrics fn ─────────────────────────────────────────────────────────────
accuracy = evaluate.load("accuracy")
f1       = evaluate.load("f1")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1":       f1.compute(predictions=preds, references=labels, average="macro")["f1"]
    }

# ─── 3) Best hyperparameters (retrieved from above dynamically) ────────────────────────────────────────────────────
# In this: best_hp

# ─── 4) WandB init (for logging) ──────────────────────────────────────
wandb.init(
    project="danish-bert-answer-base-danish-bert-binary",
    entity="pernillebrams",
    name="manual-replay-best-danish-bert-base-dual",
    config=best_hp,
    reinit=True
)

# ─── 5) Build model & TrainingArguments ────────────────────────────────────────
config = AutoModelForSequenceClassification.from_pretrained(model_checkpoint).config
config.num_labels = 2
#config.hidden_dropout_prob          = best_hp["dropout"]
#config.attention_probs_dropout_prob = best_hp["dropout"]

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, config=config)

from transformers import TrainerCallback

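# Dual early-stopping callback: tracks eval loss and eval F1 with separate patience counters
class LossAndF1EarlyStop(TrainerCallback):
    """Stop training once eval loss or eval F1 has gone more than its patience
    (counted in consecutive evaluations since its last best value) without improving."""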
    def __init__(self, loss_patience=1, f1_patience=2):
        self.loss_patience, self.f1_patience = loss_patience, f1_patience
        self.best_loss, self.loss_wait = float('inf'), 0
        self.best_f1, self.f1_wait   = 0, 0

    def on_evaluate(self, args, state, control, metrics=None, **_):
        loss = metrics["eval_loss"]
        f1   = metrics["eval_f1"]
        # track loss
        if loss < self.best_loss:
            self.best_loss, self.loss_wait = loss, 0
        else:
            self.loss_wait += 1
        # track f1
        if f1 > self.best_f1:
            self.best_f1, self.f1_wait = f1, 0
        else:
            self.f1_wait += 1
        # stop if either has stalled
        if self.loss_wait > self.loss_patience or self.f1_wait > self.f1_patience:
            control.should_training_stop = True
        return control


args = TrainingArguments(
    output_dir                ="./best_replay_model_basebert_dual",
    learning_rate             =best_hp["learning_rate"],
    per_device_train_batch_size=best_hp["per_device_train_batch_size"],
    per_device_eval_batch_size =best_hp["per_device_train_batch_size"],
    num_train_epochs          =best_hp["num_train_epochs"],
    weight_decay              =best_hp["weight_decay_hyperparam"],
    warmup_ratio              =best_hp["warmup_ratio"],
    #lr_scheduler_type         =best_hp["lr_scheduler_type"], # not used atm

    eval_strategy             ="epoch",
    save_strategy             ="epoch",
    logging_strategy          ="epoch",
    load_best_model_at_end    =True,
    metric_for_best_model     ="eval_f1",
    greater_is_better         =True,

    report_to=["wandb"]
)

# ─── 6) Trainer (same setup as the sweep, but with dual early stopping) ────────
trainer = Trainer(
    model           =model,
    args            =args,
    train_dataset   =hf_train,
    eval_dataset    =hf_val,
    compute_metrics =compute_metrics,
    #callbacks       =[EarlyStoppingCallback(early_stopping_patience=2)]
    # Halt as soon as either eval loss or eval F1 stalls, rather than watching F1 alone
    callbacks       =[LossAndF1EarlyStop(loss_patience=1, f1_patience=2)]
)

# ─── 7) Run & save ─────────────────────────────────────────────────────────────
trainer.train()
trainer.save_model("./best_replay_model_basebert_dual")  # checkpoint files go here

wandb.finish()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at Maltehb/danish-bert-botxo and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.7205,0.655505,0.580645,0.473203
2,0.6124,0.581449,0.698925,0.675474
3,0.4421,0.589946,0.698925,0.698611
4,0.2296,0.627707,0.784946,0.782913


Run history:
eval/accuracy,▁▅▅█
eval/f1,▁▆▆█
eval/loss,█▁▂▅
eval/runtime,▁█▃▃
eval/samples_per_second,█▁▆▅
eval/steps_per_second,█▁▆▅
train/epoch,▁▁▃▃▆▆███
train/global_step,▁▁▃▃▆▆███
train/grad_norm,▃█▃▁
train/learning_rate,█▆▃▁

Run summary:
eval/accuracy,0.78495
eval/f1,0.78291
eval/loss,0.62771
eval/runtime,3.3424
eval/samples_per_second,27.825
eval/steps_per_second,4.787
total_flos,387299473489920.0
train/epoch,4.0
train/global_step,248.0
train/grad_norm,4.4126
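
The stopping point of this run can be checked by replaying the callback's bookkeeping on the validation values logged in the epoch table above. This is a standalone sketch that mirrors the counters in `LossAndF1EarlyStop`; the function name `first_stop_epoch` and the list names are illustrative and not part of the notebook.

```python
# Validation loss and macro F1 per epoch, copied from the table above
val_losses = [0.655505, 0.581449, 0.589946, 0.627707]
val_f1s    = [0.473203, 0.675474, 0.698611, 0.782913]

def first_stop_epoch(losses, f1s, loss_patience=1, f1_patience=2):
    """Return (epoch, criterion) at which the dual rule would stop, or None."""
    best_loss, loss_wait = float("inf"), 0
    best_f1, f1_wait = 0.0, 0
    for epoch, (loss, f1) in enumerate(zip(losses, f1s), start=1):
        # Loss counter resets only on a strict improvement
        if loss < best_loss:
            best_loss, loss_wait = loss, 0
        else:
            loss_wait += 1
        # F1 counter resets only on a strict improvement
        if f1 > best_f1:
            best_f1, f1_wait = f1, 0
        else:
            f1_wait += 1
        # Same condition as the callback: a counter must exceed its patience
        if loss_wait > loss_patience or f1_wait > f1_patience:
            return epoch, "loss" if loss_wait > loss_patience else "f1"
    return None

result = first_stop_epoch(val_losses, val_f1s)
print(result)
```

With these values the loss counter exceeds its patience at epoch 4 (validation loss last improved at epoch 2) while F1 is still rising, so the loss criterion is what ends the run after 4 of the 6 planned epochs.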


In [None]:
# ─── 8) Evaluate on hf_test ────────────────────────────────────────────────────
import numpy as np
import pandas as pd
import os
from sklearn.metrics import classification_report, confusion_matrix
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# 8a) Load the dual-stopped model
model = AutoModelForSequenceClassification.from_pretrained("./best_replay_model_basebert_dual")

# 8b) Prepare the Trainer for evaluation
test_args = TrainingArguments(
    output_dir                ="./results_test_replay_basebert_dual",
    per_device_eval_batch_size=16,
    report_to=[]
)
test_trainer = Trainer(
    model           = model,
    args            = test_args,
    compute_metrics = compute_metrics  # as defined earlier
)

# 8c) Run prediction → returns PredictionOutput
test_output = test_trainer.predict(hf_test)
print("→ Final Test metrics:", test_output.metrics)

# ensure output directory exists
os.makedirs(test_args.output_dir, exist_ok=True)

# 8d) Save the JSON of all test metrics
test_trainer.save_metrics("test", test_output.metrics)
print(f"Saved all_results.json in {test_args.output_dir}")

# 8e) Save raw predictions
preds = test_output.predictions
np.save(os.path.join(test_args.output_dir, "test_predictions.npy"), preds)
print(f"Saved test_predictions.npy in {test_args.output_dir}")

# 8f) Classification report → CSV
y_true = df_test["label_binary"].to_numpy()
y_pred = np.argmax(preds, axis=-1)
clf_dict = classification_report(y_true, y_pred, output_dict=True)
print(classification_report(y_true, y_pred))

pd.DataFrame(clf_dict).T.to_csv(
    os.path.join(test_args.output_dir, "classification_report_replay_dual.csv")
)
print("Saved classification_report_replay_dual.csv")

# 8g) Confusion matrix → CSV
cm = confusion_matrix(y_true, y_pred)
print("Confusion matrix:\n", cm)
cm_df = pd.DataFrame(
    cm,
    index=["true_0", "true_1"],
    columns=["pred_0", "pred_1"]
)
cm_df.to_csv(
    os.path.join(test_args.output_dir, "confusion_matrix_replay_dual.csv")
)
print("Saved confusion_matrix_replay_dual.csv")


→ Final Test metrics: {'test_loss': 0.7440537214279175, 'test_model_preparation_time': 0.0029, 'test_accuracy': 0.6742424242424242, 'test_f1': 0.6600179694519317, 'test_runtime': 4.3722, 'test_samples_per_second': 30.191, 'test_steps_per_second': 2.058}
Saved all_results.json in ./best_replay_model_basebert_dual
Saved test_predictions.npy in ./best_replay_model_basebert_dual
              precision    recall  f1-score   support

           0       0.71      0.75      0.73        77
           1       0.62      0.56      0.59        55

    accuracy                           0.67       132
   macro avg       0.66      0.66      0.66       132
weighted avg       0.67      0.67      0.67       132

Saved classification_report_replay_dual.csv
Confusion matrix:
 [[58 19]
 [24 31]]
Saved confusion_matrix_replay_dual.csv
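
The headline test metrics can be re-derived from the confusion matrix alone, which is a quick consistency check on the saved files. The sketch below is plain Python with illustrative names (`cm_replay`, `per_class_f1`); it assumes the usual scikit-learn layout of rows as true classes and columns as predicted classes.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
cm_replay = [[58, 19],
             [24, 31]]

def per_class_f1(cm, k):
    """F1 for class k computed from true positives, false positives and false negatives."""
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm))) - tp
    fn = sum(cm[k]) - tp
    return 2 * tp / (2 * tp + fp + fn)

support = [sum(row) for row in cm_replay]          # samples per true class
total = sum(support)
accuracy = sum(cm_replay[k][k] for k in range(len(cm_replay))) / total
f1s = [per_class_f1(cm_replay, k) for k in range(len(cm_replay))]

macro_f1 = sum(f1s) / len(f1s)                                  # unweighted mean over classes
weighted_f1 = sum(f * s for f, s in zip(f1s, support)) / total  # mean weighted by support

print(f"accuracy={accuracy:.4f}  macro_f1={macro_f1:.4f}  weighted_f1={weighted_f1:.4f}")
```

The accuracy and macro F1 computed this way agree with `test_accuracy` and `test_f1` in the metrics line above, which confirms that `test_f1` is the macro average rather than the weighted one.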


#### 3.2 BERT-base (baseline) TESTING: RETRAIN

In [None]:
# Checking the checkpoint
model_checkpoint

'Maltehb/danish-bert-botxo'

In [None]:
from transformers import (
    Trainer,
    TrainingArguments,
    TrainerCallback,
    AutoConfig,
    AutoModelForSequenceClassification
)
import numpy as np
import pandas as pd
from datasets import concatenate_datasets
from sklearn.metrics import accuracy_score, f1_score

# 1) Combine train + val
hf_trainval = concatenate_datasets([hf_train, hf_val])

# 2) Use the same best_hp from sweep‐selection step
# e.g. best_hp = { "learning_rate": ..., "per_device_train_batch_size": ..., ... }
print("Re-training with hyperparameters:", best_hp)

# 3) Initialize a fresh model from the original BERT-base checkpoint (model_checkpoint, defined earlier)
config = AutoConfig.from_pretrained(model_checkpoint, num_labels=2)
model  = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, config=config)

# 4) Define a metrics function for evaluating on the training set
def compute_metrics_train(p):
    preds = np.argmax(p.predictions, axis=1)
    return {
        "train_accuracy": accuracy_score(p.label_ids, preds),
        # Weighted average here; the validation/test compute_metrics uses macro
        "train_f1":       f1_score(p.label_ids, preds, average="weighted"),
        # Note: the loss is added by trainer.evaluate() as "train_loss" (via metric_key_prefix="train")
    }

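# 5) Callback that evaluates on the training set after each epoch and records the metrics
class TrainLoggingCallback(TrainerCallback):
    def __init__(self):
        self.history = []
        self.trainer = None  # set after the Trainer is built (cb.trainer = trainer below)
    def on_epoch_end(self, args, state, control, **kwargs):
        # Full evaluation pass over the training data; metric keys come back prefixed with "train_"
        m = self.trainer.evaluate(self.trainer.train_dataset, metric_key_prefix="train")
        self.history.append(dict(epoch=state.epoch, **m))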
        print(
            f"Epoch {int(state.epoch)} | "
            f"loss {m['train_loss']:.4f} "
            f"acc  {m['train_accuracy']:.4f} "
            f"f1   {m['train_f1']:.4f}"
        )

cb = TrainLoggingCallback()

# 6) Set up and run Trainer without any eval_dataset
trainer = Trainer(
    model           = model,
    args            = TrainingArguments(
        output_dir                 = "./best_final_dual_retrained_bertbase_model",
        learning_rate              = best_hp["learning_rate"],
        per_device_train_batch_size= best_hp["per_device_train_batch_size"],
        num_train_epochs           = best_hp["num_train_epochs"],
        weight_decay               = best_hp["weight_decay_hyperparam"],
        warmup_ratio               = best_hp["warmup_ratio"],
        #lr_scheduler_type          = best_hp.get("lr_scheduler_type","linear"), # not used atm

        eval_strategy              = "no",    # skip validation
        logging_strategy           = "no",
        save_strategy              = "epoch",
        report_to                  = []       # turn off W&B
    ),
    train_dataset   = hf_trainval,
    compute_metrics = compute_metrics_train,
    callbacks       = [cb]
)
cb.trainer = trainer

# Note: the progress table shown during training has a "Validation Loss" column. That is just the
# Trainer's generic table header; the values here are computed on the training data (hf_trainval).
trainer.train()
trainer.save_model("./best_final_dual_retrained_bertbase_model")  # writes config.json, the model weights, etc.

# 7) Inspect & save the training history
train_df = pd.DataFrame(cb.history).set_index("epoch")
train_df.to_csv("train_history_dual_retrained.csv")
print(train_df)




Re-training with hyperparameters: {'learning_rate': 2.795757662192855e-05, 'per_device_train_batch_size': 6, 'num_train_epochs': 6, 'weight_decay_hyperparam': 0.16160686772002836, 'warmup_ratio': 0.06}


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at Maltehb/danish-bert-botxo and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss,Validation Loss,Accuracy,F1
77,No log,0.553905,0.739696,0.718409
154,No log,0.399363,0.815618,0.815551
231,No log,0.236346,0.913232,0.913297
308,No log,0.052764,0.984816,0.9848
385,No log,0.023687,0.995662,0.99566
462,No log,0.017911,0.997831,0.99783


Epoch 1 | loss 0.5539 acc  0.7397 f1   0.7184
Epoch 2 | loss 0.3994 acc  0.8156 f1   0.8156
Epoch 3 | loss 0.2363 acc  0.9132 f1   0.9133
Epoch 4 | loss 0.0528 acc  0.9848 f1   0.9848
Epoch 5 | loss 0.0237 acc  0.9957 f1   0.9957
Epoch 6 | loss 0.0179 acc  0.9978 f1   0.9978
       train_accuracy  train_f1  train_loss
epoch                                      
1.0          0.739696  0.718409    0.553905
2.0          0.815618  0.815551    0.399363
3.0          0.913232  0.913297    0.236346
4.0          0.984816  0.984800    0.052764
5.0          0.995662  0.995660    0.023687
6.0          0.997831  0.997830    0.017911


In [None]:
# ─── 8) Evaluate the retrained (train + val) Dual-Criterion model on hf_test ────
import os
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# 8a) Load the retrained checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    "./best_final_dual_retrained_bertbase_model"
)

# 8b) Prepare the Trainer for evaluation
test_args = TrainingArguments(
    output_dir                ="./results_test_retrained_dual_bertbase",
    per_device_eval_batch_size=16,
    report_to=[]
)
test_trainer = Trainer(
    model           = model,
    args            = test_args,
    compute_metrics = compute_metrics    # the original compute_metrics fn
)

# 8c) Run prediction → returns PredictionOutput
test_output = test_trainer.predict(hf_test)
print("→ Final Retrained Test metrics:", test_output.metrics)

# ensure output directory exists
os.makedirs(test_args.output_dir, exist_ok=True)

# 8d) Save the JSON of all test metrics
test_trainer.save_metrics("test", test_output.metrics)
print(f"Saved all_results.json in {test_args.output_dir}")

# 8e) Save raw predictions
preds = test_output.predictions
np.save(os.path.join(test_args.output_dir, "test_predictions.npy"), preds)
print(f"Saved test_predictions.npy in {test_args.output_dir}")

# 8f) Classification report → CSV
y_true = df_test["label_binary"].to_numpy()
y_pred = np.argmax(preds, axis=-1)
print(classification_report(y_true, y_pred))
clf_dict = classification_report(y_true, y_pred, output_dict=True)
pd.DataFrame(clf_dict).T.to_csv(
    os.path.join(test_args.output_dir, "classification_report_retrained_dual.csv")
)
print("Saved classification_report_retrained_dual.csv")

# 8g) Confusion matrix → CSV
cm = confusion_matrix(y_true, y_pred)
print("Confusion matrix:\n", cm)
cm_df = pd.DataFrame(
    cm,
    index  =["true_0", "true_1"],
    columns=["pred_0", "pred_1"]
)
cm_df.to_csv(
    os.path.join(test_args.output_dir, "confusion_matrix_retrained_dual.csv")
)
print("Saved confusion_matrix_retrained_dual.csv")


→ Final Retrained Test metrics: {'test_loss': 1.2037389278411865, 'test_model_preparation_time': 0.0029, 'test_accuracy': 0.6590909090909091, 'test_f1': 0.6574987026466009, 'test_runtime': 4.0622, 'test_samples_per_second': 32.494, 'test_steps_per_second': 2.216}
Saved all_results.json in ./results_test_retrained_dual_bertbase
Saved test_predictions.npy in ./results_test_retrained_dual_bertbase
              precision    recall  f1-score   support

           0       0.75      0.62      0.68        77
           1       0.57      0.71      0.63        55

    accuracy                           0.66       132
   macro avg       0.66      0.67      0.66       132
weighted avg       0.68      0.66      0.66       132

Saved classification_report_retrained_dual.csv
Confusion matrix:
 [[48 29]
 [16 39]]
Saved confusion_matrix_retrained_dual.csv
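
Because accuracy and macro F1 barely differ between the early-stopped and the retrained model, the per-class error profile is the more informative comparison. The sketch below puts the two confusion matrices printed in this section side by side; the names `cm_early_stopped`, `cm_retrained` and `recalls` are illustrative, and the row/column layout (true class by predicted class) is assumed from scikit-learn's convention.

```python
# Confusion matrices copied from the two test evaluations in this section
cm_early_stopped = [[58, 19], [24, 31]]  # dual early-stopped replay model
cm_retrained     = [[48, 29], [16, 39]]  # model retrained on train + val

def recalls(cm):
    """Per-class recall: correct predictions divided by the true-class row total."""
    return [cm[k][k] / sum(cm[k]) for k in range(len(cm))]

for name, cm in [("early-stopped", cm_early_stopped), ("retrained", cm_retrained)]:
    r0, r1 = recalls(cm)
    false_pos, false_neg = cm[0][1], cm[1][0]  # errors with respect to class 1
    print(f"{name:13s} recall_0={r0:.2f}  recall_1={r1:.2f}  FP={false_pos}  FN={false_neg}")
```

On these numbers, retraining on train + val trades 10 additional false positives (19 to 29) for 8 fewer false negatives (24 to 16): recall for class 1 rises while recall for class 0 falls, and the aggregate scores stay close.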
