<center><br><br>
<font size=6>🎓 <b>Advanced Deep Learning - NLP Final Project</b></font><br>
<font size=6>📊 <b>mDeBERTa-v3-base Training (EDA)</b></font><br>
<font size=5>👥 <b>Group W</b></font><br><br>
<b>Adi Shalit</b>, ID: <code>206628885</code><br>
<b>Gal Gussarsky</b>, ID: <code>206453540</code><br><br>
<font size=4>📘 Course ID: <code>05714184</code></font><br>
<font size=4>📅 Spring 2025</font>
<br><br>
<hr style="width:60%; border:1px solid gray;"></center>


# 📑 Table of Contents

- [Training EX4](#Training-EX4)
- [Load best Model & Test EX4](#Load-Best-Model-EX4)
- [Training EX5](#Training-EX5)
- [Load best Model & Test EX5](#Load-Best-Model-EX5)




# 🚀 Training **mDeBERTa-v3-base** (Exercise 4 Style)

In this stage, we fine-tune **mDeBERTa-v3-base** for **5-class sentiment classification**.  
DeBERTa improves upon BERT/RoBERTa by introducing **disentangled attention** (separating content & position) and **relative position embeddings**, which allow it to capture context and word order more effectively.  
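
As a rough sketch of what "disentangled" means here, the DeBERTa paper writes the attention score between tokens $i$ and $j$ as a sum of content-to-content, content-to-position, and position-to-content terms. The notation follows that paper and is not taken from this notebook's code: $Q^c, K^c, V^c$ are content projections, $Q^r, K^r$ are projections of the shared relative-position embeddings, $\delta(i,j)$ is the bucketed relative distance from $i$ to $j$, and $d$ is the per-head dimension.

$$
\tilde{A}_{i,j} = Q^c_i {K^c_j}^{\top} + Q^c_i {K^r_{\delta(i,j)}}^{\top} + K^c_j {Q^r_{\delta(j,i)}}^{\top},
\qquad
H_o = \operatorname{softmax}\!\left(\frac{\tilde{A}}{\sqrt{3d}}\right) V^c
$$

The $\sqrt{3d}$ scaling (instead of the usual $\sqrt{d}$) accounts for the three summed terms.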

---

## 🎯 **What we do here**
- ⚡ **Pretrained Backbone**: Start from `microsoft/mdeberta-v3-base`.
- 🔒 **Layer Freezing**: Freeze most layers, unfreezing only the **last *k*** (with *k* optimized by Optuna).
- 🏷 **Classification Head**: Add a linear classifier for **5 sentiment classes**.
- 🎛 **Hyperparameter Tuning**: Use **Optuna** to search learning rate, weight decay, batch size, etc.
- ⏳ **Early Stopping**: Stop training when validation **F1** stops improving.
- 📊 **Experiment Tracking**: Log all runs with **Weights & Biases (W&B)**.
- ⚙️ **Training Tricks**: Mixed precision + gradient clipping for faster and more stable convergence.  
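
The early-stopping rule in the list above can be sketched in plain Python. This is a minimal illustration of the patience logic only, assuming one validation F1 value per epoch. The function name and the F1 values are made up for the example and are not results from our runs.

```python
from typing import Iterable, Tuple

def early_stopping_epoch(val_f1_per_epoch: Iterable[float], patience: int) -> Tuple[int, float]:
    """Return (best_epoch, best_f1), stopping after `patience` epochs without improvement."""
    best_f1, best_epoch, no_improve = -1.0, 0, 0
    for epoch, f1 in enumerate(val_f1_per_epoch, start=1):
        if f1 > best_f1:
            # In a real loop, this is where the best checkpoint would be saved.
            best_f1, best_epoch, no_improve = f1, epoch, 0
        else:
            no_improve += 1
            if no_improve >= patience:
                break  # stop: no improvement for `patience` consecutive epochs
    return best_epoch, best_f1

# Hypothetical validation F1 curve (illustrative numbers only)
history = [0.70, 0.78, 0.81, 0.80, 0.805, 0.79, 0.80, 0.90]
print(early_stopping_epoch(history, patience=4))
```

With `patience=4` the loop stops after epoch 7, so the late jump at epoch 8 is never seen. That is the trade-off of a small patience value.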

---



In [1]:
# import os
# os.environ["TRANSFORMERS_NO_TF"] = "1"   # force-disable TensorFlow backend
# os.environ["TRANSFORMERS_NO_FLAX"] = "1" # optional: also disable Flax/JAX


# Training EX4

In [1]:
# --- Cell 1: setup & speed knobs ---
# !pip install optuna==3.6.1 wandb==0.17.5

import os, math, random, time
from typing import Dict, Tuple
import numpy as np
import pandas as pd

import torch
from torch.utils.data import Dataset, DataLoader
from torch.cuda.amp import autocast, GradScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

import optuna
import wandb
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    get_linear_schedule_with_warmup
)


# -------- RTX 4090 performance switches --------
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
if torch.cuda.is_available():
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    try:
        torch.set_float32_matmul_precision("high")
    except Exception:
        pass

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
SEED = 42
def set_seed(s=42):
    random.seed(s); np.random.seed(s)
    torch.manual_seed(s); torch.cuda.manual_seed_all(s)
    torch.backends.cudnn.deterministic = False   # allow fast kernels
    torch.backends.cudnn.benchmark = True
set_seed(SEED)

# --- W&B non-blocking defaults ---
os.environ.setdefault("WANDB_MODE", "online")  # set to "offline" if you prefer
WANDB_PROJECT = "adv-dl-deberta-v3"




# 🌍 mDeBERTa-v3-base Description

After first experimenting with **RoBERTa-base**, we now move to a stronger multilingual model:  
**mDeBERTa-v3-base** (multilingual DeBERTa v3).  

---

## 🔎 Why mDeBERTa?
- 🧠 **Better architecture** → DeBERTa introduces **disentangled attention** (separating content vs. position info) and **relative position embeddings**, which help the model better capture context and word order.  
- ⚡ **DeBERTa v3** → replaces masked-language-model pretraining with **ELECTRA-style replaced token detection** and adds **gradient-disentangled embedding sharing**, making pretraining more sample-efficient and boosting accuracy.  
- 🌐 **Multilingual support** → pretrained on the **CC100** multilingual corpus (~2.5 TB of text, the same data used for XLM-R) with a ~250k-token vocabulary, enabling cross-lingual transfer.  
- 📏 **Model size** → 12 Transformer layers, hidden size 768, ~279M parameters (all trainable right after loading, before any layer freezing is applied).  
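
To see where the ~279M parameters come from, the count can be rebuilt by hand. The sketch below is back-of-the-envelope arithmetic, not a call into the library. The feed-forward size (3072) and the 512-entry relative-position table are assumptions taken from the public model config rather than from this notebook. The vocabulary size, hidden size, layer count, and 5-label head are the ones used here.

```python
# Sizes assumed from the public mDeBERTa-v3-base config (not all are visible in this notebook)
VOCAB, HIDDEN, FFN, LAYERS = 251_000, 768, 3_072, 12
REL_POSITIONS = 512   # assumed: 2 x 256 relative-position buckets
NUM_LABELS = 5

def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out

layer_norm = 2 * HIDDEN  # scale and shift vectors

embeddings = VOCAB * HIDDEN + layer_norm
per_layer = (
    4 * linear(HIDDEN, HIDDEN)   # query, key, value, attention output
    + layer_norm                 # post-attention LayerNorm
    + linear(HIDDEN, FFN)        # feed-forward up-projection
    + linear(FFN, HIDDEN)        # feed-forward down-projection
    + layer_norm                 # post-feed-forward LayerNorm
)
encoder_extras = REL_POSITIONS * HIDDEN + layer_norm       # relative-position table + its LayerNorm
head = linear(HIDDEN, HIDDEN) + linear(HIDDEN, NUM_LABELS)  # pooler + classifier

total = embeddings + LAYERS * per_layer + encoder_extras + head
print(f"embeddings    : {embeddings:>12,} ({100 * embeddings / total:.1f}%)")
print(f"encoder layers: {LAYERS * per_layer:>12,}")
print(f"total         : {total:>12,}")
```

Under these assumptions the total is 278,813,189. Most of it sits in the multilingual embedding table rather than in the 12 Transformer layers.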

---

## 🏋️ Training Plan
Initially, we also considered training on the **original untranslated dataset**, but given the time and complexity involved we decided to **focus only on the clean translated version**.  
The final plan is therefore:  

- ✅ **Train** on the **clean translated training data**.  
- ✅ **Evaluate** on the **clean translated test set** (same conditions as training).  

✨ This ensures a controlled, consistent training pipeline, while letting us directly compare mDeBERTa’s performance with RoBERTa under the same clean setup.  


In [2]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/mdeberta-v3-base", num_labels=5
).to(DEVICE)

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable params: {trainable_params:,} / {total_params:,} "
      f"({100.0*trainable_params/total_params:.2f}%)")


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 278,813,189 / 278,813,189 (100.00%)




In [3]:
model

DebertaV2ForSequenceClassification(
  (deberta): DebertaV2Model(
    (embeddings): DebertaV2Embeddings(
      (word_embeddings): Embedding(251000, 768, padding_idx=0)
      (LayerNorm): LayerNorm((768,), eps=1e-07, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): DebertaV2Encoder(
      (layer): ModuleList(
        (0-11): 12 x DebertaV2Layer(
          (attention): DebertaV2Attention(
            (self): DisentangledSelfAttention(
              (query_proj): Linear(in_features=768, out_features=768, bias=True)
              (key_proj): Linear(in_features=768, out_features=768, bias=True)
              (value_proj): Linear(in_features=768, out_features=768, bias=True)
              (pos_dropout): Dropout(p=0.1, inplace=False)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): DebertaV2SelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): Layer

# 📊 First Training Results – Hyperparameter Insights

https://wandb.ai/adishalit1-tel-aviv-university/adv-dl-p2?nw=nwuseradishalit1

In the **first training phase** with **mDeBERTa-v3-base**, we used **Optuna** to explore a wide hyperparameter space.  
Although prior research suggests that DeBERTa works best with learning rates around **1e-5 – 6e-5**,  
we intentionally searched a range that extends an order of magnitude lower (**1e-6 – 5e-5**) to test stability and robustness.
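
Because the learning rate is searched on a log scale (`log=True` in `suggest_float`), the low end of the range gets much more attention than a linear search would give it. The stdlib sketch below only mimics log-domain sampling to illustrate this. It is not Optuna's sampler (the default, TPE, also adapts to earlier trials), and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random

def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(42)
samples = [sample_log_uniform(1e-6, 5e-5, rng) for _ in range(10_000)]

# Share of draws that land below 1e-5, i.e. below the range usually recommended for DeBERTa
below_1e5 = sum(s < 1e-5 for s in samples) / len(samples)
print(f"share of draws below 1e-5: {below_1e5:.2f}")
```

Analytically the share is ln(10) / ln(50), roughly 59%. With purely random log-scale sampling, more than half of the trials would therefore probe learning rates below 1e-5.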

---

## 🏆 Key Observations
- **⭐ Best Trial (Trial 2):**  
  - Validation F1 = **0.88022**  
  - Learning rate = **3.5e-5**  
  - Batch size = **8**  
  - Weight decay = **9.4e-5**  
  - **All 12 encoder layers unfrozen**  

  🔹 This score is close to the **~88.2** English XNLI score reported on the mDeBERTa-v3-base model card. That figure comes from a different task and metric, so the match is a rough sanity check of our setup rather than a reproduction.
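
The validation F1 reported here is macro-averaged (`average="macro"` in the evaluation code), so each of the five classes counts equally regardless of its size. A minimal pure-Python sketch of that metric follows. `macro_f1` is a hypothetical helper and the labels are toy data. Unlike scikit-learn, it averages over a fixed `num_classes` even when a class never appears.

```python
from typing import Sequence

def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], num_classes: int) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); score 0 when the class is absent everywhere
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes

# Toy example with 3 classes (not project data)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(y_true, y_pred, num_classes=3), 4))
```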

---

### 🔑 Hyperparameter Insights
- **Learning rate:**  
  - Best range: **3e-5 – 9e-5**  
  - Too high → unstable training  
  - Too low (<1e-5) → stable but plateaued

- **Unfrozen layers:**  
  - **9–12 layers unfrozen** performed best  
  - Unfreezing all 12 encoder layers > partial freezing (the embeddings stay frozen in `build_model` either way)

- **Weight decay:**  
  - Small values (<1e-5) gave stability  
  - Very large → degraded performance

- **Batch size:**  
  - **8–16** = stable  
  - Larger sizes didn’t improve performance
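
To put the "unfrozen layers" numbers in perspective, the sketch below estimates how many parameters are actually trainable when only the last *k* encoder layers are unfrozen. It assumes the freezing scheme of `build_model` (base model frozen, last *k* layers plus the pooler and classifier trainable). The per-layer count is derived from the public model config, not measured in this notebook.

```python
PER_LAYER = 7_087_872   # parameters in one encoder layer (assumed from the public config)
HEAD = 594_437          # pooler + 5-way classifier, which build_model never freezes
TOTAL = 278_813_189     # total parameter count printed earlier in this notebook

def trainable_params(k: int) -> int:
    """Trainable parameter count when only the last k encoder layers are unfrozen."""
    return k * PER_LAYER + HEAD

for k in (4, 8, 12):
    n = trainable_params(k)
    print(f"k={k:>2}: {n:,} trainable ({100 * n / TOTAL:.2f}%)")
```

Even with all 12 layers unfrozen, under these assumptions only about 31% of the parameters are updated, because the large embedding table stays frozen.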

---

## 🚀 Next Steps
- 🔎 Narrow LR search to **1e-5 – 5e-5** (sweet spot).  
- ✅ Stick with **unfreezing all 12 encoder layers**.  
- 📉 Use **low weight decay (<1e-5)**.  
- 🔄 Retrain with **refined bounds** for a cleaner checkpoint and more reliable results.


In [3]:
# # =========================
# # ADV DL – Part B: mDeBERTa-v3-base fine-tuning – Exercise-4 style
# # Custom loop + early stopping + W&B + Optuna ONLY; freeze base, unfreeze last k layers
# # Uses df_train / df_test with columns: OriginalTweet (str), Sentiment (str)
# # =========================

# import os, math, random, time, json
# from typing import Dict, List, Tuple

# import numpy as np
# import pandas as pd
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report, confusion_matrix
# import torch
# from torch.utils.data import Dataset, DataLoader
# from torch.cuda.amp import autocast, GradScaler

# # ---- deps ----
# # !pip -q install transformers==4.43.3 optuna==3.6.1 wandb==0.17.5 >/dev/null

# import transformers
# from transformers import (
#     AutoTokenizer, AutoModelForSequenceClassification,
#     DataCollatorWithPadding, get_linear_schedule_with_warmup
# )

# import optuna
# import wandb

# # # from google.colab import drive
# # drive.mount("/content/drive")
# # DRIVE_OUT_DIR = "/content/drive/MyDrive/adv_dl_models"
# # os.makedirs(DRIVE_OUT_DIR, exist_ok=True)

# # -------------------------
# # Constants (no CFG, Optuna-only workflow)
# # -------------------------
# MODEL_NAME = "microsoft/mdeberta-v3-base"
# MAX_LEN = 512
# BATCH_SIZE = 16
# WARMUP_RATIO = 0.06
# GRAD_CLIP = 1.0
# USE_AMP = True
# PROJECT = "adv-dl-p2"
# BASE_RUN_NAME = "microsoft/mdeberta-v3-base_full_ex_4"
# TRIALS = 12
# SEED = 42
# DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# def set_seed(seed=42):
#     random.seed(seed); np.random.seed(seed)
#     torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
#     torch.backends.cudnn.deterministic = True
#     torch.backends.cudnn.benchmark = False

# set_seed(SEED)

# # -------------------------
# # Label mapping (5-way sentiment)
# # -------------------------
# CANON = {
#     "extremely negative": "extremely negative",
#     "negative": "negative",
#     "neutral": "neutral",
#     "positive": "positive",
#     "extremely positive": "extremely positive",
# }
# ORDER = ["extremely negative","negative","neutral","positive","extremely positive"]
# LABEL2ID = {lab: i for i, lab in enumerate(ORDER)}
# ID2LABEL = {i: lab for lab, i in LABEL2ID.items()}

# def normalize_label(s: str) -> str:
#     s = str(s).strip().lower()
#     s = s.replace("very negative", "extremely negative")
#     s = s.replace("very positive", "extremely positive")
#     s = s.replace("extreme negative", "extremely negative")
#     s = s.replace("extreme positive", "extremely positive")
#     return CANON.get(s, s)

# # -------------------------
# # Expect df_train, df_test in memory
# # -------------------------
# assert "OriginalTweet" in df_train.columns and "Sentiment" in df_train.columns, "df_train missing required columns"
# assert "OriginalTweet" in df_test.columns and "Sentiment" in df_test.columns, "df_test missing required columns"

# def prep_df(df: pd.DataFrame) -> pd.DataFrame:
#     df = df.copy()
#     df = df.dropna(subset=["OriginalTweet", "Sentiment"])
#     df["text"] = df["OriginalTweet"].astype(str).str.strip()
#     df["label_name"] = df["Sentiment"].apply(normalize_label)
#     df = df[df["label_name"].isin(ORDER)].reset_index(drop=True)
#     df["label"] = df["label_name"].map(LABEL2ID)
#     return df[["text", "label", "label_name"]]

# dftrain_ = prep_df(df_train)
# dftest_  = prep_df(df_test)

# train_df, val_df = train_test_split(
#     dftrain_, test_size=0.1, stratify=dftrain_["label"], random_state=SEED
# )
# print(f"Train/Val/Test sizes: {len(train_df)}/{len(val_df)}/{len(dftest_)}")

# # -------------------------
# # Dataset & Collator
# # -------------------------
# class TweetDataset(Dataset):
#     def __init__(self, df: pd.DataFrame, tokenizer: transformers.PreTrainedTokenizerBase, max_len: int):
#         self.texts = df["text"].tolist()
#         self.labels = df["label"].tolist()
#         self.tok = tokenizer
#         self.max_len = max_len
#     def __len__(self): return len(self.texts)
#     def __getitem__(self, idx):
#         enc = self.tok(self.texts[idx], truncation=True, max_length=self.max_len, padding=False)
#         enc["labels"] = self.labels[idx]
#         return {k: torch.tensor(v) for k, v in enc.items()}

# tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
# train_ds = TweetDataset(train_df, tokenizer, MAX_LEN)
# val_ds   = TweetDataset(val_df, tokenizer, MAX_LEN)
# test_ds  = TweetDataset(dftest_, tokenizer, MAX_LEN)

# collate_fn = DataCollatorWithPadding(tokenizer=tokenizer)
# train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True,  collate_fn=collate_fn, num_workers=2, pin_memory=True)
# val_loader   = DataLoader(val_ds,   batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn, num_workers=2, pin_memory=True)
# test_loader  = DataLoader(test_ds,  batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn, num_workers=2, pin_memory=True)

# # -------------------------
# # Model & Freeze/Unfreeze strategy
# # -------------------------
# def build_model(num_unfreeze_last_layers: int = 4):
#     model = AutoModelForSequenceClassification.from_pretrained(
#         MODEL_NAME, num_labels=len(ORDER), id2label=ID2LABEL, label2id=LABEL2ID
#     )
#     base = getattr(model, "roberta", None) or getattr(model, "bert", None) or getattr(model, "deberta", None)
#     if base is not None:
#         for p in base.parameters(): p.requires_grad = False
#         if hasattr(base, "encoder") and hasattr(base.encoder, "layer"):
#             k = num_unfreeze_last_layers
#             if k > 0:
#                 for layer in base.encoder.layer[-k:]:
#                     for p in layer.parameters(): p.requires_grad = True
#     for p in model.classifier.parameters(): p.requires_grad = True
#     return model.to(DEVICE)

# # -------------------------
# # Train / Eval utilities
# # -------------------------
# def get_optimizer_scheduler(model, num_training_steps: int, lr: float, weight_decay: float):
#     no_decay = ["bias", "LayerNorm.weight"]
#     optimizer_grouped_parameters = [
#         {"params": [p for n, p in model.named_parameters() if p.requires_grad and not any(nd in n for nd in no_decay)], "weight_decay": weight_decay},
#         {"params": [p for n, p in model.named_parameters() if p.requires_grad and any(nd in n for nd in no_decay)],  "weight_decay": 0.0},
#     ]
#     optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=lr)
#     num_warmup = int(num_training_steps * WARMUP_RATIO)
#     scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup, num_training_steps=num_training_steps)
#     return optimizer, scheduler

# def evaluate(model, loader) -> Dict[str, float]:
#     model.eval()
#     preds, labels = [], []
#     with torch.no_grad():
#         for batch in loader:
#             batch = {k: v.to(DEVICE) for k, v in batch.items()}
#             logits = model(**batch).logits
#             preds.extend(torch.argmax(logits, dim=-1).detach().cpu().tolist())
#             labels.extend(batch["labels"].detach().cpu().tolist())
#     acc = accuracy_score(labels, preds)
#     p, r, f1, _ = precision_recall_fscore_support(labels, preds, average="macro", zero_division=0)
#     return {"acc": acc, "precision": p, "recall": r, "f1": f1}

# def train_one_run(hp: Dict) -> Tuple[str, Dict[str, float]]:
#     """
#     hp keys: run_name, num_unfreeze_last_layers, lr, weight_decay, epochs, patience, trial_number
#     """
#     run_name = hp["run_name"]
#     num_unfreeze = int(hp["num_unfreeze_last_layers"])
#     lr = float(hp["lr"])
#     wd = float(hp["weight_decay"])
#     epochs = int(hp["epochs"])
#     patience = int(hp["patience"])

#     model = build_model(num_unfreeze)
#     total_steps = int(math.ceil(len(train_loader) * epochs))
#     optimizer, scheduler = get_optimizer_scheduler(model, total_steps, lr, wd)

#     scaler = GradScaler(enabled=(DEVICE == "cuda" and USE_AMP))
#     best_metric = -1.0
#     best_path = f"best_{run_name.replace('/', '_')}.pt"  # run_name contains "/", which would otherwise be read as a directory
#     no_improve = 0
#     wandb_run = wandb.init(
#         project=PROJECT,
#         name=run_name,
#         config={
#             "model": MODEL_NAME,
#             "max_len": MAX_LEN,
#             "batch_size": BATCH_SIZE,
#             "epochs": epochs,
#             "lr": lr,
#             "weight_decay": wd,
#             "warmup_ratio": WARMUP_RATIO,
#             "grad_clip": GRAD_CLIP,
#             "num_unfreeze_last_layers": num_unfreeze,
#             # >>> ADDED — purely for tracking the Optuna suggestion (even if not used by loaders)
#             "suggested_batch_size": hp.get("batch_size", BATCH_SIZE),
#             "trial_number": hp.get("trial_number", -1),
#         },
#         reinit=True,
#     )

#     # >>> ADDED — define step metrics so W&B charts look great
#     wandb.define_metric("epoch")
#     wandb.define_metric("step")
#     wandb.define_metric("train/*", step_metric="step")
#     wandb.define_metric("val/*",   step_metric="epoch")

#     # >>> ADDED — visibility of trainable params
#     total_params     = sum(p.numel() for p in model.parameters())
#     trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
#     print(f"Trainable params: {trainable_params:,} / {total_params:,} "
#           f"({100.0*trainable_params/total_params:.2f}%) ; unfreeze_last_k={num_unfreeze}")
#     wandb.log({
#         "params/total": total_params,
#         "params/trainable": trainable_params,
#         "params/ratio": trainable_params / max(1, total_params),
#     }, step=0)



#     for epoch in range(epochs):
#         model.train()
#         t0 = time.time()
#         running_loss = 0.0
#         for step, batch in enumerate(train_loader):
#             batch = {k: v.to(DEVICE) for k, v in batch.items()}
#             optimizer.zero_grad(set_to_none=True)
#             with autocast(enabled=(DEVICE == "cuda" and USE_AMP)):
#                 outputs = model(**batch)
#                 loss = outputs.loss
#             scaler.scale(loss).backward()
#             if GRAD_CLIP is not None:
#                 scaler.unscale_(optimizer)
#                 torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
#             scaler.step(optimizer); scaler.update(); scheduler.step()
#             running_loss += loss.item()

#             if step % 20 == 0:
#                 wandb.log({"train/loss": loss.item(), "step": step + 1, "epoch": epoch + 1})

#         # epoch-end validation
#         val_metrics = evaluate(model, val_loader)
#         elapsed = time.time() - t0

#         epoch_loss = running_loss / max(1, len(train_loader))
#         current_lr = scheduler.get_last_lr()[0]
#         wandb.log({
#             "train/epoch_loss": epoch_loss,
#             "val/acc": val_metrics["acc"],
#             "val/precision": val_metrics["precision"],
#             "val/recall": val_metrics["recall"],
#             "val/f1": val_metrics["f1"],
#             "lr": current_lr,
#             "time/epoch_sec": elapsed,
#             "epoch": epoch + 1,
#         })

#         # Early stopping on val f1
#         if val_metrics["f1"] > best_metric:
#             best_metric = val_metrics["f1"]
#             torch.save(model.state_dict(), best_path)
#             no_improve = 0
#             wandb_run.summary["best_val_f1"] = best_metric
#             wandb_run.summary["best_checkpoint_path"] = best_path
#         else:
#             no_improve += 1
#             if no_improve >= patience:
#                 print(f"Early stopping at epoch {epoch+1}")
#                 break

#         print(f"Epoch {epoch+1}/{epochs} | "
#               f"loss={epoch_loss:.4f} | "
#               f"val_acc={val_metrics['acc']:.4f} | val_f1={val_metrics['f1']:.4f} | time={elapsed:.1f}s")

#     wandb.finish()

#     # Load best and return path + metrics on val for reference
#     model.load_state_dict(torch.load(best_path, map_location=DEVICE))
#     final_val = evaluate(model, val_loader)
#     return best_path, final_val

# # -------------------------
# # Optuna hyperparameter tuning (ALWAYS ON)
# # -------------------------

# # Constants
# FIXED_EPOCHS = 12
# FIXED_PATIENCE = 4


# def objective(trial: optuna.trial.Trial):
#     params = {
#         "run_name": f"{BASE_RUN_NAME}_optuna_trial_{trial.number}",
#         "num_unfreeze_last_layers": trial.suggest_int("num_unfreeze_last_layers", 4, 12),
#         "lr": trial.suggest_float("lr", 1e-6, 5e-5, log=True),
#         "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-1, log=True),
#         "batch_size": trial.suggest_categorical("batch_size", [4, 8, 16, 32, 64]),
#         "epochs": FIXED_EPOCHS,
#         "patience": FIXED_PATIENCE,
#         "trial_number": trial.number,
#     }
#     path, val_metrics = train_one_run(params)
#     # report intermediate value for pruning if enabled
#     trial.report(val_metrics["f1"], step=1)
#     return val_metrics["f1"]

# study = optuna.create_study(direction="maximize")
# study.optimize(objective, n_trials=TRIALS, show_progress_bar=True)
# print("Best trial:", study.best_trial.number, "F1:", study.best_value)
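# # epochs/patience are not Optuna params, so add them back (train_one_run reads hp["epochs"] and hp["patience"])
# best_params = {"run_name": f"{BASE_RUN_NAME}_best_optuna", "epochs": FIXED_EPOCHS, "patience": FIXED_PATIENCE, **study.best_trial.params}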

# # Retrain best config to get a clean checkpoint
# best_ckpt, _ = train_one_run(best_params)
# best_path = best_ckpt

# # -------------------------
# # Final evaluation on TEST (+ W&B logging)
# # -------------------------
# model = build_model(best_params["num_unfreeze_last_layers"])
# model.load_state_dict(torch.load(best_path, map_location=DEVICE))
# model.eval()

# all_preds, all_labels = [], []
# with torch.no_grad():
#     for batch in test_loader:
#         batch = {k: v.to(DEVICE) for k, v in batch.items()}
#         logits = model(**batch).logits
#         all_preds.extend(torch.argmax(logits, dim=-1).detach().cpu().tolist())
#         all_labels.extend(batch["labels"].detach().cpu().tolist())

# acc = accuracy_score(all_labels, all_preds)
# p, r, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="macro", zero_division=0)
# print(f"\nTEST | acc={acc:.4f} | f1_macro={f1:.4f} | precision_macro={p:.4f} | recall_macro={r:.4f}\n")

# print("Per-class report (ids map to labels):")
# print(ID2LABEL)
# report = classification_report(
#     all_labels, all_preds,
#     target_names=[ID2LABEL[i] for i in range(len(ORDER))],
#     zero_division=0, output_dict=True
# )
# print(classification_report(
#     all_labels, all_preds,
#     target_names=[ID2LABEL[i] for i in range(len(ORDER))],
#     zero_division=0
# ))

# # ---- W&B: log test metrics, per-class scores, and confusion matrix ----
# test_run = wandb.init(project=PROJECT, name=f"{BASE_RUN_NAME}_test", resume="allow", reinit=True)
# log_payload = {
#     "test/acc": acc,
#     "test/precision_macro": p,
#     "test/recall_macro": r,
#     "test/f1_macro": f1,
# }
# for cls_name in ORDER:
#     if cls_name in report:
#         log_payload[f"test/{cls_name}/precision"] = report[cls_name]["precision"]
#         log_payload[f"test/{cls_name}/recall"]    = report[cls_name]["recall"]
#         log_payload[f"test/{cls_name}/f1"]        = report[cls_name]["f1-score"]

# wandb.log(log_payload)

# cm = confusion_matrix(all_labels, all_preds, labels=list(range(len(ORDER))))
# wandb.log({
#     "test/confusion_matrix": wandb.plot.confusion_matrix(
#         y_true=all_labels,
#         preds=all_preds,
#         class_names=[ID2LABEL[i] for i in range(len(ID2LABEL))]
#     )
# })
# test_run.summary["best_checkpoint_path"] = best_path
# test_run.summary["test_f1_macro"] = f1
# wandb.finish()

# 🔧 Second Hyperparameter Tuning – Testing Bigger Batches & Higher LRs

From the **first tuning round**, we figured out that:  
- ✅ Unfreezing **all 12 layers** works best  
- ✅ LR around **3e-5** was the sweet spot  

But we weren’t fully sure how far we could push things, so in this round we tried a **wider search** to see what happens.

---

## ⚙️ Changes we made
- **Batch size:** tried `[4, 8, 16, 32, 64]` → to check if bigger batches give more stable training, or if small ones still generalize better.  
- **Learning rate:** expanded up to `1e-2` 🤯 → just to test if more aggressive fine-tuning can still converge.  
- **Unfreezing:** narrowed the range to **8–12 layers** (from 4–12 in the first round), since deeper unfreezing looked clearly better last time.  
- **Weight decay:** limited to `1e-6 – 1e-4`, because we saw high values ruin performance.  
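
One detail behind the weight-decay setting: in `get_optimizer_scheduler` the decay is applied only to parameters whose names do not contain `bias` or `LayerNorm.weight`. The sketch below shows that name-based split on a handful of illustrative parameter names. `split_by_decay` is a hypothetical helper, and the names are examples rather than a dump of the real model.

```python
from typing import Dict, List

NO_DECAY = ("bias", "LayerNorm.weight")

def split_by_decay(param_names: List[str]) -> Dict[str, List[str]]:
    """Group parameter names by whether weight decay should be applied to them."""
    groups: Dict[str, List[str]] = {"decay": [], "no_decay": []}
    for name in param_names:
        key = "no_decay" if any(nd in name for nd in NO_DECAY) else "decay"
        groups[key].append(name)
    return groups

# Illustrative parameter names (assumed, for demonstration only)
names = [
    "deberta.encoder.layer.11.attention.self.query_proj.weight",
    "deberta.encoder.layer.11.attention.self.query_proj.bias",
    "deberta.encoder.layer.11.output.LayerNorm.weight",
    "classifier.weight",
    "classifier.bias",
]
groups = split_by_decay(names)
for key, members in groups.items():
    print(key, len(members))
```

Only the dense weight matrices end up in the `decay` group, so the searched weight-decay value never touches biases or LayerNorm scales.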

---

## 🎯 Why we did this
- To see if **large batches** speed things up without hurting F1.  
- To test whether **big LRs** can still work with proper scheduling.  
- To confirm if **full unfreezing** keeps outperforming partial freezing.  
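
"Proper scheduling" here refers to the linear warmup and decay schedule used in the training code (`get_linear_schedule_with_warmup` with `WARMUP_RATIO = 0.06`). The sketch below reimplements only the multiplier that schedule applies to the peak learning rate, to show what a peak of `1e-2` means in practice. `TOTAL_STEPS` is a hypothetical run length.

```python
def linear_warmup_decay(step: int, total_steps: int, warmup_ratio: float = 0.06) -> float:
    """Multiplier applied to the peak LR at a given optimizer step."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # linear ramp from 0 up to 1
    # linear decay from 1 down to 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

TOTAL_STEPS = 1_000  # hypothetical run length
for peak_lr in (3e-5, 1e-2):
    lrs = [peak_lr * linear_warmup_decay(s, TOTAL_STEPS) for s in (0, 30, 60, 530, 1000)]
    print(f"peak={peak_lr:.0e}:", [f"{lr:.2e}" for lr in lrs])
```

The schedule only rescales the peak value. A peak of `1e-2` is still applied in full at the end of warmup, which is where instability is most likely to show up.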

---

## 📝 What we expect
Honestly, we think the **lower LR region (1e-5 – 1e-4)** will still win 🏆,  
but this round should give us a better sense of the **stability limits** when scaling both **batch size** and **learning rate**.  


In [4]:
from sklearn.model_selection import train_test_split
# Load CSVs (columns: ['UserName','ScreenName','Location','TweetAt','OriginalTweet','Sentiment'])
TRAIN_CSV = "Corona_NLP_train_cleaned_translated.csv"   # or "Corona_NLP_train.csv"
TEST_CSV  = "Corona_NLP_test_cleaned_translated.csv"    # or "Corona_NLP_test.csv"


df_train = pd.read_csv(TRAIN_CSV, encoding="utf-8", engine="python")
df_test  = pd.read_csv(TEST_CSV,  encoding="utf-8", engine="python")

In [5]:
# =========================
# ADV DL – Part B: mDeBERTa-v3-base fine-tuning – Exercise-4 style
# Custom loop + early stopping + W&B + Optuna ONLY; freeze base, unfreeze last k layers
# Uses df_train / df_test with columns: OriginalTweet (str), Sentiment (str)
# =========================

import os, math, random, time, json
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report, confusion_matrix
import torch
from torch.utils.data import Dataset, DataLoader
from torch.cuda.amp import autocast, GradScaler

# ---- deps ----
# !pip -q install transformers==4.43.3 optuna==3.6.1 wandb==0.17.5 >/dev/null

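# Set these before importing transformers: backend switches are only consulted
# at import time, so they do nothing if transformers is already loaded in this session.
os.environ["TRANSFORMERS_NO_TF"] = "1"
os.environ["TRANSFORMERS_NO_FLAX"] = "1"

import transformers
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    DataCollatorWithPadding, get_linear_schedule_with_warmup
)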

import optuna
import wandb

# -------------------------
# Constants (no CFG, Optuna-only workflow)
# -------------------------
MODEL_NAME = "microsoft/mdeberta-v3-base"
MAX_LEN = 512
BATCH_SIZE = 16
WARMUP_RATIO = 0.06
GRAD_CLIP = 1.0
USE_AMP = True
# PROJECT = "adv-dl-p2"
PROJECT = "adv-dl-p2-deberta-full"

BASE_RUN_NAME = "microsoft/mdeberta-v3-base_full_ex_4"
TRIALS = 20
SEED = 42
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def set_seed(seed=42):
    random.seed(seed); np.random.seed(seed)
    torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(SEED)

# ---- GPU perf toggles (Windows-safe) ----
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
try:
    torch.set_float32_matmul_precision("high")
except Exception:
    pass

# -------------------------
# Label mapping (5-way sentiment)
# -------------------------
CANON = {
    "extremely negative": "extremely negative",
    "negative": "negative",
    "neutral": "neutral",
    "positive": "positive",
    "extremely positive": "extremely positive",
}
ORDER = ["extremely negative","negative","neutral","positive","extremely positive"]
LABEL2ID = {lab: i for i, lab in enumerate(ORDER)}
ID2LABEL = {i: lab for lab, i in LABEL2ID.items()}

def normalize_label(s: str) -> str:
    s = str(s).strip().lower()
    s = s.replace("very negative", "extremely negative")
    s = s.replace("very positive", "extremely positive")
    s = s.replace("extreme negative", "extremely negative")
    s = s.replace("extreme positive", "extremely positive")
    return CANON.get(s, s)

# -------------------------
# Expect df_train, df_test in memory
# -------------------------
assert "OriginalTweet" in df_train.columns and "Sentiment" in df_train.columns, "df_train missing required columns"
assert "OriginalTweet" in df_test.columns and "Sentiment" in df_test.columns, "df_test missing required columns"

def prep_df(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df = df.dropna(subset=["OriginalTweet", "Sentiment"])
    df["text"] = df["OriginalTweet"].astype(str).str.strip()
    df["label_name"] = df["Sentiment"].apply(normalize_label)
    df = df[df["label_name"].isin(ORDER)].reset_index(drop=True)
    df["label"] = df["label_name"].map(LABEL2ID)
    return df[["text", "label", "label_name"]]

dftrain_ = prep_df(df_train)
dftest_  = prep_df(df_test)

train_df, val_df = train_test_split(
    dftrain_, test_size=0.1, stratify=dftrain_["label"], random_state=SEED
)
print(f"Train/Val/Test sizes: {len(train_df)}/{len(val_df)}/{len(dftest_)}")

# -------------------------
# Dataset & Collator
# -------------------------
class TweetDataset(Dataset):
    def __init__(self, df: pd.DataFrame, tokenizer: transformers.PreTrainedTokenizerBase, max_len: int):
        self.texts = df["text"].tolist()
        self.labels = df["label"].tolist()
        self.tok = tokenizer
        self.max_len = max_len
    def __len__(self): return len(self.texts)
    def __getitem__(self, idx):
        enc = self.tok(self.texts[idx], truncation=True, max_length=self.max_len, padding=False)
        enc["labels"] = self.labels[idx]
        return {k: torch.tensor(v) for k, v in enc.items()}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
train_ds = TweetDataset(train_df, tokenizer, MAX_LEN)
val_ds   = TweetDataset(val_df, tokenizer, MAX_LEN)
test_ds  = TweetDataset(dftest_, tokenizer, MAX_LEN)


BATCH_SIZE = 16  # same value as the constant defined above
# ---- pad_to_multiple_of=8 for Tensor Cores; Windows: workers=0 is often faster ----
collate_fn = DataCollatorWithPadding(tokenizer=tokenizer, pad_to_multiple_of=8)
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True,  collate_fn=collate_fn, num_workers=0, pin_memory=True, persistent_workers=False)
val_loader   = DataLoader(val_ds,   batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn, num_workers=0, pin_memory=True, persistent_workers=False)
test_loader  = DataLoader(test_ds,  batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn, num_workers=0, pin_memory=True, persistent_workers=False)

# -------------------------
# Model & Freeze/Unfreeze strategy
# -------------------------
def build_model(num_unfreeze_last_layers: int = 4):
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=len(ORDER), id2label=ID2LABEL, label2id=LABEL2ID
    )
    base = getattr(model, "roberta", None) or getattr(model, "bert", None) or getattr(model, "deberta", None)
    if base is not None:
        for p in base.parameters(): p.requires_grad = False
        if hasattr(base, "encoder") and hasattr(base.encoder, "layer"):
            k = num_unfreeze_last_layers
            if k > 0:
                for layer in base.encoder.layer[-k:]:
                    for p in layer.parameters(): p.requires_grad = True
    for p in model.classifier.parameters(): p.requires_grad = True
    return model.to(DEVICE)

# -------------------------
# Train / Eval utilities
# -------------------------
def get_optimizer_scheduler(model, num_training_steps: int, lr: float, weight_decay: float):
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {"params": [p for n, p in model.named_parameters() if p.requires_grad and not any(nd in n for nd in no_decay)], "weight_decay": weight_decay},
        {"params": [p for n, p in model.named_parameters() if p.requires_grad and any(nd in n for nd in no_decay)],  "weight_decay": 0.0},
    ]
    # try fused AdamW on CUDA (faster step) — falls back if unavailable
    try:
        optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=lr, weight_decay=weight_decay, fused=(DEVICE=="cuda"))
    except TypeError:
        optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=lr, weight_decay=weight_decay)
    num_warmup = int(num_training_steps * WARMUP_RATIO)
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup, num_training_steps=num_training_steps)
    return optimizer, scheduler

def evaluate(model, loader) -> Dict[str, float]:
    model.eval()
    preds, labels = [], []
    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(DEVICE, non_blocking=True) for k, v in batch.items()}
            # AMP autocast for faster eval math
            with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),
                          enabled=(DEVICE == "cuda" and USE_AMP)):
                logits = model(**batch).logits
            preds.extend(torch.argmax(logits, dim=-1).detach().cpu().tolist())
            labels.extend(batch["labels"].detach().cpu().tolist())
    acc = accuracy_score(labels, preds)
    p, r, f1, _ = precision_recall_fscore_support(labels, preds, average="macro", zero_division=0)
    return {"acc": acc, "precision": p, "recall": r, "f1": f1}

def train_one_run(hp: Dict) -> Tuple[str, Dict[str, float]]:
    """
    hp keys: run_name, num_unfreeze_last_layers, lr, weight_decay, epochs, patience,
    trial_number, batch_size (logged only; the DataLoaders use the global BATCH_SIZE)
    """
    run_name = hp["run_name"]
    num_unfreeze = int(hp["num_unfreeze_last_layers"])
    lr = float(hp["lr"])
    wd = float(hp["weight_decay"])
    # fall back to the fixed defaults when a key is missing (e.g. Optuna best_params)
    epochs   = int(hp.get("epochs",   FIXED_EPOCHS))
    patience = int(hp.get("patience", FIXED_PATIENCE))
    model = build_model(num_unfreeze)
    total_steps = int(math.ceil(len(train_loader) * epochs))
    optimizer, scheduler = get_optimizer_scheduler(model, total_steps, lr, wd)

    scaler = GradScaler(enabled=(DEVICE == "cuda" and USE_AMP))
    best_metric = -1.0
    # best_path = f"best_{run_name}.pt"
    no_improve = 0
    # make run_name safe for filesystem paths and ensure a folder exists
    safe_run_name = run_name.replace("/", "__").replace("\\", "__")
    ckpt_dir = "checkpoints"
    os.makedirs(ckpt_dir, exist_ok=True)
    best_path = os.path.join(ckpt_dir, f"best_{safe_run_name}.pt")

    wandb_run = wandb.init(
        project=PROJECT,
        name=run_name,
        config={
            "model": MODEL_NAME,
            "max_len": MAX_LEN,
            "batch_size": BATCH_SIZE,
            "epochs": epochs,
            "lr": lr,
            "weight_decay": wd,
            "warmup_ratio": WARMUP_RATIO,
            "grad_clip": GRAD_CLIP,
            "num_unfreeze_last_layers": num_unfreeze,
            "trial_number": hp.get("trial_number", -1),
            "suggested_batch_size": hp.get("batch_size", BATCH_SIZE),
        },
        reinit=True,
    )

    # nicer W&B charts
    wandb.define_metric("epoch")
    wandb.define_metric("step")
    wandb.define_metric("train/*", step_metric="step")
    wandb.define_metric("val/*",   step_metric="epoch")

    # print + log trainable params
    total_params     = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable params: {trainable_params:,} / {total_params:,} "
          f"({100.0*trainable_params/total_params:.2f}%) ; unfreeze_last_k={num_unfreeze}")
    wandb.log({"params/total": total_params,
               "params/trainable": trainable_params,
               "params/ratio": trainable_params/max(1,total_params)}, step=0)

    for epoch in range(epochs):
        model.train()
        t0 = time.time()
        running_loss = 0.0

        for step, batch in enumerate(train_loader):
            batch = {k: v.to(DEVICE, non_blocking=True) for k, v in batch.items()}
            optimizer.zero_grad(set_to_none=True)
            # use BF16 if supported; else FP16
            with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),
                          enabled=(DEVICE == "cuda" and USE_AMP)):
                outputs = model(**batch)
                loss = outputs.loss
            scaler.scale(loss).backward()
            if GRAD_CLIP is not None:
                scaler.unscale_(optimizer)
                torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
            scaler.step(optimizer); scaler.update(); scheduler.step()
            running_loss += loss.item()

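            if step % 20 == 0:
                # log against the global step so the W&B x-axis keeps increasing across epochs
                wandb.log({"train/loss": loss.item(),
                           "step": (epoch * len(train_loader)) + (step + 1),
                           "epoch": epoch + 1})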

            # periodic console + throughput log (about 10x per epoch)
            if step % max(1, len(train_loader)//10) == 0 or step == 1:
                avg_loss = running_loss / max(1, (step + 1))
                elapsed  = time.time() - t0
                items    = (step + 1) * BATCH_SIZE
                itps     = items / max(elapsed, 1e-6)
                print(f"[e{epoch+1} b{step+1}/{len(train_loader)}] loss={loss.item():.4f} avg={avg_loss:.4f} it/s={itps:.1f}")
                wandb.log({"train/avg_loss_so_far": avg_loss,
                           "train/items_per_sec": itps,
                           "step": (epoch * len(train_loader)) + (step + 1),
                           "epoch": epoch + 1})

        # epoch-end validation
        val_metrics = evaluate(model, val_loader)
        elapsed = time.time() - t0

        epoch_loss = running_loss / max(1, len(train_loader))
        current_lr = scheduler.get_last_lr()[0]
        wandb.log({
            "train/epoch_loss": epoch_loss,
            "val/acc": val_metrics["acc"],
            "val/precision": val_metrics["precision"],
            "val/recall": val_metrics["recall"],
            "val/f1": val_metrics["f1"],
            "lr": current_lr,
            "time/epoch_sec": elapsed,
            "epoch": epoch + 1,
        })

        # Early stopping on val f1
        if val_metrics["f1"] > best_metric:
            best_metric = val_metrics["f1"]
            torch.save(model.state_dict(), best_path)
            no_improve = 0
            wandb_run.summary["best_val_f1"] = best_metric
            wandb_run.summary["best_checkpoint_path"] = best_path
            wandb.log({"val/best_f1_so_far": best_metric, "val/best_epoch": epoch + 1})
        else:
            no_improve += 1
            if no_improve >= patience:
                print(f"Early stopping at epoch {epoch+1}")
                break

        print(f"Epoch {epoch+1}/{epochs} | "
              f"loss={epoch_loss:.4f} | "
              f"val_acc={val_metrics['acc']:.4f} | val_f1={val_metrics['f1']:.4f} | time={elapsed:.1f}s")

    # Load best and return path + metrics on val for reference
    model.load_state_dict(torch.load(best_path, map_location=DEVICE))
    final_val = evaluate(model, val_loader)

    # store final val in W&B summary for quick sorting; this must run before
    # wandb.finish(), because wandb.run is None once the run is closed
    wandb_run.summary["final_val_acc"] = final_val["acc"]
    wandb_run.summary["final_val_precision"] = final_val["precision"]
    wandb_run.summary["final_val_recall"] = final_val["recall"]
    wandb_run.summary["final_val_f1"] = final_val["f1"]
    wandb.finish()

    return best_path, final_val

# -------------------------
# Optuna hyperparameter tuning (ALWAYS ON)
# -------------------------

# Constants
FIXED_EPOCHS = 12
FIXED_PATIENCE = 4

def objective(trial: optuna.trial.Trial):
    params = {
        "run_name": f"{BASE_RUN_NAME}_optuna_trial_{trial.number}",
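        "num_unfreeze_last_layers": trial.suggest_int("num_unfreeze_last_layers", 8, 12),
        # NOTE: 1e-4 to 1e-2 is high for fine-tuning a pretrained encoder;
        # values around 1e-5 to 5e-5 are the more common starting range
        "lr": trial.suggest_float("lr", 1e-4, 1e-2, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-4, log=True),
        # NOTE: sampled and logged only; the DataLoaders are built once with the global BATCH_SIZE
        "batch_size": trial.suggest_categorical("batch_size", [4, 8, 16, 32, 64]),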
        "epochs": FIXED_EPOCHS,
        "patience": FIXED_PATIENCE,
        "trial_number": trial.number,
    }
    path, val_metrics = train_one_run(params)
    # console visibility per trial
    print(f"[Trial {trial.number}] f1={val_metrics['f1']:.4f} | "
          f"unfreeze_k={params['num_unfreeze_last_layers']} lr={params['lr']:.2e} "
          f"wd={params['weight_decay']:.1e} suggested_bs={params['batch_size']}")
    # report the final value once; pruning would need a report after every epoch
    trial.report(val_metrics["f1"], step=1)
    return val_metrics["f1"]



Train/Val/Test sizes: 37039/4116/3798
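
The freezing logic in `build_model` can be sanity-checked with plain arithmetic. The sketch below assumes the usual base-size dimensions for mDeBERTa-v3-base (hidden size 768, feed-forward size 3072; neither value is stated in this notebook) and assumes that the pooler, which sits outside `model.deberta`, stays trainable together with the 5-way classifier. All helper names are illustrative.

```python
HIDDEN = 768         # ASSUMPTION: hidden size of mDeBERTa-v3-base
INTERMEDIATE = 3072  # ASSUMPTION: feed-forward (intermediate) size
NUM_LABELS = 5       # 5 sentiment classes, as in the notebook


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm_params(dim: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * dim


def encoder_layer_params() -> int:
    """One transformer layer: Q/K/V/output projections, FFN, and two LayerNorms."""
    attention = 4 * linear_params(HIDDEN, HIDDEN) + layer_norm_params(HIDDEN)
    ffn = (linear_params(HIDDEN, INTERMEDIATE)
           + linear_params(INTERMEDIATE, HIDDEN)
           + layer_norm_params(HIDDEN))
    return attention + ffn


def expected_trainable(k: int) -> int:
    """Trainable parameters when the last k layers, the pooler and the head train."""
    pooler = linear_params(HIDDEN, HIDDEN)
    classifier = linear_params(HIDDEN, NUM_LABELS)
    return k * encoder_layer_params() + pooler + classifier


for k in (11, 12):
    print(f"unfreeze_last_k={k}: {expected_trainable(k):,} trainable parameters")
```

Under these assumptions, `k=12` gives 85,648,901 and `k=11` gives 78,561,029, the same counts that `train_one_run` prints on its `Trainable params` line. The match suggests that the pooler is indeed being trained even though `build_model` never unfreezes it explicitly.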




### Run Study

In [5]:
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=TRIALS, show_progress_bar=True)
print("Best trial:", study.best_trial.number, "F1:", study.best_value)
best_params = {"run_name": f"{BASE_RUN_NAME}_best_optuna", **study.best_trial.params}



[I 2025-08-16 02:23:31,029] A new study created in memory with name: no-name-b0cc087a-195e-454e-8e52-0f0bae1b98f4
  0%|          | 0/20 [00:00<?, ?it/s]
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
wandb: ERROR Failed to detect the name of this notebook. You can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: adishalit1 (adishalit1-tel-aviv-university) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


Trainable params: 85,648,901 / 278,813,189 (30.72%) ; unfreeze_last_k=12


[e1 b1/2315] loss=1.6808 avg=1.6808 it/s=38.9
[e1 b2/2315] loss=1.6321 avg=1.6564 it/s=60.4
[e1 b232/2315] loss=1.5153 avg=1.5628 it/s=329.1
[e1 b463/2315] loss=1.1529 avg=1.4154 it/s=340.2
[e1 b694/2315] loss=1.0651 avg=1.3113 it/s=349.1
[e1 b925/2315] loss=1.4492 avg=1.2348 it/s=352.0
[e1 b1156/2315] loss=0.6699 avg=1.1797 it/s=352.3
[e1 b1387/2315] loss=1.5224 avg=1.2216 it/s=338.3
[e1 b1618/2315] loss=1.5436 avg=1.2721 it/s=326.3
[e1 b1849/2315] loss=1.6322 avg=1.3118 it/s=329.8
[e1 b2080/2315] loss=1.6153 avg=1.3416 it/s=332.6
[e1 b2311/2315] loss=1.6319 avg=1.3653 it/s=333.9


Epoch 1/12 | loss=1.3656 | val_acc=0.2775 | val_f1=0.0869 | time=115.5s
[e2 b1/2315] loss=1.4302 avg=1.4302 it/s=333.8
[e2 b2/2315] loss=1.5803 avg=1.5053 it/s=366.2


[e2 b232/2315] loss=1.5652 avg=1.5686 it/s=349.4
[e2 b463/2315] loss=1.6740 avg=1.5773 it/s=338.0
[e2 b694/2315] loss=1.5795 avg=1.5784 it/s=344.3
[e2 b925/2315] loss=1.5484 avg=1.5786 it/s=348.7
[e2 b1156/2315] loss=1.5892 avg=1.5781 it/s=352.7
[e2 b1387/2315] loss=1.6029 avg=1.5772 it/s=351.6
[e2 b1618/2315] loss=1.5469 avg=1.5774 it/s=349.4
[e2 b1849/2315] loss=1.5714 avg=1.5775 it/s=346.4
[e2 b2080/2315] loss=1.5982 avg=1.5784 it/s=345.2
[e2 b2311/2315] loss=1.6026 avg=1.5784 it/s=344.0


Epoch 2/12 | loss=1.5784 | val_acc=0.2775 | val_f1=0.0869 | time=112.4s
[e3 b1/2315] loss=1.5432 avg=1.5432 it/s=358.2
[e3 b2/2315] loss=1.4834 avg=1.5133 it/s=372.5


[e3 b232/2315] loss=1.5902 avg=1.5833 it/s=335.7
[e3 b463/2315] loss=1.5635 avg=1.5781 it/s=342.5
[e3 b694/2315] loss=1.5055 avg=1.5775 it/s=344.7
[e3 b925/2315] loss=1.5748 avg=1.5787 it/s=349.3
[e3 b1156/2315] loss=1.6232 avg=1.5773 it/s=351.9
[e3 b1387/2315] loss=1.4228 avg=1.5772 it/s=351.6
[e3 b1618/2315] loss=1.5484 avg=1.5768 it/s=346.4
[e3 b1849/2315] loss=1.5915 avg=1.5767 it/s=343.4
[e3 b2080/2315] loss=1.6284 avg=1.5774 it/s=341.9
[e3 b2311/2315] loss=1.6951 avg=1.5776 it/s=342.5


Epoch 3/12 | loss=1.5776 | val_acc=0.2775 | val_f1=0.0869 | time=112.8s
[e4 b1/2315] loss=1.5595 avg=1.5595 it/s=341.1
[e4 b2/2315] loss=1.6092 avg=1.5844 it/s=300.8


[e4 b232/2315] loss=1.4839 avg=1.5767 it/s=351.4
[e4 b463/2315] loss=1.4571 avg=1.5799 it/s=348.7
[e4 b694/2315] loss=1.6158 avg=1.5780 it/s=349.6
[e4 b925/2315] loss=1.6341 avg=1.5772 it/s=350.9
[e4 b1156/2315] loss=1.4867 avg=1.5782 it/s=354.2
[e4 b1387/2315] loss=1.5687 avg=1.5774 it/s=355.6
[e4 b1618/2315] loss=1.4918 avg=1.5774 it/s=353.3
[e4 b1849/2315] loss=1.3948 avg=1.5768 it/s=349.0
[e4 b2080/2315] loss=1.5559 avg=1.5766 it/s=347.6
[e4 b2311/2315] loss=1.6125 avg=1.5772 it/s=346.7


Epoch 4/12 | loss=1.5772 | val_acc=0.2410 | val_f1=0.0777 | time=111.6s
[e5 b1/2315] loss=1.5629 avg=1.5629 it/s=371.3
[e5 b2/2315] loss=1.5717 avg=1.5673 it/s=320.8


[e5 b232/2315] loss=1.5188 avg=1.5754 it/s=339.7
[e5 b463/2315] loss=1.7075 avg=1.5738 it/s=338.1
[e5 b694/2315] loss=1.6032 avg=1.5753 it/s=329.9
[e5 b925/2315] loss=1.5946 avg=1.5761 it/s=329.3
[e5 b1156/2315] loss=1.6991 avg=1.5766 it/s=334.4
[e5 b1387/2315] loss=1.6866 avg=1.5752 it/s=339.0
[e5 b1618/2315] loss=1.5170 avg=1.5754 it/s=339.6
[e5 b1849/2315] loss=1.6717 avg=1.5767 it/s=337.6
[e5 b2080/2315] loss=1.5221 avg=1.5767 it/s=334.9
[e5 b2311/2315] loss=1.6045 avg=1.5769 it/s=334.5


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Early stopping at epoch 5


Run history: sparkline charts omitted (epoch, lr, params/ratio, params/total, params/trainable, step, time/epoch_sec, train/avg_loss_so_far, train/epoch_loss, train/items_per_sec)

Run summary:
best_checkpoint_path   checkpoints\best_mic...
best_val_f1            0.08688
epoch                  5
lr                     9e-05
params/ratio           0.30719
params/total           278813189
params/trainable       85648901
step                   11571
time/epoch_sec         115.52365
train/avg_loss_so_far  1.57691


Best trial: 0. Best value: 0.0868771:   5%|▌         | 1/20 [09:38<3:03:17, 578.81s/it]

[Trial 0] f1=0.0869 | unfreeze_k=12 lr=1.45e-04 wd=4.9e-06 suggested_bs=16
[I 2025-08-16 02:33:09,835] Trial 0 finished with value: 0.08687713959680486 and parameters: {'num_unfreeze_last_layers': 12, 'lr': 0.00014543889726670146, 'weight_decay': 4.883984881682272e-06, 'batch_size': 16}. Best is trial 0 with value: 0.08687713959680486.
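
Most epochs above report `val_acc=0.2775` together with `val_f1=0.0869`, which is the pattern expected from a model that predicts a single class for every input. As a minimal check, the sketch below computes the macro F1 of such a constant predictor. It assumes the predicted class covers 1142 of the 4116 validation tweets (a count inferred from the logged accuracy, not stated in the notebook); the layout of the remaining labels is hypothetical and does not affect the result.

```python
def macro_f1(labels, preds, num_classes):
    """Macro-averaged F1; a class with no true or predicted examples scores 0."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


N_VAL = 4116       # validation size from the "Train/Val/Test sizes" line
N_MAJORITY = 1142  # ASSUMPTION: inferred from val_acc=0.2775
NUM_CLASSES = 5

# Hypothetical label layout: class 0 is the predicted class, the rest are spread arbitrarily
labels = [0] * N_MAJORITY + [1 + (i % 4) for i in range(N_VAL - N_MAJORITY)]
preds = [0] * N_VAL  # constant predictor

acc = sum(y == p for y, p in zip(labels, preds)) / N_VAL
f1 = macro_f1(labels, preds, NUM_CLASSES)
print(f"acc={acc:.4f} macro_f1={f1:.7f}")
```

Under that assumption the result agrees with the logged `0.0868771` to seven decimal places, so a trial that ends at this value has most likely collapsed to one class rather than learned a weak classifier.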




Trainable params: 85,648,901 / 278,813,189 (30.72%) ; unfreeze_last_k=12
[e1 b1/2315] loss=1.6378 avg=1.6378 it/s=116.7


[e1 b2/2315] loss=1.6414 avg=1.6396 it/s=153.0
[e1 b232/2315] loss=1.6606 avg=1.6056 it/s=328.4
[e1 b463/2315] loss=1.5806 avg=1.5987 it/s=327.3
[e1 b694/2315] loss=1.5971 avg=1.5942 it/s=333.3
[e1 b925/2315] loss=1.4688 avg=1.5925 it/s=334.0
[e1 b1156/2315] loss=1.5293 avg=1.5917 it/s=336.1
[e1 b1387/2315] loss=1.6028 avg=1.5904 it/s=339.0
[e1 b1618/2315] loss=1.6762 avg=1.5895 it/s=337.7
[e1 b1849/2315] loss=1.5924 avg=1.5880 it/s=337.1
[e1 b2080/2315] loss=1.5871 avg=1.5869 it/s=335.6
[e1 b2311/2315] loss=1.6652 avg=1.5862 it/s=335.8


Epoch 1/12 | loss=1.5861 | val_acc=0.2775 | val_f1=0.0869 | time=114.9s
[e2 b1/2315] loss=1.6434 avg=1.6434 it/s=337.1
[e2 b2/2315] loss=1.6566 avg=1.6500 it/s=365.4


[e2 b232/2315] loss=1.6233 avg=1.5693 it/s=350.1
[e2 b463/2315] loss=1.5549 avg=1.5736 it/s=345.0
[e2 b694/2315] loss=1.5481 avg=1.5731 it/s=340.6
[e2 b925/2315] loss=1.5167 avg=1.5750 it/s=338.4
[e2 b1156/2315] loss=1.5878 avg=1.5763 it/s=339.8
[e2 b1387/2315] loss=1.5893 avg=1.5756 it/s=343.3
[e2 b1618/2315] loss=1.5569 avg=1.5747 it/s=345.5
[e2 b1849/2315] loss=1.6099 avg=1.5751 it/s=342.7
[e2 b2080/2315] loss=1.5239 avg=1.5750 it/s=341.4
[e2 b2311/2315] loss=1.6262 avg=1.5756 it/s=339.2


Epoch 2/12 | loss=1.5756 | val_acc=0.2775 | val_f1=0.0869 | time=113.9s
[e3 b1/2315] loss=1.5289 avg=1.5289 it/s=312.7
[e3 b2/2315] loss=1.5566 avg=1.5428 it/s=328.3


[e3 b232/2315] loss=1.6162 avg=1.5681 it/s=343.6
[e3 b463/2315] loss=1.6500 avg=1.5717 it/s=345.8
[e3 b694/2315] loss=1.6132 avg=1.5758 it/s=342.2
[e3 b925/2315] loss=1.5029 avg=1.5753 it/s=341.7
[e3 b1156/2315] loss=1.5983 avg=1.5734 it/s=342.7
[e3 b1387/2315] loss=1.6592 avg=1.5744 it/s=345.5
[e3 b1618/2315] loss=1.5318 avg=1.5753 it/s=347.4
[e3 b1849/2315] loss=1.5763 avg=1.5759 it/s=347.5
[e3 b2080/2315] loss=1.5539 avg=1.5755 it/s=344.0
[e3 b2311/2315] loss=1.5121 avg=1.5754 it/s=341.8


Epoch 3/12 | loss=1.5753 | val_acc=0.2775 | val_f1=0.0869 | time=113.1s
[e4 b1/2315] loss=1.6503 avg=1.6503 it/s=331.5
[e4 b2/2315] loss=1.5532 avg=1.6017 it/s=320.3


[e4 b232/2315] loss=1.6231 avg=1.5665 it/s=335.4
[e4 b463/2315] loss=1.5951 avg=1.5695 it/s=330.6
[e4 b694/2315] loss=1.5835 avg=1.5723 it/s=331.9
[e4 b925/2315] loss=1.6245 avg=1.5746 it/s=330.4
[e4 b1156/2315] loss=1.5473 avg=1.5751 it/s=330.1
[e4 b1387/2315] loss=1.5452 avg=1.5755 it/s=333.2
[e4 b1618/2315] loss=1.6270 avg=1.5753 it/s=335.7
[e4 b1849/2315] loss=1.6238 avg=1.5759 it/s=338.6
[e4 b2080/2315] loss=1.7159 avg=1.5753 it/s=338.3
[e4 b2311/2315] loss=1.6571 avg=1.5752 it/s=337.1


Epoch 4/12 | loss=1.5752 | val_acc=0.2775 | val_f1=0.0869 | time=114.7s
[e5 b1/2315] loss=1.6238 avg=1.6238 it/s=362.2
[e5 b2/2315] loss=1.4971 avg=1.5604 it/s=339.0


[e5 b232/2315] loss=1.5562 avg=1.5730 it/s=338.9
[e5 b463/2315] loss=1.6033 avg=1.5711 it/s=341.6
[e5 b694/2315] loss=1.5936 avg=1.5730 it/s=341.0
[e5 b925/2315] loss=1.5695 avg=1.5741 it/s=341.7
[e5 b1156/2315] loss=1.6179 avg=1.5745 it/s=342.0
[e5 b1387/2315] loss=1.7078 avg=1.5744 it/s=341.5
[e5 b1618/2315] loss=1.7004 avg=1.5757 it/s=343.7
[e5 b1849/2315] loss=1.5581 avg=1.5755 it/s=344.9
[e5 b2080/2315] loss=1.5764 avg=1.5753 it/s=345.5
[e5 b2311/2315] loss=1.6550 avg=1.5753 it/s=345.0




Early stopping at epoch 5


Run history: sparkline charts omitted (epoch, lr, params/ratio, params/total, params/trainable, step, time/epoch_sec, train/avg_loss_so_far, train/epoch_loss, train/items_per_sec)

Run summary:
best_checkpoint_path   checkpoints\best_mic...
best_val_f1            0.08688
epoch                  5
lr                     0.00213
params/ratio           0.30719
params/total           278813189
params/trainable       85648901
step                   11571
time/epoch_sec         112.23571
train/avg_loss_so_far  1.57527


Best trial: 0. Best value: 0.0868771:  10%|█         | 2/20 [19:17<2:53:33, 578.52s/it]

[Trial 1] f1=0.0869 | unfreeze_k=12 lr=3.43e-03 wd=7.8e-05 suggested_bs=4
[I 2025-08-16 02:42:48,167] Trial 1 finished with value: 0.08687713959680486 and parameters: {'num_unfreeze_last_layers': 12, 'lr': 0.003433856232028575, 'weight_decay': 7.805045784095608e-05, 'batch_size': 4}. Best is trial 0 with value: 0.08687713959680486.
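
The early-stopping rule in `train_one_run` uses a strict `>` comparison, so an epoch that only ties the best validation F1 counts as "no improvement". The sketch below replays that rule in isolation, using `patience=4` as set by `FIXED_PATIENCE`. The first four F1 values are the ones printed for Trial 1; the fifth is a hypothetical placeholder, because the loop breaks before printing the epoch-5 summary. The function name is illustrative.

```python
def epochs_until_stop(val_f1_per_epoch, patience):
    """Replay the notebook's early-stopping rule; return (last_epoch_run, best_f1)."""
    best = -1.0
    no_improve = 0
    for epoch, f1 in enumerate(val_f1_per_epoch, start=1):
        if f1 > best:            # strict: a tie is not an improvement
            best, no_improve = f1, 0
        else:
            no_improve += 1
            if no_improve >= patience:
                return epoch, best
    return len(val_f1_per_epoch), best


# Epochs 1-4 as printed for Trial 1; epoch 5 is an ASSUMED value (not shown in the log)
trial_1_f1 = [0.0869, 0.0869, 0.0869, 0.0869, 0.0869]
stopped_at, best_f1 = epochs_until_stop(trial_1_f1, patience=4)
print(f"stopped after epoch {stopped_at}, best F1 {best_f1}")
```

With a flat F1 curve the counter reaches the patience limit at epoch 5, which is consistent with the `Early stopping at epoch 5` message in each trial shown here.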




Trainable params: 78,561,029 / 278,813,189 (28.18%) ; unfreeze_last_k=11
[e1 b1/2315] loss=1.5846 avg=1.5846 it/s=238.5
[e1 b2/2315] loss=1.5538 avg=1.5692 it/s=246.9


[e1 b232/2315] loss=1.5739 avg=1.5397 it/s=361.7
[e1 b463/2315] loss=1.5558 avg=1.5122 it/s=365.0
[e1 b694/2315] loss=1.5310 avg=1.5372 it/s=365.7
[e1 b925/2315] loss=1.5938 avg=1.5501 it/s=362.5
[e1 b1156/2315] loss=1.5692 avg=1.5569 it/s=358.9
[e1 b1387/2315] loss=1.4163 avg=1.5618 it/s=359.1
[e1 b1618/2315] loss=1.6138 avg=1.5650 it/s=361.7
[e1 b1849/2315] loss=1.6691 avg=1.5678 it/s=364.1
[e1 b2080/2315] loss=1.5256 avg=1.5690 it/s=366.1
[e1 b2311/2315] loss=1.6005 avg=1.5705 it/s=365.0


Epoch 1/12 | loss=1.5705 | val_acc=0.2775 | val_f1=0.0869 | time=106.1s
[e2 b1/2315] loss=1.4957 avg=1.4957 it/s=310.3
[e2 b2/2315] loss=1.6501 avg=1.5729 it/s=357.1


[e2 b232/2315] loss=1.5008 avg=1.5751 it/s=335.0
[e2 b463/2315] loss=1.6344 avg=1.5756 it/s=346.9
[e2 b694/2315] loss=1.5768 avg=1.5769 it/s=338.7
[e2 b925/2315] loss=1.5013 avg=1.5742 it/s=335.6
[e2 b1156/2315] loss=1.5954 avg=1.5767 it/s=338.8
[e2 b1387/2315] loss=1.5392 avg=1.5766 it/s=341.5
[e2 b1618/2315] loss=1.6447 avg=1.5761 it/s=341.8
[e2 b1849/2315] loss=1.6838 avg=1.5766 it/s=345.7
[e2 b2080/2315] loss=1.5803 avg=1.5781 it/s=348.8
[e2 b2311/2315] loss=1.6587 avg=1.5779 it/s=350.2


Epoch 2/12 | loss=1.5780 | val_acc=0.2775 | val_f1=0.0869 | time=110.6s
[e3 b1/2315] loss=1.5475 avg=1.5475 it/s=392.1
[e3 b2/2315] loss=1.5026 avg=1.5251 it/s=305.1


[e3 b232/2315] loss=1.5488 avg=1.5730 it/s=335.9
[e3 b463/2315] loss=1.5503 avg=1.5728 it/s=331.2
[e3 b694/2315] loss=1.5780 avg=1.5778 it/s=332.4
[e3 b925/2315] loss=1.6537 avg=1.5781 it/s=339.5
[e3 b1156/2315] loss=1.6501 avg=1.5789 it/s=342.2
[e3 b1387/2315] loss=1.4196 avg=1.5777 it/s=344.4
[e3 b1618/2315] loss=1.5088 avg=1.5774 it/s=346.7
[e3 b1849/2315] loss=1.5759 avg=1.5769 it/s=348.9
[e3 b2080/2315] loss=1.6033 avg=1.5769 it/s=352.0
[e3 b2311/2315] loss=1.5270 avg=1.5764 it/s=354.5


Epoch 3/12 | loss=1.5765 | val_acc=0.2775 | val_f1=0.0869 | time=109.1s
[e4 b1/2315] loss=1.5310 avg=1.5310 it/s=316.3
[e4 b2/2315] loss=1.6388 avg=1.5849 it/s=323.0


[e4 b232/2315] loss=1.5686 avg=1.5771 it/s=364.0
[e4 b463/2315] loss=1.5602 avg=1.5797 it/s=350.9
[e4 b694/2315] loss=1.5290 avg=1.5782 it/s=350.6
[e4 b925/2315] loss=1.6474 avg=1.5758 it/s=351.0
[e4 b1156/2315] loss=1.4673 avg=1.5741 it/s=353.3
[e4 b1387/2315] loss=1.5972 avg=1.5753 it/s=357.0
[e4 b1618/2315] loss=1.4762 avg=1.5749 it/s=358.9
[e4 b1849/2315] loss=1.5863 avg=1.5752 it/s=357.6
[e4 b2080/2315] loss=1.6037 avg=1.5754 it/s=358.8
[e4 b2311/2315] loss=1.6893 avg=1.5759 it/s=359.4


Epoch 4/12 | loss=1.5760 | val_acc=0.2775 | val_f1=0.0869 | time=107.6s
[e5 b1/2315] loss=1.5547 avg=1.5547 it/s=382.6
[e5 b2/2315] loss=1.6833 avg=1.6190 it/s=354.9


[e5 b232/2315] loss=1.6363 avg=1.5762 it/s=333.3
[e5 b463/2315] loss=1.5842 avg=1.5744 it/s=344.5
[e5 b694/2315] loss=1.5812 avg=1.5752 it/s=344.7
[e5 b925/2315] loss=1.6426 avg=1.5753 it/s=342.3
[e5 b1156/2315] loss=1.5018 avg=1.5750 it/s=344.3
[e5 b1387/2315] loss=1.6505 avg=1.5751 it/s=346.8
[e5 b1618/2315] loss=1.5657 avg=1.5752 it/s=347.9
[e5 b1849/2315] loss=1.5366 avg=1.5752 it/s=347.8
[e5 b2080/2315] loss=1.5946 avg=1.5754 it/s=348.2
[e5 b2311/2315] loss=1.5556 avg=1.5757 it/s=348.2




Early stopping at epoch 5


Run history: sparkline charts omitted (epoch, lr, params/ratio, params/total, params/trainable, step, time/epoch_sec, train/avg_loss_so_far, train/epoch_loss, train/items_per_sec)

Run summary:
best_checkpoint_path   checkpoints\best_mic...
best_val_f1            0.08688
epoch                  5
lr                     0.00068
params/ratio           0.28177
params/total           278813189
params/trainable       78561029
step                   11571
time/epoch_sec         111.21472
train/avg_loss_so_far  1.57574


Best trial: 0. Best value: 0.0868771:  15%|█▌        | 3/20 [28:31<2:40:44, 567.34s/it]

[Trial 2] f1=0.0869 | unfreeze_k=11 lr=1.10e-03 wd=1.2e-05 suggested_bs=16
[I 2025-08-16 02:52:02,203] Trial 2 finished with value: 0.08687713959680486 and parameters: {'num_unfreeze_last_layers': 11, 'lr': 0.0010978882671672725, 'weight_decay': 1.2213686939219737e-05, 'batch_size': 16}. Best is trial 0 with value: 0.08687713959680486.
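
The running training loss in these trials settles around 1.575 to 1.578. A useful reference point is the cross-entropy of a 5-class model that ignores its input: with uniform output probabilities it equals ln 5, and with a fixed class prior it equals the entropy of that prior, which can never exceed ln 5. The sketch below computes both; the class prior is hypothetical, since the notebook section does not list the label distribution.

```python
import math


def cross_entropy(true_dist, predicted_dist):
    """Expected negative log-likelihood (in nats) of a fixed predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(true_dist, predicted_dist) if t > 0)


NUM_CLASSES = 5
uniform = [1.0 / NUM_CLASSES] * NUM_CLASSES

# ASSUMPTION: an illustrative, mildly imbalanced class prior (not taken from the data)
hypothetical_prior = [0.28, 0.24, 0.20, 0.16, 0.12]

uniform_loss = cross_entropy(hypothetical_prior, uniform)           # always ln(5)
prior_loss = cross_entropy(hypothetical_prior, hypothetical_prior)  # entropy of the prior

print(f"uniform prediction loss: {uniform_loss:.4f}")
print(f"prior-only prediction loss: {prior_loss:.4f}")
```

A loss that stays just below ln 5 ≈ 1.609 for several epochs is what an input-independent predictor would produce, which fits the flat validation metrics logged for these trials.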




Trainable params: 78,561,029 / 278,813,189 (28.18%) ; unfreeze_last_k=11
[e1 b1/2315] loss=1.7347 avg=1.7347 it/s=248.3
[e1 b2/2315] loss=1.6141 avg=1.6744 it/s=283.5


[e1 b232/2315] loss=1.2967 avg=1.5205 it/s=370.9
[e1 b463/2315] loss=1.6420 avg=1.5107 it/s=369.3
[e1 b694/2315] loss=1.5959 avg=1.5375 it/s=350.9
[e1 b925/2315] loss=1.6525 avg=1.5479 it/s=345.5
[e1 b1156/2315] loss=1.5897 avg=1.5551 it/s=345.5
[e1 b1387/2315] loss=1.6541 avg=1.5604 it/s=348.3
[e1 b1618/2315] loss=1.5219 avg=1.5645 it/s=348.8
[e1 b1849/2315] loss=1.4165 avg=1.5665 it/s=351.4
[e1 b2080/2315] loss=1.5755 avg=1.5680 it/s=352.3
[e1 b2311/2315] loss=1.6179 avg=1.5701 it/s=353.3


Epoch 1/12 | loss=1.5700 | val_acc=0.2775 | val_f1=0.0869 | time=109.7s
[e2 b1/2315] loss=1.6608 avg=1.6608 it/s=349.3
[e2 b2/2315] loss=1.6035 avg=1.6321 it/s=349.1


[e2 b232/2315] loss=1.6730 avg=1.5753 it/s=377.3
[e2 b463/2315] loss=1.6403 avg=1.5805 it/s=374.7
[e2 b694/2315] loss=1.6995 avg=1.5790 it/s=374.0
[e2 b925/2315] loss=1.7749 avg=1.5788 it/s=370.8
[e2 b1156/2315] loss=1.5543 avg=1.5788 it/s=364.7
[e2 b1387/2315] loss=1.4817 avg=1.5792 it/s=363.3
[e2 b1618/2315] loss=1.5111 avg=1.5785 it/s=363.5
[e2 b1849/2315] loss=1.5856 avg=1.5783 it/s=363.3
[e2 b2080/2315] loss=1.5264 avg=1.5781 it/s=363.7
[e2 b2311/2315] loss=1.6283 avg=1.5783 it/s=364.0


Epoch 2/12 | loss=1.5783 | val_acc=0.2775 | val_f1=0.0869 | time=106.7s
[e3 b1/2315] loss=1.6973 avg=1.6973 it/s=395.2
[e3 b2/2315] loss=1.6605 avg=1.6789 it/s=354.9


[e3 b232/2315] loss=1.5737 avg=1.5774 it/s=350.2
[e3 b463/2315] loss=1.5078 avg=1.5793 it/s=346.7
[e3 b694/2315] loss=1.5811 avg=1.5780 it/s=348.9
[e3 b925/2315] loss=1.5757 avg=1.5785 it/s=346.9
[e3 b1156/2315] loss=1.6612 avg=1.5785 it/s=346.0
[e3 b1387/2315] loss=1.5547 avg=1.5779 it/s=347.4
[e3 b1618/2315] loss=1.5660 avg=1.5780 it/s=346.0
[e3 b1849/2315] loss=1.5709 avg=1.5779 it/s=347.3
[e3 b2080/2315] loss=1.6802 avg=1.5774 it/s=349.1
[e3 b2311/2315] loss=1.5554 avg=1.5767 it/s=349.9


Epoch 3/12 | loss=1.5768 | val_acc=0.2775 | val_f1=0.0869 | time=110.7s
[e4 b1/2315] loss=1.5357 avg=1.5357 it/s=378.0
[e4 b2/2315] loss=1.5260 avg=1.5308 it/s=311.7


[e4 b232/2315] loss=1.5854 avg=1.5756 it/s=348.2
[e4 b463/2315] loss=1.5633 avg=1.5788 it/s=349.6
[e4 b694/2315] loss=1.5377 avg=1.5756 it/s=357.4
[e4 b925/2315] loss=1.5177 avg=1.5755 it/s=361.5
[e4 b1156/2315] loss=1.5707 avg=1.5764 it/s=362.9
[e4 b1387/2315] loss=1.6020 avg=1.5764 it/s=358.6
[e4 b1618/2315] loss=1.5705 avg=1.5766 it/s=355.1
[e4 b1849/2315] loss=1.6270 avg=1.5756 it/s=354.4
[e4 b2080/2315] loss=1.5864 avg=1.5761 it/s=354.6
[e4 b2311/2315] loss=1.6322 avg=1.5760 it/s=354.9


Epoch 4/12 | loss=1.5762 | val_acc=0.2775 | val_f1=0.0869 | time=109.3s
[e5 b1/2315] loss=1.6017 avg=1.6017 it/s=305.9
[e5 b2/2315] loss=1.5445 avg=1.5731 it/s=308.2


[e5 b232/2315] loss=1.5794 avg=1.5775 it/s=352.7
[e5 b463/2315] loss=1.5576 avg=1.5780 it/s=357.9
[e5 b694/2315] loss=1.5687 avg=1.5775 it/s=352.7
[e5 b925/2315] loss=1.5131 avg=1.5785 it/s=356.1
[e5 b1156/2315] loss=1.6051 avg=1.5787 it/s=358.3
[e5 b1387/2315] loss=1.5211 avg=1.5784 it/s=360.4
[e5 b1618/2315] loss=1.6128 avg=1.5774 it/s=356.4
[e5 b1849/2315] loss=1.5619 avg=1.5760 it/s=353.0
[e5 b2080/2315] loss=1.5520 avg=1.5760 it/s=352.1
[e5 b2311/2315] loss=1.5141 avg=1.5758 it/s=353.3


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Early stopping at epoch 5


metric,history
epoch,▁▁▁▁▁▁▁▁▁▁▁▃▃▃▃▃▃▃▅▅▅▅▅▅▅▆▆▆▆▆▆▆████████
lr,█▆▅▃▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▂▂▂▃▁▁▂▂▂▃▃▅▃▃▁▁▂▂▂▇▂▂▃▃█▁▁▂▂▂▂▂▃▃▃▂▃▃▃
time/epoch_sec,▆▁█▆▆
train/avg_loss_so_far,█▁▂▂▃▃▃▇▆▃▄▄▄▄▄█▄▄▄▄▄▄▄▃▁▄▃▃▃▃▃▅▄▄▄▄▄▄▃▃
train/epoch_loss,▁█▇▆▆
train/items_per_sec,▁▃█▇▇▇▇▇▇▇█████▇▇▇▇▇▇▇▇▇▅▇▇▇▇▇▇▇▄▄▇▇▇▇▇▇

metric,value
best_checkpoint_path,checkpoints\best_mic...
best_val_f1,0.08688
epoch,5
lr,0.0005
params/ratio,0.28177
params/total,278813189
params/trainable,78561029
step,11571
time/epoch_sec,109.5185
train/avg_loss_so_far,1.57581


Best trial: 0. Best value: 0.0868771:  20%|██        | 4/20 [37:46<2:30:01, 562.60s/it]

[Trial 3] f1=0.0869 | unfreeze_k=11 lr=8.01e-04 wd=1.2e-06 suggested_bs=32
[I 2025-08-16 03:01:17,524] Trial 3 finished with value: 0.08687713959680486 and parameters: {'num_unfreeze_last_layers': 11, 'lr': 0.0008013317147079586, 'weight_decay': 1.1595878734573544e-06, 'batch_size': 32}. Best is trial 0 with value: 0.08687713959680486.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 78,561,029 / 278,813,189 (28.18%) ; unfreeze_last_k=11
[e1 b1/2315] loss=1.6446 avg=1.6446 it/s=223.8
[e1 b2/2315] loss=1.6263 avg=1.6354 it/s=247.1


[e1 b232/2315] loss=1.5899 avg=1.5610 it/s=346.3
[e1 b463/2315] loss=1.5345 avg=1.5703 it/s=346.8
[e1 b694/2315] loss=1.5665 avg=1.5770 it/s=349.5
[e1 b925/2315] loss=1.6103 avg=1.5814 it/s=355.8
[e1 b1156/2315] loss=1.5047 avg=1.5823 it/s=360.8
[e1 b1387/2315] loss=1.5970 avg=1.5831 it/s=364.7
[e1 b1618/2315] loss=1.5200 avg=1.5843 it/s=363.1
[e1 b1849/2315] loss=1.5443 avg=1.5855 it/s=359.2
[e1 b2080/2315] loss=1.6031 avg=1.5847 it/s=357.1
[e1 b2311/2315] loss=1.5934 avg=1.5843 it/s=354.0


Epoch 1/12 | loss=1.5843 | val_acc=0.2775 | val_f1=0.0869 | time=109.6s
[e2 b1/2315] loss=1.5098 avg=1.5098 it/s=409.9
[e2 b2/2315] loss=1.6034 avg=1.5566 it/s=369.5


[e2 b232/2315] loss=1.6358 avg=1.5799 it/s=332.4
[e2 b463/2315] loss=1.6434 avg=1.5815 it/s=341.6
[e2 b694/2315] loss=1.5707 avg=1.5810 it/s=346.6
[e2 b925/2315] loss=1.6916 avg=1.5781 it/s=348.7
[e2 b1156/2315] loss=1.6549 avg=1.5781 it/s=351.7
[e2 b1387/2315] loss=1.5724 avg=1.5796 it/s=354.8
[e2 b1618/2315] loss=1.6330 avg=1.5785 it/s=356.8
[e2 b1849/2315] loss=1.5563 avg=1.5782 it/s=355.0
[e2 b2080/2315] loss=1.5173 avg=1.5780 it/s=350.8
[e2 b2311/2315] loss=1.4785 avg=1.5783 it/s=348.6


Epoch 2/12 | loss=1.5783 | val_acc=0.2775 | val_f1=0.0869 | time=111.2s
[e3 b1/2315] loss=1.6988 avg=1.6988 it/s=383.1
[e3 b2/2315] loss=1.5779 avg=1.6384 it/s=391.4


[e3 b232/2315] loss=1.5573 avg=1.5843 it/s=354.8
[e3 b463/2315] loss=1.5343 avg=1.5770 it/s=350.8
[e3 b694/2315] loss=1.6748 avg=1.5779 it/s=353.4
[e3 b925/2315] loss=1.6094 avg=1.5775 it/s=356.6
[e3 b1156/2315] loss=1.6212 avg=1.5766 it/s=354.7
[e3 b1387/2315] loss=1.6550 avg=1.5770 it/s=358.5
[e3 b1618/2315] loss=1.5789 avg=1.5757 it/s=362.1
[e3 b1849/2315] loss=1.5823 avg=1.5761 it/s=364.9
[e3 b2080/2315] loss=1.7741 avg=1.5760 it/s=362.4
[e3 b2311/2315] loss=1.6372 avg=1.5765 it/s=361.0


Epoch 3/12 | loss=1.5765 | val_acc=0.2775 | val_f1=0.0869 | time=107.8s
[e4 b1/2315] loss=1.6831 avg=1.6831 it/s=301.0
[e4 b2/2315] loss=1.6141 avg=1.6486 it/s=324.0


[e4 b232/2315] loss=1.5656 avg=1.5756 it/s=345.9
[e4 b463/2315] loss=1.6067 avg=1.5771 it/s=350.0
[e4 b694/2315] loss=1.5584 avg=1.5774 it/s=353.8
[e4 b925/2315] loss=1.5874 avg=1.5765 it/s=355.2
[e4 b1156/2315] loss=1.5999 avg=1.5763 it/s=355.4
[e4 b1387/2315] loss=1.6144 avg=1.5762 it/s=358.4
[e4 b1618/2315] loss=1.6492 avg=1.5766 it/s=361.2
[e4 b1849/2315] loss=1.5981 avg=1.5765 it/s=364.3
[e4 b2080/2315] loss=1.6073 avg=1.5759 it/s=366.7
[e4 b2311/2315] loss=1.6157 avg=1.5760 it/s=365.1


Epoch 4/12 | loss=1.5760 | val_acc=0.2775 | val_f1=0.0869 | time=106.4s
[e5 b1/2315] loss=1.5403 avg=1.5403 it/s=292.1
[e5 b2/2315] loss=1.6032 avg=1.5717 it/s=316.5


[e5 b232/2315] loss=1.5298 avg=1.5701 it/s=318.1
[e5 b463/2315] loss=1.6164 avg=1.5739 it/s=327.6
[e5 b694/2315] loss=1.5285 avg=1.5744 it/s=335.9
[e5 b925/2315] loss=1.5815 avg=1.5747 it/s=337.3
[e5 b1156/2315] loss=1.5149 avg=1.5757 it/s=342.3
[e5 b1387/2315] loss=1.5883 avg=1.5762 it/s=345.1
[e5 b1618/2315] loss=1.5145 avg=1.5757 it/s=345.1
[e5 b1849/2315] loss=1.5051 avg=1.5761 it/s=348.1
[e5 b2080/2315] loss=1.5570 avg=1.5759 it/s=350.8
[e5 b2311/2315] loss=1.4400 avg=1.5758 it/s=352.3


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Early stopping at epoch 5


metric,history
epoch,▁▁▁▁▁▁▃▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▅▅▅▆▆▆▆▆▆█████████
lr,█▆▅▃▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▁▁▂▂▂▂▁▂▂▂▂▄▄▂▂▂▂▂▂▂▇▂▂▂▃▁▁▁▂█▂▂▂▂▂▂▂
time/epoch_sec,▆█▃▁▆
train/avg_loss_so_far,▆▆▃▃▃▄▄▄▁▃▄▄▄▄▄▄▄█▄▃▃▃▃▃▃▇▆▃▄▃▃▃▃▂▃▃▃▃▃▃
train/epoch_loss,█▃▂▁▁
train/items_per_sec,▁▂▆▆▆▆▆▆▆█▅▅▆▆▆▆▆▆▆▇▆▆▆▆▆▆▄▅▆▆▆▆▆▆▅▅▅▆▆▆

metric,value
best_checkpoint_path,checkpoints\best_mic...
best_val_f1,0.08688
epoch,5
lr,0.0006
params/ratio,0.28177
params/total,278813189
params/trainable,78561029
step,11571
time/epoch_sec,109.92836
train/avg_loss_so_far,1.57576


Best trial: 0. Best value: 0.0868771:  25%|██▌       | 5/20 [47:00<2:19:54, 559.66s/it]

[Trial 4] f1=0.0869 | unfreeze_k=11 lr=9.72e-04 wd=2.4e-06 suggested_bs=32
[I 2025-08-16 03:10:31,982] Trial 4 finished with value: 0.08687713959680486 and parameters: {'num_unfreeze_last_layers': 11, 'lr': 0.000972457938282282, 'weight_decay': 2.44725304634994e-06, 'batch_size': 32}. Best is trial 0 with value: 0.08687713959680486.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 71,473,157 / 278,813,189 (25.63%) ; unfreeze_last_k=10
[e1 b1/2315] loss=1.5872 avg=1.5872 it/s=134.9
[e1 b2/2315] loss=1.6386 avg=1.6129 it/s=183.6


[e1 b232/2315] loss=1.7015 avg=1.6000 it/s=348.6
[e1 b463/2315] loss=1.7169 avg=1.5976 it/s=349.9
[e1 b694/2315] loss=1.5347 avg=1.5953 it/s=353.7
[e1 b925/2315] loss=1.5900 avg=1.5954 it/s=356.9
[e1 b1156/2315] loss=1.6938 avg=1.5934 it/s=360.5
[e1 b1387/2315] loss=1.5294 avg=1.5907 it/s=360.4
[e1 b1618/2315] loss=1.5021 avg=1.5902 it/s=362.0
[e1 b1849/2315] loss=1.5310 avg=1.5890 it/s=363.8
[e1 b2080/2315] loss=1.4835 avg=1.5868 it/s=367.0
[e1 b2311/2315] loss=1.6154 avg=1.5853 it/s=370.1


Epoch 1/12 | loss=1.5853 | val_acc=0.2775 | val_f1=0.0869 | time=104.8s
[e2 b1/2315] loss=1.5165 avg=1.5165 it/s=366.9
[e2 b2/2315] loss=1.6145 avg=1.5655 it/s=335.7


[e2 b232/2315] loss=1.6088 avg=1.5816 it/s=365.0
[e2 b463/2315] loss=1.5169 avg=1.5802 it/s=358.7
[e2 b694/2315] loss=1.5104 avg=1.5770 it/s=354.1
[e2 b925/2315] loss=1.6194 avg=1.5770 it/s=358.4
[e2 b1156/2315] loss=1.5474 avg=1.5754 it/s=362.1
[e2 b1387/2315] loss=1.6633 avg=1.5752 it/s=366.0
[e2 b1618/2315] loss=1.6634 avg=1.5741 it/s=365.0
[e2 b1849/2315] loss=1.6291 avg=1.5750 it/s=365.4
[e2 b2080/2315] loss=1.5559 avg=1.5757 it/s=363.9
[e2 b2311/2315] loss=1.4976 avg=1.5761 it/s=365.9


Epoch 2/12 | loss=1.5761 | val_acc=0.2775 | val_f1=0.0869 | time=105.8s
[e3 b1/2315] loss=1.5508 avg=1.5508 it/s=408.3
[e3 b2/2315] loss=1.5859 avg=1.5684 it/s=455.1


[e3 b232/2315] loss=1.5744 avg=1.5830 it/s=380.7
[e3 b463/2315] loss=1.7045 avg=1.5832 it/s=382.2
[e3 b694/2315] loss=1.5800 avg=1.5839 it/s=369.2
[e3 b925/2315] loss=1.4985 avg=1.5807 it/s=369.5
[e3 b1156/2315] loss=1.5716 avg=1.5801 it/s=366.6
[e3 b1387/2315] loss=1.6324 avg=1.5782 it/s=367.0
[e3 b1618/2315] loss=1.6817 avg=1.5782 it/s=366.3
[e3 b1849/2315] loss=1.6026 avg=1.5775 it/s=366.4
[e3 b2080/2315] loss=1.4609 avg=1.5773 it/s=367.4
[e3 b2311/2315] loss=1.6115 avg=1.5777 it/s=367.8


Epoch 3/12 | loss=1.5778 | val_acc=0.2775 | val_f1=0.0869 | time=105.5s
[e4 b1/2315] loss=1.6581 avg=1.6581 it/s=356.8
[e4 b2/2315] loss=1.5281 avg=1.5931 it/s=419.0


[e4 b232/2315] loss=1.6411 avg=1.5722 it/s=385.1
[e4 b463/2315] loss=1.5977 avg=1.5749 it/s=389.9
[e4 b694/2315] loss=1.6205 avg=1.5766 it/s=390.6
[e4 b925/2315] loss=1.6608 avg=1.5773 it/s=385.8
[e4 b1156/2315] loss=1.6518 avg=1.5765 it/s=375.4
[e4 b1387/2315] loss=1.5106 avg=1.5760 it/s=368.0
[e4 b1618/2315] loss=1.6137 avg=1.5758 it/s=365.9
[e4 b1849/2315] loss=1.6417 avg=1.5756 it/s=367.0
[e4 b2080/2315] loss=1.6311 avg=1.5760 it/s=366.5
[e4 b2311/2315] loss=1.5174 avg=1.5756 it/s=367.5


Epoch 4/12 | loss=1.5756 | val_acc=0.2775 | val_f1=0.0869 | time=105.8s
[e5 b1/2315] loss=1.5357 avg=1.5357 it/s=355.3
[e5 b2/2315] loss=1.5856 avg=1.5606 it/s=338.3


[e5 b232/2315] loss=1.5079 avg=1.5795 it/s=379.4
[e5 b463/2315] loss=1.5891 avg=1.5811 it/s=376.7
[e5 b694/2315] loss=1.5522 avg=1.5803 it/s=375.1
[e5 b925/2315] loss=1.6026 avg=1.5774 it/s=375.5
[e5 b1156/2315] loss=1.6286 avg=1.5764 it/s=373.8
[e5 b1387/2315] loss=1.5966 avg=1.5773 it/s=369.2
[e5 b1618/2315] loss=1.6172 avg=1.5770 it/s=367.6
[e5 b1849/2315] loss=1.5693 avg=1.5765 it/s=367.7
[e5 b2080/2315] loss=1.4663 avg=1.5763 it/s=369.7
[e5 b2311/2315] loss=1.5638 avg=1.5766 it/s=371.7


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Early stopping at epoch 5


metric,history
epoch,▁▁▁▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▃▃▃▅▅▅▅▆▆▆▆▆▆█████████
lr,█▆▅▃▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▁▂▂▂▂▁▂▂▂▃▃▃▁▁▁▁▂▂▆▂▂▂▆▂▂▂▂▂▂▁█▁▂▂▂▃▃
time/epoch_sec,▃█▇█▁
train/avg_loss_so_far,▄▆▅▅▅▅▁▃▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄█▄▄▄▄▄▄▂▃▄▄▄▄▄▄▄▄
train/epoch_loss,█▁▃▁▂
train/items_per_sec,▁▆▆▆▆▆▆▆▆▆▆▆▆▆▆▇█▆▆▆▆▆▆▆▆▆▇▆▇▇▆▆▆▆▆▆▆▆▆▆

metric,value
best_checkpoint_path,checkpoints\best_mic...
best_val_f1,0.08688
epoch,5
lr,0.00346
params/ratio,0.25635
params/total,278813189
params/trainable,71473157
step,11571
time/epoch_sec,104.25235
train/avg_loss_so_far,1.57661


Best trial: 0. Best value: 0.0868771:  30%|███       | 6/20 [55:56<2:08:40, 551.49s/it]

[Trial 5] f1=0.0869 | unfreeze_k=10 lr=5.58e-03 wd=3.6e-05 suggested_bs=4
[I 2025-08-16 03:19:27,593] Trial 5 finished with value: 0.08687713959680486 and parameters: {'num_unfreeze_last_layers': 10, 'lr': 0.005577924778536141, 'weight_decay': 3.587344647763334e-05, 'batch_size': 4}. Best is trial 0 with value: 0.08687713959680486.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 71,473,157 / 278,813,189 (25.63%) ; unfreeze_last_k=10
[e1 b1/2315] loss=1.5964 avg=1.5964 it/s=219.1
[e1 b2/2315] loss=1.5804 avg=1.5884 it/s=250.4


[e1 b232/2315] loss=1.4293 avg=1.5170 it/s=369.7
[e1 b463/2315] loss=1.1116 avg=1.3605 it/s=367.1
[e1 b694/2315] loss=1.6282 avg=1.3897 it/s=363.9
[e1 b925/2315] loss=1.5849 avg=1.4409 it/s=370.1
[e1 b1156/2315] loss=1.6414 avg=1.4709 it/s=375.2
[e1 b1387/2315] loss=1.6236 avg=1.4901 it/s=374.9
[e1 b1618/2315] loss=1.6635 avg=1.5032 it/s=372.9
[e1 b1849/2315] loss=1.5444 avg=1.5135 it/s=371.4
[e1 b2080/2315] loss=1.5485 avg=1.5208 it/s=368.9
[e1 b2311/2315] loss=1.5876 avg=1.5269 it/s=368.8


Epoch 1/12 | loss=1.5270 | val_acc=0.2775 | val_f1=0.0869 | time=105.2s
[e2 b1/2315] loss=1.4975 avg=1.4975 it/s=309.3
[e2 b2/2315] loss=1.5963 avg=1.5469 it/s=358.3


[e2 b232/2315] loss=1.5886 avg=1.5846 it/s=353.5
[e2 b463/2315] loss=1.4518 avg=1.5812 it/s=361.6
[e2 b694/2315] loss=1.6056 avg=1.5817 it/s=363.6
[e2 b925/2315] loss=1.6155 avg=1.5812 it/s=366.9
[e2 b1156/2315] loss=1.6304 avg=1.5814 it/s=371.9
[e2 b1387/2315] loss=1.4889 avg=1.5803 it/s=373.1
[e2 b1618/2315] loss=1.5493 avg=1.5804 it/s=374.9
[e2 b1849/2315] loss=1.6541 avg=1.5809 it/s=370.6
[e2 b2080/2315] loss=1.6080 avg=1.5807 it/s=367.9
[e2 b2311/2315] loss=1.5669 avg=1.5804 it/s=366.1


Epoch 2/12 | loss=1.5804 | val_acc=0.2775 | val_f1=0.0869 | time=106.0s
[e3 b1/2315] loss=1.5524 avg=1.5524 it/s=352.3
[e3 b2/2315] loss=1.6559 avg=1.6041 it/s=346.1


[e3 b232/2315] loss=1.5415 avg=1.5749 it/s=377.2
[e3 b463/2315] loss=1.5897 avg=1.5753 it/s=375.1
[e3 b694/2315] loss=1.6474 avg=1.5754 it/s=373.7
[e3 b925/2315] loss=1.4304 avg=1.5759 it/s=371.3
[e3 b1156/2315] loss=1.5343 avg=1.5773 it/s=368.1
[e3 b1387/2315] loss=1.6368 avg=1.5778 it/s=371.5
[e3 b1618/2315] loss=1.7438 avg=1.5783 it/s=374.4
[e3 b1849/2315] loss=1.4793 avg=1.5786 it/s=377.1
[e3 b2080/2315] loss=1.6489 avg=1.5786 it/s=375.2
[e3 b2311/2315] loss=1.5204 avg=1.5787 it/s=371.4


Epoch 3/12 | loss=1.5787 | val_acc=0.2775 | val_f1=0.0869 | time=104.8s
[e4 b1/2315] loss=1.6637 avg=1.6637 it/s=298.7
[e4 b2/2315] loss=1.6908 avg=1.6772 it/s=299.5


[e4 b232/2315] loss=1.6486 avg=1.5825 it/s=349.6
[e4 b463/2315] loss=1.5055 avg=1.5767 it/s=350.3
[e4 b694/2315] loss=1.6754 avg=1.5768 it/s=354.5
[e4 b925/2315] loss=1.6348 avg=1.5767 it/s=357.6
[e4 b1156/2315] loss=1.6478 avg=1.5768 it/s=359.0
[e4 b1387/2315] loss=1.5606 avg=1.5776 it/s=359.3
[e4 b1618/2315] loss=1.5329 avg=1.5779 it/s=363.9
[e4 b1849/2315] loss=1.6214 avg=1.5778 it/s=367.9
[e4 b2080/2315] loss=1.5702 avg=1.5779 it/s=371.5
[e4 b2311/2315] loss=1.5989 avg=1.5780 it/s=373.1


Epoch 4/12 | loss=1.5780 | val_acc=0.2775 | val_f1=0.0869 | time=104.0s
[e5 b1/2315] loss=1.5773 avg=1.5773 it/s=356.1
[e5 b2/2315] loss=1.5464 avg=1.5619 it/s=392.0


[e5 b232/2315] loss=1.5407 avg=1.5784 it/s=358.3
[e5 b463/2315] loss=1.4804 avg=1.5783 it/s=348.8
[e5 b694/2315] loss=1.7122 avg=1.5790 it/s=346.3
[e5 b925/2315] loss=1.5106 avg=1.5772 it/s=346.6
[e5 b1156/2315] loss=1.5701 avg=1.5767 it/s=345.7
[e5 b1387/2315] loss=1.6431 avg=1.5775 it/s=350.0
[e5 b1618/2315] loss=1.5932 avg=1.5776 it/s=353.5
[e5 b1849/2315] loss=1.4727 avg=1.5762 it/s=354.0
[e5 b2080/2315] loss=1.5904 avg=1.5764 it/s=357.7
[e5 b2311/2315] loss=1.5821 avg=1.5771 it/s=360.2


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Early stopping at epoch 5


metric,history
epoch,▁▁▁▁▁▁▃▃▃▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▆▆▆▆▆▆██████████
lr,█▆▅▃▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▂▂▄▅▅▇██▄▆▁▂▂▂▃▅▆▆▇█▁▁▂▃▄▅▆▇██▂▃▃▅▅▆▇▇▇█
time/epoch_sec,▃▅▃▁█
train/avg_loss_so_far,▆▆▄▁▂▃▄▄▅▄▆▆▆▆▆▆▆▅▆▆▆▆▆▆█▆▆▆▆▆▆▆▅▆▆▆▆▆▆▆
train/epoch_loss,▁████
train/items_per_sec,▁▂▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▄▄▆▆▇▇▇▇▇▇▇█▇▆▆▆▆▆▇

metric,value
best_checkpoint_path,checkpoints\best_mic...
best_val_f1,0.08688
epoch,5
lr,0.00027
params/ratio,0.25635
params/total,278813189
params/trainable,71473157
step,11571
time/epoch_sec,107.5273
train/avg_loss_so_far,1.57708


Best trial: 0. Best value: 0.0868771:  35%|███▌      | 7/20 [1:04:53<1:58:27, 546.77s/it]

[Trial 6] f1=0.0869 | unfreeze_k=10 lr=4.27e-04 wd=1.1e-05 suggested_bs=16
[I 2025-08-16 03:28:24,646] Trial 6 finished with value: 0.08687713959680486 and parameters: {'num_unfreeze_last_layers': 10, 'lr': 0.00042707494144126896, 'weight_decay': 1.1005216961578654e-05, 'batch_size': 16}. Best is trial 0 with value: 0.08687713959680486.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 57,297,413 / 278,813,189 (20.55%) ; unfreeze_last_k=8
[e1 b1/2315] loss=1.6088 avg=1.6088 it/s=267.3
[e1 b2/2315] loss=1.6936 avg=1.6512 it/s=296.5


[e1 b232/2315] loss=1.5855 avg=1.5615 it/s=380.1
[e1 b463/2315] loss=1.2108 avg=1.4585 it/s=380.3
[e1 b694/2315] loss=1.0253 avg=1.3466 it/s=385.1
[e1 b925/2315] loss=1.3797 avg=1.2653 it/s=391.4
[e1 b1156/2315] loss=0.5378 avg=1.1974 it/s=399.3
[e1 b1387/2315] loss=0.6789 avg=1.1460 it/s=401.4
[e1 b1618/2315] loss=1.1922 avg=1.1068 it/s=404.2
[e1 b1849/2315] loss=0.8633 avg=1.0748 it/s=407.1
[e1 b2080/2315] loss=0.7380 avg=1.0510 it/s=410.7
[e1 b2311/2315] loss=0.8870 avg=1.0226 it/s=412.5


Epoch 1/12 | loss=1.0222 | val_acc=0.7259 | val_f1=0.7356 | time=94.6s
[e2 b1/2315] loss=0.5331 avg=0.5331 it/s=389.1
[e2 b2/2315] loss=0.6935 avg=0.6133 it/s=384.5


[e2 b232/2315] loss=0.7953 avg=0.8235 it/s=450.9
[e2 b463/2315] loss=1.1513 avg=0.8316 it/s=451.0
[e2 b694/2315] loss=0.5906 avg=0.8264 it/s=443.8
[e2 b925/2315] loss=1.2958 avg=0.8114 it/s=422.0
[e2 b1156/2315] loss=0.9201 avg=0.8019 it/s=409.1
[e2 b1387/2315] loss=1.0292 avg=0.7990 it/s=407.2
[e2 b1618/2315] loss=0.9451 avg=0.7901 it/s=408.5
[e2 b1849/2315] loss=0.8406 avg=0.7839 it/s=408.3
[e2 b2080/2315] loss=0.7381 avg=0.7748 it/s=408.9
[e2 b2311/2315] loss=0.7473 avg=0.7671 it/s=408.9


Epoch 2/12 | loss=0.7675 | val_acc=0.7502 | val_f1=0.7579 | time=95.3s
[e3 b1/2315] loss=0.7626 avg=0.7626 it/s=321.2
[e3 b2/2315] loss=0.3377 avg=0.5502 it/s=348.7


[e3 b232/2315] loss=0.8601 avg=0.6551 it/s=407.9
[e3 b463/2315] loss=0.3726 avg=0.6491 it/s=411.3
[e3 b694/2315] loss=0.3938 avg=0.6431 it/s=424.3
[e3 b925/2315] loss=0.2758 avg=0.6454 it/s=430.6
[e3 b1156/2315] loss=0.6477 avg=0.6361 it/s=431.6
[e3 b1387/2315] loss=0.2946 avg=0.6300 it/s=428.9
[e3 b1618/2315] loss=0.7511 avg=0.6264 it/s=420.5
[e3 b1849/2315] loss=0.4650 avg=0.6218 it/s=418.3
[e3 b2080/2315] loss=0.6790 avg=0.6155 it/s=417.6
[e3 b2311/2315] loss=0.4684 avg=0.6152 it/s=417.3


Epoch 3/12 | loss=0.6149 | val_acc=0.7677 | val_f1=0.7737 | time=93.6s
[e4 b1/2315] loss=0.7162 avg=0.7162 it/s=318.7
[e4 b2/2315] loss=0.7587 avg=0.7375 it/s=407.5


[e4 b232/2315] loss=0.1359 avg=0.5629 it/s=397.1
[e4 b463/2315] loss=0.5230 avg=0.5554 it/s=407.9
[e4 b694/2315] loss=0.2238 avg=0.5467 it/s=407.3
[e4 b925/2315] loss=0.6031 avg=0.5337 it/s=407.9
[e4 b1156/2315] loss=0.6168 avg=0.5272 it/s=412.3
[e4 b1387/2315] loss=0.3383 avg=0.5297 it/s=415.9
[e4 b1618/2315] loss=0.5899 avg=0.5290 it/s=420.2
[e4 b1849/2315] loss=0.5697 avg=0.5286 it/s=422.7
[e4 b2080/2315] loss=0.4777 avg=0.5256 it/s=419.5
[e4 b2311/2315] loss=0.3851 avg=0.5215 it/s=415.8


Epoch 4/12 | loss=0.5212 | val_acc=0.8260 | val_f1=0.8315 | time=94.0s
[e5 b1/2315] loss=0.2250 avg=0.2250 it/s=345.4
[e5 b2/2315] loss=0.4246 avg=0.3248 it/s=369.9


[e5 b232/2315] loss=0.7085 avg=0.4426 it/s=410.4
[e5 b463/2315] loss=0.2381 avg=0.4431 it/s=421.1
[e5 b694/2315] loss=0.3406 avg=0.4430 it/s=427.8
[e5 b925/2315] loss=0.1217 avg=0.4459 it/s=421.6
[e5 b1156/2315] loss=0.5031 avg=0.4489 it/s=425.1
[e5 b1387/2315] loss=0.3544 avg=0.4476 it/s=422.6
[e5 b1618/2315] loss=0.3058 avg=0.4438 it/s=422.1
[e5 b1849/2315] loss=0.2810 avg=0.4423 it/s=423.6
[e5 b2080/2315] loss=0.4732 avg=0.4448 it/s=424.6
[e5 b2311/2315] loss=1.2225 avg=0.4453 it/s=426.5


Epoch 5/12 | loss=0.4453 | val_acc=0.8355 | val_f1=0.8406 | time=91.6s
[e6 b1/2315] loss=0.2308 avg=0.2308 it/s=400.5
[e6 b2/2315] loss=0.2252 avg=0.2280 it/s=425.1


[e6 b232/2315] loss=0.5146 avg=0.3830 it/s=388.0
[e6 b463/2315] loss=0.0987 avg=0.3843 it/s=395.7
[e6 b694/2315] loss=0.2279 avg=0.3958 it/s=400.2
[e6 b925/2315] loss=0.7978 avg=0.3942 it/s=407.5
[e6 b1156/2315] loss=0.8454 avg=0.3938 it/s=409.6
[e6 b1387/2315] loss=0.1906 avg=0.3974 it/s=413.7
[e6 b1618/2315] loss=0.9221 avg=0.3955 it/s=414.7
[e6 b1849/2315] loss=0.5118 avg=0.3923 it/s=415.9
[e6 b2080/2315] loss=0.5120 avg=0.3931 it/s=414.2
[e6 b2311/2315] loss=0.1925 avg=0.3923 it/s=412.7


Epoch 6/12 | loss=0.3925 | val_acc=0.8397 | val_f1=0.8448 | time=94.5s
[e7 b1/2315] loss=0.1280 avg=0.1280 it/s=426.2
[e7 b2/2315] loss=0.4097 avg=0.2688 it/s=420.8


[e7 b232/2315] loss=0.0646 avg=0.3412 it/s=443.8
[e7 b463/2315] loss=0.2577 avg=0.3384 it/s=448.1
[e7 b694/2315] loss=0.7532 avg=0.3455 it/s=447.6
[e7 b925/2315] loss=0.1608 avg=0.3520 it/s=435.4
[e7 b1156/2315] loss=0.2818 avg=0.3521 it/s=426.6
[e7 b1387/2315] loss=0.2868 avg=0.3501 it/s=421.6
[e7 b1618/2315] loss=0.3685 avg=0.3473 it/s=418.4
[e7 b1849/2315] loss=0.1103 avg=0.3460 it/s=415.0
[e7 b2080/2315] loss=0.0447 avg=0.3433 it/s=415.1
[e7 b2311/2315] loss=0.1556 avg=0.3446 it/s=415.0


Epoch 7/12 | loss=0.3445 | val_acc=0.8431 | val_f1=0.8489 | time=94.1s
[e8 b1/2315] loss=0.5413 avg=0.5413 it/s=386.3
[e8 b2/2315] loss=0.0901 avg=0.3157 it/s=414.1


[e8 b232/2315] loss=0.0318 avg=0.2792 it/s=417.8
[e8 b463/2315] loss=0.1715 avg=0.2936 it/s=416.6
[e8 b694/2315] loss=0.5437 avg=0.2947 it/s=421.7
[e8 b925/2315] loss=0.1087 avg=0.2943 it/s=425.7
[e8 b1156/2315] loss=0.4963 avg=0.2950 it/s=427.6
[e8 b1387/2315] loss=0.3400 avg=0.2916 it/s=429.3
[e8 b1618/2315] loss=0.6454 avg=0.2909 it/s=422.1
[e8 b1849/2315] loss=0.0312 avg=0.2912 it/s=417.0
[e8 b2080/2315] loss=0.1026 avg=0.2923 it/s=413.9
[e8 b2311/2315] loss=0.2240 avg=0.2919 it/s=414.0


Epoch 8/12 | loss=0.2918 | val_acc=0.8416 | val_f1=0.8457 | time=94.2s
[e9 b1/2315] loss=0.3393 avg=0.3393 it/s=436.4
[e9 b2/2315] loss=0.2149 avg=0.2771 it/s=425.2


[e9 b232/2315] loss=0.0461 avg=0.2330 it/s=424.0
[e9 b463/2315] loss=0.2691 avg=0.2407 it/s=422.7
[e9 b694/2315] loss=0.6171 avg=0.2426 it/s=419.0
[e9 b925/2315] loss=0.0374 avg=0.2434 it/s=419.3
[e9 b1156/2315] loss=0.2369 avg=0.2432 it/s=421.3
[e9 b1387/2315] loss=0.3221 avg=0.2448 it/s=424.4
[e9 b1618/2315] loss=0.3104 avg=0.2483 it/s=426.2
[e9 b1849/2315] loss=1.1139 avg=0.2494 it/s=429.2
[e9 b2080/2315] loss=0.0212 avg=0.2487 it/s=431.4
[e9 b2311/2315] loss=0.0080 avg=0.2480 it/s=428.6


Epoch 9/12 | loss=0.2478 | val_acc=0.8571 | val_f1=0.8624 | time=91.7s
[e10 b1/2315] loss=0.0115 avg=0.0115 it/s=358.4
[e10 b2/2315] loss=0.3032 avg=0.1573 it/s=372.0


[e10 b232/2315] loss=0.1145 avg=0.1873 it/s=382.8
[e10 b463/2315] loss=0.0120 avg=0.1999 it/s=388.7
[e10 b694/2315] loss=0.6186 avg=0.2106 it/s=401.3
[e10 b925/2315] loss=0.2373 avg=0.2122 it/s=407.6
[e10 b1156/2315] loss=0.0178 avg=0.2117 it/s=410.6
[e10 b1387/2315] loss=0.1070 avg=0.2125 it/s=408.6
[e10 b1618/2315] loss=0.1147 avg=0.2107 it/s=402.1
[e10 b1849/2315] loss=0.2186 avg=0.2078 it/s=402.5
[e10 b2080/2315] loss=0.2331 avg=0.2087 it/s=407.2
[e10 b2311/2315] loss=1.0675 avg=0.2080 it/s=410.3


Epoch 10/12 | loss=0.2082 | val_acc=0.8576 | val_f1=0.8625 | time=95.1s
[e11 b1/2315] loss=0.1540 avg=0.1540 it/s=353.3
[e11 b2/2315] loss=0.6777 avg=0.4158 it/s=425.5


[e11 b232/2315] loss=0.5734 avg=0.1887 it/s=446.2
[e11 b463/2315] loss=0.3417 avg=0.1795 it/s=423.1
[e11 b694/2315] loss=0.0059 avg=0.1790 it/s=413.6
[e11 b925/2315] loss=0.7780 avg=0.1814 it/s=409.3
[e11 b1156/2315] loss=0.0257 avg=0.1839 it/s=407.2
[e11 b1387/2315] loss=0.0538 avg=0.1781 it/s=410.0
[e11 b1618/2315] loss=0.3056 avg=0.1774 it/s=410.6
[e11 b1849/2315] loss=0.0070 avg=0.1781 it/s=410.9
[e11 b2080/2315] loss=0.3084 avg=0.1762 it/s=410.8
[e11 b2311/2315] loss=0.0138 avg=0.1751 it/s=412.9


Epoch 11/12 | loss=0.1750 | val_acc=0.8613 | val_f1=0.8661 | time=94.4s
[e12 b1/2315] loss=0.2205 avg=0.2205 it/s=412.5
[e12 b2/2315] loss=0.2044 avg=0.2125 it/s=371.0


[e12 b232/2315] loss=0.0325 avg=0.1424 it/s=440.1
[e12 b463/2315] loss=0.9202 avg=0.1486 it/s=446.9
[e12 b694/2315] loss=0.0096 avg=0.1506 it/s=448.0
[e12 b925/2315] loss=0.0058 avg=0.1503 it/s=447.2
[e12 b1156/2315] loss=0.0104 avg=0.1456 it/s=439.3
[e12 b1387/2315] loss=0.0054 avg=0.1479 it/s=433.8
[e12 b1618/2315] loss=0.4413 avg=0.1509 it/s=427.2
[e12 b1849/2315] loss=0.0096 avg=0.1505 it/s=425.1
[e12 b2080/2315] loss=0.4762 avg=0.1471 it/s=422.9
[e12 b2311/2315] loss=0.2654 avg=0.1478 it/s=422.9


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Epoch 12/12 | loss=0.1478 | val_acc=0.8593 | val_f1=0.8639 | time=92.4s


metric,history
epoch,▁▁▁▁▂▂▂▂▂▂▂▂▂▃▃▃▄▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇▇▇▇███
lr,█▇▇▆▅▅▄▄▃▂▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▁▂▁▂▂▂▂▁▃▁▂▁▂▁▁▁▁▂▁▆▁▁▂▇▁▁▁▁█▁▂▂▁▁▂▂▂
time/epoch_sec,▇█▅▅▁▇▆▆▁█▆▃
train/avg_loss_so_far,█▇▆▅▃▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁
train/epoch_loss,█▆▅▄▃▃▃▂▂▁▁▁
train/items_per_sec,▄▅▆▆▆▆▇▁▃▆▇▅▇▆▆▆▇▇▆▆▇█▇▇▇▇▇▇▇▃▄▆▆▅▇▆▆▃██

metric,value
best_checkpoint_path,checkpoints\best_mic...
best_val_f1,0.86611
epoch,12
lr,0
params/ratio,0.2055
params/total,278813189
params/trainable,57297413
step,27776
time/epoch_sec,92.42218
train/avg_loss_so_far,0.14782


Best trial: 7. Best value: 0.866111:  40%|████      | 8/20 [1:23:57<2:27:21, 736.83s/it] 

[Trial 7] f1=0.8661 | unfreeze_k=8 lr=1.18e-04 wd=3.0e-05 suggested_bs=64
[I 2025-08-16 03:47:28,442] Trial 7 finished with value: 0.8661112815752661 and parameters: {'num_unfreeze_last_layers': 8, 'lr': 0.00011796756021837212, 'weight_decay': 2.981045031286568e-05, 'batch_size': 64}. Best is trial 7 with value: 0.8661112815752661.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 85,648,901 / 278,813,189 (30.72%) ; unfreeze_last_k=12
[e1 b1/2315] loss=1.5906 avg=1.5906 it/s=116.7
[e1 b2/2315] loss=1.6004 avg=1.5955 it/s=162.4


[e1 b232/2315] loss=1.5227 avg=1.5610 it/s=335.7
[e1 b463/2315] loss=1.1630 avg=1.4573 it/s=339.9
[e1 b694/2315] loss=1.0807 avg=1.3281 it/s=343.5
[e1 b925/2315] loss=0.8026 avg=1.2396 it/s=344.3
[e1 b1156/2315] loss=0.7963 avg=1.1762 it/s=347.0
[e1 b1387/2315] loss=1.0412 avg=1.1352 it/s=340.7
[e1 b1618/2315] loss=0.6206 avg=1.1053 it/s=336.3
[e1 b1849/2315] loss=0.6923 avg=1.0804 it/s=336.1
[e1 b2080/2315] loss=0.9028 avg=1.0614 it/s=337.2
[e1 b2311/2315] loss=0.7917 avg=1.0444 it/s=338.5


Epoch 1/12 | loss=1.0440 | val_acc=0.7058 | val_f1=0.7143 | time=114.1s
[e2 b1/2315] loss=0.6460 avg=0.6460 it/s=227.4
[e2 b2/2315] loss=0.6993 avg=0.6726 it/s=268.9


[e2 b232/2315] loss=1.0409 avg=0.8494 it/s=328.4
[e2 b463/2315] loss=0.7749 avg=0.8555 it/s=325.4
[e2 b694/2315] loss=0.7434 avg=0.8376 it/s=330.2
[e2 b925/2315] loss=0.6527 avg=0.8329 it/s=334.3
[e2 b1156/2315] loss=0.6264 avg=0.8318 it/s=338.9
[e2 b1387/2315] loss=0.6620 avg=0.8192 it/s=337.8
[e2 b1618/2315] loss=0.4748 avg=0.8078 it/s=333.0
[e2 b1849/2315] loss=0.9291 avg=0.7996 it/s=331.2
[e2 b2080/2315] loss=0.5288 avg=0.7921 it/s=330.1
[e2 b2311/2315] loss=0.8770 avg=0.7865 it/s=330.0


Epoch 2/12 | loss=0.7863 | val_acc=0.7247 | val_f1=0.7314 | time=117.0s
[e3 b1/2315] loss=0.7788 avg=0.7788 it/s=307.3
[e3 b2/2315] loss=0.8510 avg=0.8149 it/s=301.8


[e3 b232/2315] loss=0.9150 avg=0.7552 it/s=327.7
[e3 b463/2315] loss=0.4817 avg=0.7627 it/s=325.5
[e3 b694/2315] loss=0.7236 avg=0.7532 it/s=327.7
[e3 b925/2315] loss=0.1977 avg=0.7425 it/s=333.7
[e3 b1156/2315] loss=0.7167 avg=0.7328 it/s=337.0
[e3 b1387/2315] loss=0.6389 avg=0.7221 it/s=335.1
[e3 b1618/2315] loss=0.5884 avg=0.7144 it/s=332.2
[e3 b1849/2315] loss=0.3464 avg=0.7116 it/s=328.7
[e3 b2080/2315] loss=0.8675 avg=0.7153 it/s=328.1
[e3 b2311/2315] loss=0.7048 avg=0.7129 it/s=329.0


Epoch 3/12 | loss=0.7128 | val_acc=0.7515 | val_f1=0.7608 | time=117.4s
[e4 b1/2315] loss=0.4075 avg=0.4075 it/s=238.6
[e4 b2/2315] loss=0.5164 avg=0.4619 it/s=263.9


[e4 b232/2315] loss=0.8834 avg=0.6276 it/s=335.4
[e4 b463/2315] loss=0.6995 avg=0.6386 it/s=339.8
[e4 b694/2315] loss=0.4644 avg=0.6335 it/s=340.7
[e4 b925/2315] loss=0.4245 avg=0.6311 it/s=344.1
[e4 b1156/2315] loss=0.6617 avg=0.6371 it/s=346.3
[e4 b1387/2315] loss=0.6949 avg=0.6343 it/s=346.6
[e4 b1618/2315] loss=0.7790 avg=0.6353 it/s=340.6
[e4 b1849/2315] loss=0.4211 avg=0.6356 it/s=338.4
[e4 b2080/2315] loss=0.4639 avg=0.6324 it/s=337.6
[e4 b2311/2315] loss=0.3761 avg=0.6276 it/s=338.4


Epoch 4/12 | loss=0.6277 | val_acc=0.7949 | val_f1=0.8012 | time=114.3s
[e5 b1/2315] loss=0.7334 avg=0.7334 it/s=297.8
[e5 b2/2315] loss=0.3572 avg=0.5453 it/s=295.3


[e5 b232/2315] loss=0.4321 avg=0.5281 it/s=341.2
[e5 b463/2315] loss=0.7806 avg=0.5578 it/s=327.6
[e5 b694/2315] loss=0.9168 avg=0.5563 it/s=327.2
[e5 b925/2315] loss=0.8039 avg=0.5601 it/s=333.6
[e5 b1156/2315] loss=0.6044 avg=0.5567 it/s=338.5
[e5 b1387/2315] loss=0.3032 avg=0.5570 it/s=341.6
[e5 b1618/2315] loss=0.6337 avg=0.5533 it/s=339.6
[e5 b1849/2315] loss=0.4399 avg=0.5504 it/s=336.3
[e5 b2080/2315] loss=0.6536 avg=0.5475 it/s=334.5
[e5 b2311/2315] loss=0.7084 avg=0.5453 it/s=331.8


Epoch 5/12 | loss=0.5459 | val_acc=0.7940 | val_f1=0.8023 | time=116.7s
[e6 b1/2315] loss=0.7295 avg=0.7295 it/s=288.1
[e6 b2/2315] loss=0.5249 avg=0.6272 it/s=311.9


[e6 b232/2315] loss=0.2257 avg=0.4799 it/s=331.1
[e6 b463/2315] loss=0.6032 avg=0.4824 it/s=336.2
[e6 b694/2315] loss=0.3541 avg=0.4916 it/s=338.0
[e6 b925/2315] loss=1.0902 avg=0.4956 it/s=340.0
[e6 b1156/2315] loss=0.3937 avg=0.4984 it/s=343.1
[e6 b1387/2315] loss=0.7798 avg=0.4975 it/s=344.9
[e6 b1618/2315] loss=0.2966 avg=0.4913 it/s=341.6
[e6 b1849/2315] loss=0.2350 avg=0.4903 it/s=336.3
[e6 b2080/2315] loss=0.5908 avg=0.4880 it/s=333.1
[e6 b2311/2315] loss=0.5248 avg=0.4827 it/s=334.7


Epoch 6/12 | loss=0.4830 | val_acc=0.8039 | val_f1=0.8054 | time=115.4s
[e7 b1/2315] loss=0.3346 avg=0.3346 it/s=313.4
[e7 b2/2315] loss=0.3684 avg=0.3515 it/s=306.9


[e7 b232/2315] loss=0.1581 avg=0.4367 it/s=343.6
[e7 b463/2315] loss=0.6417 avg=0.4272 it/s=341.5
[e7 b694/2315] loss=0.1136 avg=0.4262 it/s=337.0
[e7 b925/2315] loss=0.5892 avg=0.4200 it/s=338.8
[e7 b1156/2315] loss=0.6244 avg=0.4186 it/s=341.6
[e7 b1387/2315] loss=0.4607 avg=0.4157 it/s=343.1
[e7 b1618/2315] loss=0.3270 avg=0.4119 it/s=343.1
[e7 b1849/2315] loss=0.2044 avg=0.4085 it/s=341.1
[e7 b2080/2315] loss=0.2486 avg=0.4067 it/s=339.6
[e7 b2311/2315] loss=0.2738 avg=0.4062 it/s=337.8


Epoch 7/12 | loss=0.4064 | val_acc=0.8226 | val_f1=0.8283 | time=114.6s
[e8 b1/2315] loss=0.2750 avg=0.2750 it/s=276.5
[e8 b2/2315] loss=0.0707 avg=0.1729 it/s=341.2


[e8 b232/2315] loss=0.4178 avg=0.3702 it/s=346.6
[e8 b463/2315] loss=0.5429 avg=0.3445 it/s=335.6
[e8 b694/2315] loss=0.1056 avg=0.3459 it/s=331.3
[e8 b925/2315] loss=0.4455 avg=0.3490 it/s=329.1
[e8 b1156/2315] loss=0.2290 avg=0.3476 it/s=334.3
[e8 b1387/2315] loss=0.1374 avg=0.3479 it/s=337.9
[e8 b1618/2315] loss=0.0688 avg=0.3488 it/s=341.2
[e8 b1849/2315] loss=0.0827 avg=0.3469 it/s=339.8
[e8 b2080/2315] loss=0.2884 avg=0.3462 it/s=338.3
[e8 b2311/2315] loss=0.8219 avg=0.3460 it/s=338.5


Epoch 8/12 | loss=0.3460 | val_acc=0.8414 | val_f1=0.8461 | time=114.2s
[e9 b1/2315] loss=0.0603 avg=0.0603 it/s=267.9
[e9 b2/2315] loss=0.2789 avg=0.1696 it/s=295.5


[e9 b232/2315] loss=0.4551 avg=0.2963 it/s=337.5
[e9 b463/2315] loss=0.0233 avg=0.2996 it/s=341.0
[e9 b694/2315] loss=0.0943 avg=0.2913 it/s=338.8
[e9 b925/2315] loss=0.3323 avg=0.2971 it/s=335.3
[e9 b1156/2315] loss=0.0689 avg=0.3004 it/s=336.5
[e9 b1387/2315] loss=0.2972 avg=0.3005 it/s=339.0
[e9 b1618/2315] loss=0.1462 avg=0.2991 it/s=341.9
[e9 b1849/2315] loss=0.0763 avg=0.2961 it/s=339.8
[e9 b2080/2315] loss=0.5195 avg=0.3001 it/s=334.9
[e9 b2311/2315] loss=0.1937 avg=0.3005 it/s=332.5


Epoch 9/12 | loss=0.3004 | val_acc=0.8372 | val_f1=0.8420 | time=116.4s
[e10 b1/2315] loss=0.3307 avg=0.3307 it/s=324.2
[e10 b2/2315] loss=0.4196 avg=0.3752 it/s=353.8


[e10 b232/2315] loss=0.7900 avg=0.2830 it/s=324.1
[e10 b463/2315] loss=0.0722 avg=0.2659 it/s=325.3
[e10 b694/2315] loss=0.3437 avg=0.2638 it/s=330.6
[e10 b925/2315] loss=0.7438 avg=0.2647 it/s=334.2
[e10 b1156/2315] loss=0.2349 avg=0.2585 it/s=336.0
[e10 b1387/2315] loss=0.6575 avg=0.2579 it/s=339.1
[e10 b1618/2315] loss=0.1110 avg=0.2520 it/s=342.0
[e10 b1849/2315] loss=0.0653 avg=0.2518 it/s=343.2
[e10 b2080/2315] loss=0.1145 avg=0.2511 it/s=343.0
[e10 b2311/2315] loss=0.2309 avg=0.2502 it/s=340.1


Epoch 10/12 | loss=0.2501 | val_acc=0.8438 | val_f1=0.8479 | time=113.7s
[e11 b1/2315] loss=0.4378 avg=0.4378 it/s=314.6
[e11 b2/2315] loss=0.4188 avg=0.4283 it/s=317.2


[e11 b232/2315] loss=0.3313 avg=0.2077 it/s=337.3
[e11 b463/2315] loss=0.0290 avg=0.2085 it/s=341.2
[e11 b694/2315] loss=0.6323 avg=0.2096 it/s=337.0
[e11 b925/2315] loss=0.0332 avg=0.2090 it/s=333.9
[e11 b1156/2315] loss=0.7619 avg=0.2083 it/s=330.8
[e11 b1387/2315] loss=0.0086 avg=0.2088 it/s=328.0
[e11 b1618/2315] loss=0.3504 avg=0.2083 it/s=327.5
[e11 b1849/2315] loss=0.2199 avg=0.2092 it/s=324.2
[e11 b2080/2315] loss=0.3522 avg=0.2086 it/s=324.3
[e11 b2311/2315] loss=0.2729 avg=0.2046 it/s=324.3


Epoch 11/12 | loss=0.2046 | val_acc=0.8593 | val_f1=0.8627 | time=119.3s
[e12 b1/2315] loss=0.0186 avg=0.0186 it/s=310.8
[e12 b2/2315] loss=0.0057 avg=0.0122 it/s=332.1


[e12 b232/2315] loss=0.0102 avg=0.1612 it/s=307.7
[e12 b463/2315] loss=0.3372 avg=0.1655 it/s=313.2
[e12 b694/2315] loss=0.0847 avg=0.1673 it/s=317.3
[e12 b925/2315] loss=0.0064 avg=0.1677 it/s=322.4
[e12 b1156/2315] loss=0.2920 avg=0.1673 it/s=324.9
[e12 b1387/2315] loss=0.0274 avg=0.1682 it/s=328.1
[e12 b1618/2315] loss=0.0739 avg=0.1704 it/s=331.2
[e12 b1849/2315] loss=0.0179 avg=0.1688 it/s=333.6
[e12 b2080/2315] loss=0.0215 avg=0.1696 it/s=330.2
[e12 b2311/2315] loss=0.3403 avg=0.1676 it/s=329.5


Epoch 12/12 | loss=0.1680 | val_acc=0.8557 | val_f1=0.8593 | time=117.3s


metric,history
epoch,▁▁▁▁▁▂▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇████
lr,█▇▇▆▅▅▄▄▃▂▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▂▂▂▂▁▂▁▁▂▂▂▂▁▁▂▂▁▁▁▁▁▁▂▂▂█▁▂▁▂▁▁▁▂▁▁▁▂
time/epoch_sec,▂▅▆▂▅▃▂▂▄▁█▆
train/avg_loss_so_far,█▆▆▆▆▄▅▅▅▄▄▃▄▃▃▃▃▃▃▃▃▃▂▃▃▃▂▂▂▂▁▂▂▂▂▂▂▁▂▂
train/epoch_loss,█▆▅▅▄▄▃▂▂▂▁▁
train/items_per_sec,▁██▇█▇▇▇▇▄▇█████████▇███▆████▇▇██▇█▇▇▇▇▇

metric,value
best_checkpoint_path,checkpoints\best_mic...
best_val_f1,0.86273
epoch,12
lr,0
params/ratio,0.30719
params/total,278813189
params/trainable,85648901
step,27776
time/epoch_sec,117.31185
train/avg_loss_so_far,0.16763


Best trial: 7. Best value: 0.866111:  45%|████▌     | 9/20 [1:47:27<2:53:40, 947.34s/it]

[Trial 8] f1=0.8627 | unfreeze_k=12 lr=1.12e-04 wd=1.9e-06 suggested_bs=64
[I 2025-08-16 04:10:58,654] Trial 8 finished with value: 0.8627319723752823 and parameters: {'num_unfreeze_last_layers': 12, 'lr': 0.00011188289822099445, 'weight_decay': 1.907820133444184e-06, 'batch_size': 64}. Best is trial 7 with value: 0.8661112815752661.


Trainable params: 85,648,901 / 278,813,189 (30.72%) ; unfreeze_last_k=12
[e1 b1/2315] loss=1.6074 avg=1.6074 it/s=107.5


[e1 b2/2315] loss=1.6571 avg=1.6323 it/s=149.0
[e1 b232/2315] loss=1.6314 avg=1.5917 it/s=336.5
[e1 b463/2315] loss=1.5659 avg=1.5916 it/s=341.1
[e1 b694/2315] loss=1.8088 avg=1.5984 it/s=344.4
[e1 b925/2315] loss=1.4641 avg=1.5955 it/s=345.8
[e1 b1156/2315] loss=1.5447 avg=1.5938 it/s=347.3
[e1 b1387/2315] loss=1.6746 avg=1.5910 it/s=349.5
[e1 b1618/2315] loss=1.6496 avg=1.5922 it/s=350.4
[e1 b1849/2315] loss=1.6432 avg=1.5912 it/s=349.2
[e1 b2080/2315] loss=1.5245 avg=1.5895 it/s=345.0
[e1 b2311/2315] loss=1.5857 avg=1.5882 it/s=341.3


Epoch 1/12 | loss=1.5880 | val_acc=0.2775 | val_f1=0.0869 | time=113.5s
[e2 b1/2315] loss=1.5201 avg=1.5201 it/s=311.1
[e2 b2/2315] loss=1.6033 avg=1.5617 it/s=336.2


[e2 b232/2315] loss=1.6502 avg=1.5756 it/s=340.6
[e2 b463/2315] loss=1.5309 avg=1.5778 it/s=337.2
[e2 b694/2315] loss=1.7190 avg=1.5782 it/s=328.7
[e2 b925/2315] loss=1.5597 avg=1.5767 it/s=329.7
[e2 b1156/2315] loss=1.6124 avg=1.5768 it/s=324.1
[e2 b1387/2315] loss=1.5072 avg=1.5772 it/s=328.1
[e2 b1618/2315] loss=1.4943 avg=1.5773 it/s=332.0
[e2 b1849/2315] loss=1.5887 avg=1.5768 it/s=335.3
[e2 b2080/2315] loss=1.6874 avg=1.5760 it/s=334.1
[e2 b2311/2315] loss=1.5985 avg=1.5753 it/s=331.7


Epoch 2/12 | loss=1.5752 | val_acc=0.2775 | val_f1=0.0869 | time=116.6s
[e3 b1/2315] loss=1.5978 avg=1.5978 it/s=273.8
[e3 b2/2315] loss=1.6064 avg=1.6021 it/s=302.6


[e3 b232/2315] loss=1.5817 avg=1.5805 it/s=298.1
[e3 b463/2315] loss=1.4832 avg=1.5782 it/s=310.7
[e3 b694/2315] loss=1.5686 avg=1.5771 it/s=312.8
[e3 b925/2315] loss=1.7594 avg=1.5756 it/s=317.3
[e3 b1156/2315] loss=1.5554 avg=1.5750 it/s=319.6
[e3 b1387/2315] loss=1.6730 avg=1.5743 it/s=324.2
[e3 b1618/2315] loss=1.3810 avg=1.5749 it/s=327.7
[e3 b1849/2315] loss=1.6002 avg=1.5757 it/s=330.1
[e3 b2080/2315] loss=1.4804 avg=1.5749 it/s=329.3
[e3 b2311/2315] loss=1.6330 avg=1.5754 it/s=329.5


Epoch 3/12 | loss=1.5755 | val_acc=0.2775 | val_f1=0.0869 | time=117.4s
[e4 b1/2315] loss=1.4830 avg=1.4830 it/s=256.2
[e4 b2/2315] loss=1.7079 avg=1.5955 it/s=282.9


[e4 b232/2315] loss=1.6338 avg=1.5798 it/s=331.5
[e4 b463/2315] loss=1.5332 avg=1.5732 it/s=338.5
[e4 b694/2315] loss=1.6436 avg=1.5747 it/s=343.1
[e4 b925/2315] loss=1.6740 avg=1.5757 it/s=344.5
[e4 b1156/2315] loss=1.6276 avg=1.5759 it/s=346.3
[e4 b1387/2315] loss=1.5473 avg=1.5751 it/s=345.7
[e4 b1618/2315] loss=1.5941 avg=1.5754 it/s=347.7
[e4 b1849/2315] loss=1.5875 avg=1.5751 it/s=349.0
[e4 b2080/2315] loss=1.4742 avg=1.5753 it/s=350.2
[e4 b2311/2315] loss=1.5502 avg=1.5753 it/s=346.5


Epoch 4/12 | loss=1.5754 | val_acc=0.2775 | val_f1=0.0869 | time=111.9s
[e5 b1/2315] loss=1.6488 avg=1.6488 it/s=343.8
[e5 b2/2315] loss=1.5738 avg=1.6113 it/s=336.1


[e5 b232/2315] loss=1.5515 avg=1.5772 it/s=323.2
[e5 b463/2315] loss=1.5757 avg=1.5799 it/s=328.0
[e5 b694/2315] loss=1.6036 avg=1.5765 it/s=331.6
[e5 b925/2315] loss=1.5834 avg=1.5771 it/s=335.2
[e5 b1156/2315] loss=1.5903 avg=1.5754 it/s=334.8
[e5 b1387/2315] loss=1.5117 avg=1.5760 it/s=332.6
[e5 b1618/2315] loss=1.6593 avg=1.5759 it/s=334.2
[e5 b1849/2315] loss=1.5279 avg=1.5755 it/s=335.4
[e5 b2080/2315] loss=1.4750 avg=1.5754 it/s=335.6
[e5 b2311/2315] loss=1.5521 avg=1.5753 it/s=335.5


Early stopping at epoch 5


metric,history
epoch,▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆▆██████
lr,█▆▅▃▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▁▁▁▂▂▂▂▂▂▂▁▁▁▁▂▂▂▄▄▁▁▄▂▂▅▁▁▂▂▂▂▇▂▂█▂▂
time/epoch_sec,▃▇█▁▆
train/avg_loss_so_far,▆▆▆▆▆▆▅▄▅▅▅▅▅▅▆▅▅▅▅▅▁▆▅▅▅▅▅▅▅▅█▆▅▅▅▅▅▅▅▅
train/epoch_loss,█▁▁▁▁
train/items_per_sec,▁███████▇█▇▇█▇▇▇▇▇▇▇▇▅▆▇████████▇▇▇█████

metric,value
best_checkpoint_path,checkpoints\best_mic...
best_val_f1,0.08688
epoch,5
lr,0.00198
params/ratio,0.30719
params/total,278813189
params/trainable,85648901
step,11571
time/epoch_sec,115.44319
train/avg_loss_so_far,1.57533


Best trial: 7. Best value: 0.866111:  50%|█████     | 10/20 [1:57:12<2:19:14, 835.48s/it]

[Trial 9] f1=0.0869 | unfreeze_k=12 lr=3.20e-03 wd=2.1e-05 suggested_bs=16
[I 2025-08-16 04:20:43,670] Trial 9 finished with value: 0.08687713959680486 and parameters: {'num_unfreeze_last_layers': 12, 'lr': 0.00319802754676831, 'weight_decay': 2.0716735238547033e-05, 'batch_size': 16}. Best is trial 7 with value: 0.8661112815752661.


Trainable params: 57,297,413 / 278,813,189 (20.55%) ; unfreeze_last_k=8
[e1 b1/2315] loss=1.6636 avg=1.6636 it/s=110.5
[e1 b2/2315] loss=1.6748 avg=1.6692 it/s=165.4


[e1 b232/2315] loss=1.3313 avg=1.5650 it/s=388.6
[e1 b463/2315] loss=1.4368 avg=1.4286 it/s=400.6
[e1 b694/2315] loss=0.9870 avg=1.3233 it/s=408.8
[e1 b925/2315] loss=1.1821 avg=1.2514 it/s=410.9
[e1 b1156/2315] loss=1.2061 avg=1.2186 it/s=413.0
[e1 b1387/2315] loss=1.2778 avg=1.2372 it/s=413.5
[e1 b1618/2315] loss=1.6307 avg=1.2833 it/s=414.0
[e1 b1849/2315] loss=1.6043 avg=1.3209 it/s=415.9
[e1 b2080/2315] loss=1.7989 avg=1.3499 it/s=417.4
[e1 b2311/2315] loss=1.6437 avg=1.3731 it/s=418.6


Epoch 1/12 | loss=1.3733 | val_acc=0.2410 | val_f1=0.0777 | time=93.2s
[e2 b1/2315] loss=1.5059 avg=1.5059 it/s=388.7
[e2 b2/2315] loss=1.6141 avg=1.5600 it/s=411.6


[e2 b232/2315] loss=1.6005 avg=1.5830 it/s=402.9
[e2 b463/2315] loss=1.5913 avg=1.5815 it/s=397.6
[e2 b694/2315] loss=1.6431 avg=1.5817 it/s=388.4
[e2 b925/2315] loss=1.5699 avg=1.5801 it/s=387.3
[e2 b1156/2315] loss=1.5077 avg=1.5797 it/s=388.5
[e2 b1387/2315] loss=1.6374 avg=1.5790 it/s=391.2
[e2 b1618/2315] loss=1.4535 avg=1.5786 it/s=391.7
[e2 b1849/2315] loss=1.5939 avg=1.5789 it/s=393.2
[e2 b2080/2315] loss=1.6545 avg=1.5779 it/s=394.0
[e2 b2311/2315] loss=1.4607 avg=1.5776 it/s=396.6


Epoch 2/12 | loss=1.5777 | val_acc=0.2775 | val_f1=0.0869 | time=98.2s
[e3 b1/2315] loss=1.5554 avg=1.5554 it/s=444.6
[e3 b2/2315] loss=1.5894 avg=1.5724 it/s=456.3


[e3 b232/2315] loss=1.6224 avg=1.5784 it/s=435.6
[e3 b463/2315] loss=1.5086 avg=1.5763 it/s=435.9
[e3 b694/2315] loss=1.5520 avg=1.5742 it/s=434.5
[e3 b925/2315] loss=1.5960 avg=1.5748 it/s=422.0
[e3 b1156/2315] loss=1.6052 avg=1.5765 it/s=410.0
[e3 b1387/2315] loss=1.6347 avg=1.5765 it/s=403.1
[e3 b1618/2315] loss=1.7059 avg=1.5767 it/s=404.1
[e3 b1849/2315] loss=1.4278 avg=1.5759 it/s=407.8
[e3 b2080/2315] loss=1.6579 avg=1.5762 it/s=409.1
[e3 b2311/2315] loss=1.5522 avg=1.5765 it/s=408.9


Epoch 3/12 | loss=1.5765 | val_acc=0.2775 | val_f1=0.0869 | time=95.4s
[e4 b1/2315] loss=1.5380 avg=1.5380 it/s=424.7
[e4 b2/2315] loss=1.5323 avg=1.5351 it/s=398.8


[e4 b232/2315] loss=1.7128 avg=1.5796 it/s=407.7
[e4 b463/2315] loss=1.5160 avg=1.5779 it/s=410.7
[e4 b694/2315] loss=1.5462 avg=1.5777 it/s=421.8
[e4 b925/2315] loss=1.5704 avg=1.5775 it/s=427.8
[e4 b1156/2315] loss=1.5787 avg=1.5770 it/s=428.4
[e4 b1387/2315] loss=1.6296 avg=1.5775 it/s=429.5
[e4 b1618/2315] loss=1.5998 avg=1.5761 it/s=422.5
[e4 b1849/2315] loss=1.5650 avg=1.5754 it/s=417.1
[e4 b2080/2315] loss=1.5767 avg=1.5757 it/s=415.5
[e4 b2311/2315] loss=1.5435 avg=1.5761 it/s=416.6


Epoch 4/12 | loss=1.5760 | val_acc=0.2775 | val_f1=0.0869 | time=93.6s
[e5 b1/2315] loss=1.6385 avg=1.6385 it/s=587.2
[e5 b2/2315] loss=1.5489 avg=1.5937 it/s=462.8


[e5 b232/2315] loss=1.5125 avg=1.5832 it/s=413.6
[e5 b463/2315] loss=1.6174 avg=1.5794 it/s=405.0
[e5 b694/2315] loss=1.6764 avg=1.5779 it/s=407.4
[e5 b925/2315] loss=1.5786 avg=1.5761 it/s=412.3
[e5 b1156/2315] loss=1.6209 avg=1.5764 it/s=412.9
[e5 b1387/2315] loss=1.6021 avg=1.5752 it/s=417.3
[e5 b1618/2315] loss=1.5537 avg=1.5755 it/s=419.3
[e5 b1849/2315] loss=1.4961 avg=1.5761 it/s=421.0
[e5 b2080/2315] loss=1.5363 avg=1.5759 it/s=417.8
[e5 b2311/2315] loss=1.5471 avg=1.5759 it/s=413.5


Epoch 5/12 | loss=1.5759 | val_acc=0.2775 | val_f1=0.0869 | time=94.6s
[e6 b1/2315] loss=1.5497 avg=1.5497 it/s=371.5
[e6 b2/2315] loss=1.6143 avg=1.5820 it/s=498.3


[e6 b232/2315] loss=1.6004 avg=1.5775 it/s=379.4
[e6 b463/2315] loss=1.5904 avg=1.5755 it/s=396.4
[e6 b694/2315] loss=1.4713 avg=1.5736 it/s=404.1
[e6 b925/2315] loss=1.5444 avg=1.5754 it/s=408.1
[e6 b1156/2315] loss=1.5416 avg=1.5745 it/s=412.3
[e6 b1387/2315] loss=1.4810 avg=1.5748 it/s=414.1
[e6 b1618/2315] loss=1.5129 avg=1.5751 it/s=414.1
[e6 b1849/2315] loss=1.5295 avg=1.5754 it/s=412.5
[e6 b2080/2315] loss=1.5778 avg=1.5754 it/s=416.3
[e6 b2311/2315] loss=1.5354 avg=1.5757 it/s=419.1


Early stopping at epoch 6


metric,history
epoch,▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▄▄▄▄▄▄▄▅▅▅▅▅▇▇▇▇▇▇█████
lr,█▇▅▄▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▂▂▂▁▂▂▂▂▂▃▁▅▂▂▁▂▂▂▂▂▇▃▃▁▁█▂▂▂▂▂▁▁▂▂▂▂▂
time/epoch_sec,▁█▄▂▃▁
train/avg_loss_so_far,██▄▂▁▂▃▆▇▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▇▇▆▆▆▆▆▆▆▆▆▆▆▆▆
train/epoch_loss,▁█████
train/items_per_sec,▁▅▅▅▅▅▅▅▅▅▅▅▆▆▆▅▅▅▅▅▅▆▆▆▆▅█▆▅▅▆▆▅▇▅▅▅▅▅▆

metric,value
best_checkpoint_path,checkpoints\best_mic...
best_val_f1,0.08688
epoch,6
lr,0.00017
params/ratio,0.2055
params/total,278813189
params/trainable,57297413
step,13886
time/epoch_sec,93.10518
train/avg_loss_so_far,1.57571


Best trial: 7. Best value: 0.866111:  55%|█████▌    | 11/20 [2:06:51<1:53:33, 757.01s/it]

[Trial 10] f1=0.0869 | unfreeze_k=8 lr=3.12e-04 wd=8.9e-05 suggested_bs=64
[I 2025-08-16 04:30:22,761] Trial 10 finished with value: 0.08687713959680486 and parameters: {'num_unfreeze_last_layers': 8, 'lr': 0.00031227487155926006, 'weight_decay': 8.926712608942563e-05, 'batch_size': 64}. Best is trial 7 with value: 0.8661112815752661.


Trainable params: 57,297,413 / 278,813,189 (20.55%) ; unfreeze_last_k=8
[e1 b1/2315] loss=1.5741 avg=1.5741 it/s=260.9
[e1 b2/2315] loss=1.6420 avg=1.6081 it/s=282.5


[e1 b232/2315] loss=1.3953 avg=1.5713 it/s=395.1
[e1 b463/2315] loss=1.1916 avg=1.4826 it/s=388.3
[e1 b694/2315] loss=0.9879 avg=1.3671 it/s=388.5
[e1 b925/2315] loss=0.7412 avg=1.2712 it/s=396.2
[e1 b1156/2315] loss=0.8679 avg=1.2049 it/s=400.4
[e1 b1387/2315] loss=1.0370 avg=1.1525 it/s=405.0
[e1 b1618/2315] loss=1.0214 avg=1.1094 it/s=408.0
[e1 b1849/2315] loss=0.9011 avg=1.0775 it/s=409.5
[e1 b2080/2315] loss=0.5252 avg=1.0449 it/s=409.6
[e1 b2311/2315] loss=0.3931 avg=1.0165 it/s=409.9


Epoch 1/12 | loss=1.0161 | val_acc=0.7128 | val_f1=0.7229 | time=95.1s
[e2 b1/2315] loss=0.3638 avg=0.3638 it/s=399.2
[e2 b2/2315] loss=0.7488 avg=0.5563 it/s=411.0


[e2 b232/2315] loss=0.4332 avg=0.7298 it/s=410.1
[e2 b463/2315] loss=1.1351 avg=0.7072 it/s=418.6
[e2 b694/2315] loss=0.5595 avg=0.6999 it/s=420.2
[e2 b925/2315] loss=0.5168 avg=0.6921 it/s=410.9
[e2 b1156/2315] loss=0.7191 avg=0.6885 it/s=409.3
[e2 b1387/2315] loss=0.4723 avg=0.6818 it/s=407.2
[e2 b1618/2315] loss=0.6468 avg=0.6701 it/s=407.9
[e2 b1849/2315] loss=0.2358 avg=0.6668 it/s=405.0
[e2 b2080/2315] loss=0.8136 avg=0.6600 it/s=404.9
[e2 b2311/2315] loss=0.2666 avg=0.6550 it/s=407.5


Epoch 2/12 | loss=0.6551 | val_acc=0.7942 | val_f1=0.8000 | time=95.7s
[e3 b1/2315] loss=0.5782 avg=0.5782 it/s=417.7
[e3 b2/2315] loss=0.4999 avg=0.5390 it/s=374.3


[e3 b232/2315] loss=0.4458 avg=0.5317 it/s=403.9
[e3 b463/2315] loss=0.3514 avg=0.5489 it/s=409.4
[e3 b694/2315] loss=0.5925 avg=0.5462 it/s=417.8
[e3 b925/2315] loss=0.4076 avg=0.5447 it/s=423.0
[e3 b1156/2315] loss=0.3250 avg=0.5461 it/s=426.2
[e3 b1387/2315] loss=0.5199 avg=0.5432 it/s=422.3
[e3 b1618/2315] loss=1.4349 avg=0.5427 it/s=415.8
[e3 b1849/2315] loss=0.5430 avg=0.5354 it/s=411.0
[e3 b2080/2315] loss=0.1724 avg=0.5356 it/s=408.5
[e3 b2311/2315] loss=0.6266 avg=0.5324 it/s=406.6


Epoch 3/12 | loss=0.5324 | val_acc=0.8190 | val_f1=0.8242 | time=96.1s
[e4 b1/2315] loss=0.2065 avg=0.2065 it/s=461.3
[e4 b2/2315] loss=0.4936 avg=0.3501 it/s=454.8


[e4 b232/2315] loss=0.5218 avg=0.4569 it/s=402.4
[e4 b463/2315] loss=0.6237 avg=0.4604 it/s=410.5
[e4 b694/2315] loss=0.4120 avg=0.4652 it/s=411.7
[e4 b925/2315] loss=0.3373 avg=0.4652 it/s=415.2
[e4 b1156/2315] loss=0.0364 avg=0.4617 it/s=418.2
[e4 b1387/2315] loss=0.2746 avg=0.4592 it/s=422.1
[e4 b1618/2315] loss=0.1483 avg=0.4541 it/s=426.1
[e4 b1849/2315] loss=0.8854 avg=0.4522 it/s=428.5
[e4 b2080/2315] loss=0.5785 avg=0.4503 it/s=424.4
[e4 b2311/2315] loss=0.2212 avg=0.4481 it/s=422.7


Epoch 4/12 | loss=0.4478 | val_acc=0.8017 | val_f1=0.8080 | time=92.3s
[e5 b1/2315] loss=0.8688 avg=0.8688 it/s=686.0
[e5 b2/2315] loss=0.1497 avg=0.5092 it/s=421.4


[e5 b232/2315] loss=0.2904 avg=0.3888 it/s=422.8
[e5 b463/2315] loss=0.5322 avg=0.3783 it/s=428.4
[e5 b694/2315] loss=0.3977 avg=0.3754 it/s=429.4
[e5 b925/2315] loss=0.5104 avg=0.3847 it/s=431.0
[e5 b1156/2315] loss=0.4124 avg=0.3881 it/s=430.3
[e5 b1387/2315] loss=0.2136 avg=0.3900 it/s=425.4
[e5 b1618/2315] loss=0.3838 avg=0.3909 it/s=421.1
[e5 b1849/2315] loss=0.2131 avg=0.3899 it/s=421.3
[e5 b2080/2315] loss=0.4790 avg=0.3881 it/s=424.0
[e5 b2311/2315] loss=0.5981 avg=0.3885 it/s=426.0


Epoch 5/12 | loss=0.3882 | val_acc=0.8440 | val_f1=0.8475 | time=91.6s
[e6 b1/2315] loss=0.6455 avg=0.6455 it/s=381.5
[e6 b2/2315] loss=0.2859 avg=0.4657 it/s=390.7


[e6 b232/2315] loss=0.0860 avg=0.3308 it/s=414.8
[e6 b463/2315] loss=0.1791 avg=0.3373 it/s=415.4
[e6 b694/2315] loss=0.8090 avg=0.3267 it/s=416.5
[e6 b925/2315] loss=0.4906 avg=0.3299 it/s=412.7
[e6 b1156/2315] loss=0.2831 avg=0.3296 it/s=410.0
[e6 b1387/2315] loss=0.2883 avg=0.3285 it/s=412.5
[e6 b1618/2315] loss=0.4220 avg=0.3294 it/s=410.7
[e6 b1849/2315] loss=0.0597 avg=0.3316 it/s=409.3
[e6 b2080/2315] loss=0.0642 avg=0.3345 it/s=409.1
[e6 b2311/2315] loss=0.0537 avg=0.3320 it/s=411.1


Epoch 6/12 | loss=0.3320 | val_acc=0.8397 | val_f1=0.8445 | time=95.0s
[e7 b1/2315] loss=0.0729 avg=0.0729 it/s=408.7
[e7 b2/2315] loss=0.0744 avg=0.0736 it/s=479.2


[e7 b232/2315] loss=0.4538 avg=0.3061 it/s=433.9
[e7 b463/2315] loss=0.0808 avg=0.2960 it/s=434.0
[e7 b694/2315] loss=0.0265 avg=0.2901 it/s=432.9
[e7 b925/2315] loss=0.4768 avg=0.2944 it/s=425.0
[e7 b1156/2315] loss=0.8689 avg=0.2926 it/s=414.4
[e7 b1387/2315] loss=0.2043 avg=0.2923 it/s=406.4
[e7 b1618/2315] loss=0.2696 avg=0.2909 it/s=405.6
[e7 b1849/2315] loss=0.4585 avg=0.2918 it/s=407.0
[e7 b2080/2315] loss=0.0391 avg=0.2902 it/s=409.1
[e7 b2311/2315] loss=0.1793 avg=0.2902 it/s=408.2


Epoch 7/12 | loss=0.2902 | val_acc=0.8550 | val_f1=0.8597 | time=95.5s
[e8 b1/2315] loss=0.2687 avg=0.2687 it/s=397.1
[e8 b2/2315] loss=0.1744 avg=0.2216 it/s=392.7


[e8 b232/2315] loss=0.1471 avg=0.2458 it/s=416.9
[e8 b463/2315] loss=0.0191 avg=0.2562 it/s=418.8
[e8 b694/2315] loss=0.0422 avg=0.2528 it/s=424.5
[e8 b925/2315] loss=0.4137 avg=0.2464 it/s=431.4
[e8 b1156/2315] loss=0.2493 avg=0.2445 it/s=433.9
[e8 b1387/2315] loss=0.0352 avg=0.2437 it/s=436.3
[e8 b1618/2315] loss=0.0326 avg=0.2440 it/s=431.9
[e8 b1849/2315] loss=0.4169 avg=0.2418 it/s=425.3
[e8 b2080/2315] loss=0.3074 avg=0.2430 it/s=419.4
[e8 b2311/2315] loss=0.0512 avg=0.2438 it/s=419.0


Epoch 8/12 | loss=0.2437 | val_acc=0.8499 | val_f1=0.8555 | time=93.1s
[e9 b1/2315] loss=0.2563 avg=0.2563 it/s=729.0
[e9 b2/2315] loss=0.8885 avg=0.5724 it/s=439.8


[e9 b232/2315] loss=0.2576 avg=0.2083 it/s=430.5
[e9 b463/2315] loss=0.0231 avg=0.2012 it/s=422.5
[e9 b694/2315] loss=0.0220 avg=0.2075 it/s=409.6
[e9 b925/2315] loss=0.2449 avg=0.2106 it/s=407.3
[e9 b1156/2315] loss=0.0132 avg=0.2138 it/s=408.5
[e9 b1387/2315] loss=0.5304 avg=0.2102 it/s=410.8
[e9 b1618/2315] loss=0.2831 avg=0.2099 it/s=414.0
[e9 b1849/2315] loss=0.2791 avg=0.2119 it/s=417.7
[e9 b2080/2315] loss=0.3622 avg=0.2116 it/s=420.9
[e9 b2311/2315] loss=0.6624 avg=0.2124 it/s=418.9


Epoch 9/12 | loss=0.2122 | val_acc=0.8681 | val_f1=0.8720 | time=93.5s
[e10 b1/2315] loss=0.2775 avg=0.2775 it/s=267.2
[e10 b2/2315] loss=0.0265 avg=0.1520 it/s=327.4


[e10 b232/2315] loss=0.0204 avg=0.1572 it/s=382.6
[e10 b463/2315] loss=0.3251 avg=0.1789 it/s=389.6
[e10 b694/2315] loss=0.4267 avg=0.1773 it/s=389.9
[e10 b925/2315] loss=0.0211 avg=0.1761 it/s=385.1
[e10 b1156/2315] loss=0.4688 avg=0.1735 it/s=386.7
[e10 b1387/2315] loss=0.4329 avg=0.1698 it/s=391.9
[e10 b1618/2315] loss=0.8178 avg=0.1714 it/s=395.4
[e10 b1849/2315] loss=0.3732 avg=0.1713 it/s=398.1
[e10 b2080/2315] loss=0.4821 avg=0.1709 it/s=403.1
[e10 b2311/2315] loss=0.0099 avg=0.1702 it/s=407.0


Epoch 10/12 | loss=0.1703 | val_acc=0.8664 | val_f1=0.8705 | time=95.7s
[e11 b1/2315] loss=0.0039 avg=0.0039 it/s=439.6
[e11 b2/2315] loss=0.0078 avg=0.0059 it/s=451.2


[e11 b232/2315] loss=0.0042 avg=0.1478 it/s=435.4
[e11 b463/2315] loss=0.0126 avg=0.1369 it/s=405.8
[e11 b694/2315] loss=0.0032 avg=0.1444 it/s=394.2
[e11 b925/2315] loss=0.3590 avg=0.1409 it/s=389.6
[e11 b1156/2315] loss=0.0074 avg=0.1416 it/s=398.0
[e11 b1387/2315] loss=0.1257 avg=0.1442 it/s=402.3
[e11 b1618/2315] loss=0.3514 avg=0.1447 it/s=403.2
[e11 b1849/2315] loss=0.0131 avg=0.1476 it/s=404.9
[e11 b2080/2315] loss=0.0051 avg=0.1499 it/s=407.8
[e11 b2311/2315] loss=0.0075 avg=0.1483 it/s=408.2


Epoch 11/12 | loss=0.1485 | val_acc=0.8635 | val_f1=0.8677 | time=95.5s
[e12 b1/2315] loss=0.0049 avg=0.0049 it/s=429.2
[e12 b2/2315] loss=0.0168 avg=0.0108 it/s=407.1


[e12 b232/2315] loss=0.0022 avg=0.1151 it/s=432.7
[e12 b463/2315] loss=0.0072 avg=0.1215 it/s=439.8
[e12 b694/2315] loss=0.0030 avg=0.1125 it/s=442.1
[e12 b925/2315] loss=0.0030 avg=0.1185 it/s=444.1
[e12 b1156/2315] loss=0.0032 avg=0.1247 it/s=432.6
[e12 b1387/2315] loss=0.0955 avg=0.1224 it/s=424.3
[e12 b1618/2315] loss=0.4672 avg=0.1232 it/s=421.1
[e12 b1849/2315] loss=0.0052 avg=0.1218 it/s=422.5
[e12 b2080/2315] loss=0.0149 avg=0.1216 it/s=422.3
[e12 b2311/2315] loss=0.0031 avg=0.1207 it/s=419.8


[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


Epoch 12/12 | loss=0.1207 | val_acc=0.8688 | val_f1=0.8732 | time=93.1s


Run history:
epoch,▁▁▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▄▄▄▄▅▅▅▅▅▅▅▅▅▆▇▇▇▇▇▇▇██
lr,█▇▇▆▅▅▄▄▃▂▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▂▁▁▁▁▁▆▁▁▁▁▇▁▁▂█▂
time/epoch_sec,▆▇█▂▁▆▇▃▄▇▇▃
train/avg_loss_so_far,█▇▇▆▆▅▄▄▄▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▂▂
train/epoch_loss,█▅▄▄▃▃▂▂▂▁▁▁
train/items_per_sec,▂▂▂▂▂▂▂▂▂▃▂█▂▂▂▂▁▂▂▂▃▂▂▂▂▁▂▂▂▂▁▁▁▁▂▁▂▂▂▂

Run summary:
best_checkpoint_path,checkpoints\best_mic...
best_val_f1,0.87317
epoch,12
lr,0
params/ratio,0.2055
params/total,278813189
params/trainable,57297413
step,27776
time/epoch_sec,93.10025
train/avg_loss_so_far,0.12071


Best trial: 11. Best value: 0.873175:  60%|██████    | 12/20 [2:26:00<1:56:48, 876.04s/it]

[Trial 11] f1=0.8732 | unfreeze_k=8 lr=1.06e-04 wd=3.9e-06 suggested_bs=64
[I 2025-08-16 04:49:31,033] Trial 11 finished with value: 0.873174802426709 and parameters: {'num_unfreeze_last_layers': 8, 'lr': 0.0001061735334192977, 'weight_decay': 3.9302980013836756e-06, 'batch_size': 64}. Best is trial 11 with value: 0.873174802426709.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 57,297,413 / 278,813,189 (20.55%) ; unfreeze_last_k=8
[e1 b1/2315] loss=1.6031 avg=1.6031 it/s=195.4
[e1 b2/2315] loss=1.6014 avg=1.6023 it/s=245.6


[e1 b232/2315] loss=1.4609 avg=1.5490 it/s=381.3
[e1 b463/2315] loss=1.1682 avg=1.3954 it/s=391.7
[e1 b694/2315] loss=0.3986 avg=1.2816 it/s=400.0
[e1 b925/2315] loss=0.9053 avg=1.2154 it/s=405.5
[e1 b1156/2315] loss=1.1431 avg=1.1746 it/s=410.9
[e1 b1387/2315] loss=0.7161 avg=1.1443 it/s=413.3
[e1 b1618/2315] loss=1.4147 avg=1.1236 it/s=408.9
[e1 b1849/2315] loss=0.8793 avg=1.1067 it/s=405.9
[e1 b2080/2315] loss=1.2571 avg=1.0936 it/s=404.0
[e1 b2311/2315] loss=1.2362 avg=1.0809 it/s=403.1


Epoch 1/12 | loss=1.0805 | val_acc=0.6331 | val_f1=0.6324 | time=96.9s
[e2 b1/2315] loss=0.8607 avg=0.8607 it/s=440.6
[e2 b2/2315] loss=1.0145 avg=0.9376 it/s=387.0


[e2 b232/2315] loss=1.1006 avg=1.0439 it/s=382.1
[e2 b463/2315] loss=0.9930 avg=1.1233 it/s=391.6
[e2 b694/2315] loss=1.6917 avg=1.1545 it/s=401.0
[e2 b925/2315] loss=1.4815 avg=1.1952 it/s=400.8
[e2 b1156/2315] loss=1.5614 avg=1.2741 it/s=402.0
[e2 b1387/2315] loss=1.5988 avg=1.3264 it/s=406.1
[e2 b1618/2315] loss=1.6429 avg=1.3617 it/s=410.2
[e2 b1849/2315] loss=1.5705 avg=1.3877 it/s=412.8
[e2 b2080/2315] loss=1.6240 avg=1.4094 it/s=407.0
[e2 b2311/2315] loss=1.7310 avg=1.4266 it/s=403.0


Epoch 2/12 | loss=1.4266 | val_acc=0.2775 | val_f1=0.0869 | time=97.0s
[e3 b1/2315] loss=1.7584 avg=1.7584 it/s=446.1
[e3 b2/2315] loss=1.6628 avg=1.7106 it/s=399.8


[e3 b232/2315] loss=1.5122 avg=1.5793 it/s=407.5
[e3 b463/2315] loss=1.4597 avg=1.5777 it/s=424.2
[e3 b694/2315] loss=1.6354 avg=1.5791 it/s=424.6
[e3 b925/2315] loss=1.6445 avg=1.5785 it/s=425.5
[e3 b1156/2315] loss=1.5605 avg=1.5783 it/s=425.6
[e3 b1387/2315] loss=1.6008 avg=1.5778 it/s=423.5
[e3 b1618/2315] loss=1.5495 avg=1.5779 it/s=420.4
[e3 b1849/2315] loss=1.5468 avg=1.5774 it/s=422.8
[e3 b2080/2315] loss=1.7009 avg=1.5769 it/s=424.8
[e3 b2311/2315] loss=1.5441 avg=1.5770 it/s=426.6


Epoch 3/12 | loss=1.5770 | val_acc=0.2775 | val_f1=0.0869 | time=91.6s
[e4 b1/2315] loss=1.5588 avg=1.5588 it/s=648.3
[e4 b2/2315] loss=1.4715 avg=1.5151 it/s=566.5


[e4 b232/2315] loss=1.4596 avg=1.5695 it/s=394.2
[e4 b463/2315] loss=1.5038 avg=1.5727 it/s=386.2
[e4 b694/2315] loss=1.4902 avg=1.5774 it/s=387.2
[e4 b925/2315] loss=1.5758 avg=1.5766 it/s=393.0
[e4 b1156/2315] loss=1.5407 avg=1.5767 it/s=398.9
[e4 b1387/2315] loss=1.6416 avg=1.5760 it/s=404.0
[e4 b1618/2315] loss=1.5858 avg=1.5756 it/s=407.9
[e4 b1849/2315] loss=1.6745 avg=1.5752 it/s=409.9
[e4 b2080/2315] loss=1.5896 avg=1.5755 it/s=409.0
[e4 b2311/2315] loss=1.6489 avg=1.5760 it/s=409.0


Epoch 4/12 | loss=1.5760 | val_acc=0.2775 | val_f1=0.0869 | time=95.4s
[e5 b1/2315] loss=1.6295 avg=1.6295 it/s=441.4
[e5 b2/2315] loss=1.5107 avg=1.5701 it/s=457.5


[e5 b232/2315] loss=1.6117 avg=1.5742 it/s=448.8
[e5 b463/2315] loss=1.6349 avg=1.5755 it/s=451.8
[e5 b694/2315] loss=1.5316 avg=1.5770 it/s=449.1
[e5 b925/2315] loss=1.5955 avg=1.5771 it/s=440.6
[e5 b1156/2315] loss=1.6136 avg=1.5775 it/s=435.5
[e5 b1387/2315] loss=1.5701 avg=1.5770 it/s=428.3
[e5 b1618/2315] loss=1.5398 avg=1.5771 it/s=422.5
[e5 b1849/2315] loss=1.5450 avg=1.5766 it/s=415.2
[e5 b2080/2315] loss=1.5623 avg=1.5763 it/s=411.9
[e5 b2311/2315] loss=1.5766 avg=1.5760 it/s=411.3


[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


Early stopping at epoch 5


Run history:
epoch,▁▁▁▁▃▃▃▃▃▅▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆▆▆████████████
lr,█▆▅▃▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▁▂▂▂▂▂▂▃▁▂▂▂▃▃▁▁▁▂▂▂▂▂▂▂▆▁▂▂▂▂▂▂█▁▂▂▃
time/epoch_sec,██▁▆▆
train/avg_loss_so_far,▇▆▅▄▄▃▃▃▃▁▂▃▃▄▅▅▅██▇▇▇▇▇▇▇▆▇▇▇▇▇▇▇▇▇▇▇▇▇
train/epoch_loss,▁▆███
train/items_per_sec,▁▂▄▄▄▄▄▄▄▅▄▄▄▄▄▄▄▅▄▄▅▅▅▅▅▅█▇▄▄▄▄▄▄▅▅▅▅▄▄

Run summary:
best_checkpoint_path,checkpoints\best_mic...
best_val_f1,0.63235
epoch,5
lr,0.00012
params/ratio,0.2055
params/total,278813189
params/trainable,57297413
step,11571
time/epoch_sec,95.08015
train/avg_loss_so_far,1.57602


Best trial: 11. Best value: 0.873175:  65%|██████▌   | 13/20 [2:34:05<1:28:24, 757.76s/it]

[Trial 12] f1=0.6324 | unfreeze_k=8 lr=1.88e-04 wd=4.6e-06 suggested_bs=64
[I 2025-08-16 04:57:36,613] Trial 12 finished with value: 0.6323546484809632 and parameters: {'num_unfreeze_last_layers': 8, 'lr': 0.00018765508178568579, 'weight_decay': 4.610599546574859e-06, 'batch_size': 64}. Best is trial 11 with value: 0.873174802426709.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 64,385,285 / 278,813,189 (23.09%) ; unfreeze_last_k=9
[e1 b1/2315] loss=1.6643 avg=1.6643 it/s=138.1
[e1 b2/2315] loss=1.6148 avg=1.6396 it/s=204.0


[e1 b232/2315] loss=1.5087 avg=1.5796 it/s=399.6
[e1 b463/2315] loss=1.0805 avg=1.4803 it/s=404.8
[e1 b694/2315] loss=0.7252 avg=1.3590 it/s=405.9
[e1 b925/2315] loss=0.9780 avg=1.2641 it/s=405.2
[e1 b1156/2315] loss=0.9122 avg=1.1943 it/s=396.7
[e1 b1387/2315] loss=1.5636 avg=1.1391 it/s=390.1
[e1 b1618/2315] loss=0.6595 avg=1.0977 it/s=387.3
[e1 b1849/2315] loss=0.7155 avg=1.0691 it/s=386.9
[e1 b2080/2315] loss=0.9394 avg=1.0399 it/s=386.4
[e1 b2311/2315] loss=0.6763 avg=1.0142 it/s=386.2


Epoch 1/12 | loss=1.0139 | val_acc=0.7004 | val_f1=0.7132 | time=100.9s
[e2 b1/2315] loss=0.9537 avg=0.9537 it/s=409.4
[e2 b2/2315] loss=0.9445 avg=0.9491 it/s=437.7


[e2 b232/2315] loss=1.0091 avg=0.7547 it/s=413.1
[e2 b463/2315] loss=0.6166 avg=0.7336 it/s=399.2
[e2 b694/2315] loss=0.9079 avg=0.7254 it/s=396.0
[e2 b925/2315] loss=0.4089 avg=0.7285 it/s=399.0
[e2 b1156/2315] loss=0.9085 avg=0.7309 it/s=402.0
[e2 b1387/2315] loss=1.0751 avg=0.7368 it/s=403.8
[e2 b1618/2315] loss=0.4731 avg=0.7342 it/s=400.1
[e2 b1849/2315] loss=0.9180 avg=0.7310 it/s=399.4
[e2 b2080/2315] loss=0.6241 avg=0.7256 it/s=396.8
[e2 b2311/2315] loss=0.5531 avg=0.7193 it/s=395.3


Epoch 2/12 | loss=0.7192 | val_acc=0.7536 | val_f1=0.7620 | time=98.7s
[e3 b1/2315] loss=0.7985 avg=0.7985 it/s=257.9
[e3 b2/2315] loss=0.5778 avg=0.6882 it/s=302.6


[e3 b232/2315] loss=0.5719 avg=0.6406 it/s=395.3
[e3 b463/2315] loss=0.6252 avg=0.6176 it/s=383.9
[e3 b694/2315] loss=0.4259 avg=0.6125 it/s=374.9
[e3 b925/2315] loss=1.0189 avg=0.6129 it/s=372.0
[e3 b1156/2315] loss=0.2949 avg=0.6219 it/s=375.6
[e3 b1387/2315] loss=0.1999 avg=0.6198 it/s=379.9
[e3 b1618/2315] loss=0.6791 avg=0.6140 it/s=383.8
[e3 b1849/2315] loss=0.3884 avg=0.6092 it/s=385.2
[e3 b2080/2315] loss=0.6907 avg=0.6069 it/s=385.3
[e3 b2311/2315] loss=0.9315 avg=0.6064 it/s=384.0


Epoch 3/12 | loss=0.6062 | val_acc=0.7966 | val_f1=0.8041 | time=101.5s
[e4 b1/2315] loss=0.5959 avg=0.5959 it/s=349.5
[e4 b2/2315] loss=0.4786 avg=0.5372 it/s=386.2


[e4 b232/2315] loss=0.3838 avg=0.5613 it/s=371.3
[e4 b463/2315] loss=0.4415 avg=0.5748 it/s=373.9
[e4 b694/2315] loss=0.1319 avg=0.5640 it/s=372.9
[e4 b925/2315] loss=0.1890 avg=0.5618 it/s=370.2
[e4 b1156/2315] loss=0.2809 avg=0.5590 it/s=374.7
[e4 b1387/2315] loss=0.8235 avg=0.5545 it/s=375.9
[e4 b1618/2315] loss=0.4138 avg=0.5505 it/s=378.3
[e4 b1849/2315] loss=0.7312 avg=0.5461 it/s=382.0
[e4 b2080/2315] loss=0.5785 avg=0.5451 it/s=384.6
[e4 b2311/2315] loss=0.7393 avg=0.5443 it/s=386.1


Epoch 4/12 | loss=0.5442 | val_acc=0.8049 | val_f1=0.8098 | time=101.0s
[e5 b1/2315] loss=0.4777 avg=0.4777 it/s=401.2
[e5 b2/2315] loss=0.3832 avg=0.4304 it/s=372.1


[e5 b232/2315] loss=0.4118 avg=0.5102 it/s=359.9
[e5 b463/2315] loss=0.5125 avg=0.4846 it/s=357.8
[e5 b694/2315] loss=0.4881 avg=0.4910 it/s=369.6
[e5 b925/2315] loss=0.7798 avg=0.4915 it/s=374.8
[e5 b1156/2315] loss=0.4517 avg=0.4869 it/s=379.9
[e5 b1387/2315] loss=0.5577 avg=0.4841 it/s=382.8
[e5 b1618/2315] loss=0.3298 avg=0.4833 it/s=383.5
[e5 b1849/2315] loss=0.3817 avg=0.4807 it/s=384.2
[e5 b2080/2315] loss=0.3213 avg=0.4802 it/s=386.4
[e5 b2311/2315] loss=0.7108 avg=0.4805 it/s=389.2


Epoch 5/12 | loss=0.4808 | val_acc=0.7884 | val_f1=0.7979 | time=100.0s
[e6 b1/2315] loss=0.3405 avg=0.3405 it/s=488.9
[e6 b2/2315] loss=0.6285 avg=0.4845 it/s=433.0


[e6 b232/2315] loss=0.3355 avg=0.3933 it/s=416.1
[e6 b463/2315] loss=0.2684 avg=0.4156 it/s=394.0
[e6 b694/2315] loss=0.5801 avg=0.4188 it/s=380.1
[e6 b925/2315] loss=0.4097 avg=0.4182 it/s=374.2
[e6 b1156/2315] loss=0.4919 avg=0.4144 it/s=375.7
[e6 b1387/2315] loss=0.3866 avg=0.4157 it/s=378.1
[e6 b1618/2315] loss=0.3029 avg=0.4150 it/s=381.4
[e6 b1849/2315] loss=0.4736 avg=0.4146 it/s=381.8
[e6 b2080/2315] loss=0.4332 avg=0.4125 it/s=378.7
[e6 b2311/2315] loss=0.5124 avg=0.4120 it/s=378.3


Epoch 6/12 | loss=0.4125 | val_acc=0.8251 | val_f1=0.8308 | time=102.6s
[e7 b1/2315] loss=0.0703 avg=0.0703 it/s=525.8
[e7 b2/2315] loss=0.8526 avg=0.4615 it/s=380.1


[e7 b232/2315] loss=0.2702 avg=0.4126 it/s=420.0
[e7 b463/2315] loss=0.4423 avg=0.3908 it/s=413.8
[e7 b694/2315] loss=0.6819 avg=0.3815 it/s=416.8
[e7 b925/2315] loss=0.8062 avg=0.3817 it/s=410.6
[e7 b1156/2315] loss=0.2033 avg=0.3794 it/s=403.4
[e7 b1387/2315] loss=0.2902 avg=0.3757 it/s=394.6
[e7 b1618/2315] loss=0.7532 avg=0.3738 it/s=391.1
[e7 b1849/2315] loss=0.0738 avg=0.3708 it/s=389.6
[e7 b2080/2315] loss=0.5361 avg=0.3684 it/s=388.3
[e7 b2311/2315] loss=0.9529 avg=0.3676 it/s=387.4


Epoch 7/12 | loss=0.3676 | val_acc=0.8333 | val_f1=0.8379 | time=100.3s
[e8 b1/2315] loss=0.3930 avg=0.3930 it/s=286.3
[e8 b2/2315] loss=0.1965 avg=0.2948 it/s=297.5


[e8 b232/2315] loss=0.2969 avg=0.3145 it/s=376.8
[e8 b463/2315] loss=0.5247 avg=0.3231 it/s=388.3
[e8 b694/2315] loss=0.1069 avg=0.3229 it/s=395.1
[e8 b925/2315] loss=0.2114 avg=0.3225 it/s=397.7
[e8 b1156/2315] loss=0.4742 avg=0.3200 it/s=400.6
[e8 b1387/2315] loss=0.4421 avg=0.3210 it/s=395.6
[e8 b1618/2315] loss=0.3975 avg=0.3155 it/s=386.9
[e8 b1849/2315] loss=0.5935 avg=0.3151 it/s=381.1
[e8 b2080/2315] loss=0.5876 avg=0.3156 it/s=381.7
[e8 b2311/2315] loss=0.5540 avg=0.3127 it/s=381.0


Epoch 8/12 | loss=0.3125 | val_acc=0.8550 | val_f1=0.8588 | time=102.0s
[e9 b1/2315] loss=0.0186 avg=0.0186 it/s=339.4
[e9 b2/2315] loss=0.4054 avg=0.2120 it/s=341.2


[e9 b232/2315] loss=0.7347 avg=0.2574 it/s=386.8
[e9 b463/2315] loss=0.7491 avg=0.2545 it/s=389.8
[e9 b694/2315] loss=0.4518 avg=0.2692 it/s=384.0
[e9 b925/2315] loss=0.0222 avg=0.2720 it/s=384.8
[e9 b1156/2315] loss=0.0440 avg=0.2695 it/s=387.1
[e9 b1387/2315] loss=0.2967 avg=0.2692 it/s=389.4
[e9 b1618/2315] loss=0.4574 avg=0.2708 it/s=390.3
[e9 b1849/2315] loss=0.0261 avg=0.2692 it/s=386.7
[e9 b2080/2315] loss=0.0546 avg=0.2677 it/s=386.0
[e9 b2311/2315] loss=0.0468 avg=0.2668 it/s=385.9


Epoch 9/12 | loss=0.2666 | val_acc=0.8605 | val_f1=0.8649 | time=100.9s
[e10 b1/2315] loss=0.1991 avg=0.1991 it/s=360.2
[e10 b2/2315] loss=0.0182 avg=0.1086 it/s=406.1


[e10 b232/2315] loss=0.1639 avg=0.2118 it/s=386.0
[e10 b463/2315] loss=0.5413 avg=0.2343 it/s=391.0
[e10 b694/2315] loss=0.1627 avg=0.2377 it/s=391.4
[e10 b925/2315] loss=0.0202 avg=0.2341 it/s=386.6
[e10 b1156/2315] loss=0.1209 avg=0.2322 it/s=381.9
[e10 b1387/2315] loss=0.3029 avg=0.2324 it/s=384.3
[e10 b1618/2315] loss=0.5801 avg=0.2330 it/s=389.3
[e10 b1849/2315] loss=0.0296 avg=0.2321 it/s=392.6
[e10 b2080/2315] loss=0.1710 avg=0.2288 it/s=392.6
[e10 b2311/2315] loss=0.2548 avg=0.2290 it/s=391.6


Epoch 10/12 | loss=0.2288 | val_acc=0.8632 | val_f1=0.8668 | time=99.7s
[e11 b1/2315] loss=0.0471 avg=0.0471 it/s=325.0
[e11 b2/2315] loss=0.0551 avg=0.0511 it/s=341.6


[e11 b232/2315] loss=0.1704 avg=0.1982 it/s=363.3
[e11 b463/2315] loss=0.0103 avg=0.1928 it/s=357.9
[e11 b694/2315] loss=0.1265 avg=0.2061 it/s=354.3
[e11 b925/2315] loss=0.1211 avg=0.2041 it/s=358.7
[e11 b1156/2315] loss=0.2088 avg=0.2035 it/s=363.4
[e11 b1387/2315] loss=0.0127 avg=0.1997 it/s=367.6
[e11 b1618/2315] loss=0.3502 avg=0.1996 it/s=369.6
[e11 b1849/2315] loss=0.5753 avg=0.1975 it/s=374.4
[e11 b2080/2315] loss=0.1779 avg=0.1956 it/s=378.1
[e11 b2311/2315] loss=0.4049 avg=0.1954 it/s=381.6


Epoch 11/12 | loss=0.1954 | val_acc=0.8627 | val_f1=0.8662 | time=101.9s
[e12 b1/2315] loss=0.1049 avg=0.1049 it/s=407.4
[e12 b2/2315] loss=0.0378 avg=0.0714 it/s=429.3


[e12 b232/2315] loss=0.4572 avg=0.1775 it/s=368.6
[e12 b463/2315] loss=0.3929 avg=0.1645 it/s=368.5
[e12 b694/2315] loss=0.0509 avg=0.1697 it/s=365.3
[e12 b925/2315] loss=0.4747 avg=0.1681 it/s=367.3
[e12 b1156/2315] loss=0.0461 avg=0.1660 it/s=369.2
[e12 b1387/2315] loss=0.1163 avg=0.1684 it/s=370.3
[e12 b1618/2315] loss=0.0070 avg=0.1678 it/s=373.2
[e12 b1849/2315] loss=0.0210 avg=0.1678 it/s=375.9
[e12 b2080/2315] loss=0.0162 avg=0.1665 it/s=376.5
[e12 b2311/2315] loss=0.3777 avg=0.1683 it/s=380.2


[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


Epoch 12/12 | loss=0.1681 | val_acc=0.8703 | val_f1=0.8733 | time=102.1s


Run history:
epoch,▁▁▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇████
lr,█▇▇▆▅▅▄▄▃▂▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▁▁▁▂▁▁▁▁▁▄▁▂▁▁▁▁▂▁▁▁▁▆▂▁▁▂▁▁▁▁▁▂█▁▁▁▂
time/epoch_sec,▅▁▆▅▃█▄▇▅▃▇▇
train/avg_loss_so_far,██▇▆▅▅▄▄▄▄▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▂▂▂
train/epoch_loss,█▆▅▄▄▃▃▂▂▂▁▁
train/items_per_sec,▁▆▆▅▅▆▆▆▆▂▆▅▅▅▅▅▅▅▅▅▅▆█▅▅▃▅▅▄▆▅▆▅▅▆▅▅▅▇▅

Run summary:
best_checkpoint_path,checkpoints\best_mic...
best_val_f1,0.87325
epoch,12
lr,0
params/ratio,0.23093
params/total,278813189
params/trainable,64385285
step,27776
time/epoch_sec,102.14034
train/avg_loss_so_far,0.16833


Best trial: 13. Best value: 0.873252:  70%|███████   | 14/20 [2:54:42<1:30:14, 902.47s/it]

[Trial 13] f1=0.8733 | unfreeze_k=9 lr=1.03e-04 wd=3.4e-05 suggested_bs=8
[I 2025-08-16 05:18:13,480] Trial 13 finished with value: 0.8732520507590781 and parameters: {'num_unfreeze_last_layers': 9, 'lr': 0.00010344194727767704, 'weight_decay': 3.4396049618939587e-05, 'batch_size': 8}. Best is trial 13 with value: 0.8732520507590781.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 64,385,285 / 278,813,189 (23.09%) ; unfreeze_last_k=9
[e1 b1/2315] loss=1.6409 avg=1.6409 it/s=136.7
[e1 b2/2315] loss=1.5748 avg=1.6079 it/s=188.7


[e1 b232/2315] loss=1.6867 avg=1.5434 it/s=395.4
[e1 b463/2315] loss=1.6168 avg=1.4051 it/s=382.3
[e1 b694/2315] loss=0.9030 avg=1.3068 it/s=376.2
[e1 b925/2315] loss=1.3521 avg=1.2678 it/s=372.1
[e1 b1156/2315] loss=1.2296 avg=1.2808 it/s=377.7
[e1 b1387/2315] loss=1.6142 avg=1.3012 it/s=381.6
[e1 b1618/2315] loss=1.4930 avg=1.3424 it/s=384.7
[e1 b1849/2315] loss=1.5065 avg=1.3733 it/s=386.4
[e1 b2080/2315] loss=1.4994 avg=1.3964 it/s=389.2
[e1 b2311/2315] loss=1.5668 avg=1.4149 it/s=390.3


Epoch 1/12 | loss=1.4153 | val_acc=0.2775 | val_f1=0.0869 | time=99.5s
[e2 b1/2315] loss=1.5454 avg=1.5454 it/s=392.8
[e2 b2/2315] loss=1.5869 avg=1.5662 it/s=385.7


[e2 b232/2315] loss=1.6426 avg=1.5827 it/s=420.7
[e2 b463/2315] loss=1.5391 avg=1.5796 it/s=420.3
[e2 b694/2315] loss=1.5538 avg=1.5796 it/s=420.7
[e2 b925/2315] loss=1.5417 avg=1.5789 it/s=411.4
[e2 b1156/2315] loss=1.5335 avg=1.5779 it/s=401.3
[e2 b1387/2315] loss=1.5434 avg=1.5772 it/s=396.8
[e2 b1618/2315] loss=1.6870 avg=1.5780 it/s=394.5
[e2 b1849/2315] loss=1.6454 avg=1.5787 it/s=390.5
[e2 b2080/2315] loss=1.6884 avg=1.5780 it/s=388.8
[e2 b2311/2315] loss=1.4932 avg=1.5777 it/s=389.0


Epoch 2/12 | loss=1.5777 | val_acc=0.2775 | val_f1=0.0869 | time=100.0s
[e3 b1/2315] loss=1.5056 avg=1.5056 it/s=459.8
[e3 b2/2315] loss=1.4698 avg=1.4877 it/s=373.1


[e3 b232/2315] loss=1.6377 avg=1.5727 it/s=393.3
[e3 b463/2315] loss=1.5586 avg=1.5787 it/s=389.4
[e3 b694/2315] loss=1.4082 avg=1.5743 it/s=395.8
[e3 b925/2315] loss=1.6990 avg=1.5760 it/s=399.6
[e3 b1156/2315] loss=1.5059 avg=1.5752 it/s=400.9
[e3 b1387/2315] loss=1.6535 avg=1.5757 it/s=394.5
[e3 b1618/2315] loss=1.5527 avg=1.5762 it/s=389.1
[e3 b1849/2315] loss=1.6100 avg=1.5756 it/s=388.2
[e3 b2080/2315] loss=1.5211 avg=1.5761 it/s=388.7
[e3 b2311/2315] loss=1.6318 avg=1.5764 it/s=387.5


Epoch 3/12 | loss=1.5764 | val_acc=0.2775 | val_f1=0.0869 | time=100.4s
[e4 b1/2315] loss=1.6071 avg=1.6071 it/s=531.4
[e4 b2/2315] loss=1.6536 avg=1.6304 it/s=435.6


[e4 b232/2315] loss=1.4876 avg=1.5763 it/s=394.7
[e4 b463/2315] loss=1.5368 avg=1.5759 it/s=397.4
[e4 b694/2315] loss=1.5368 avg=1.5754 it/s=398.4
[e4 b925/2315] loss=1.5404 avg=1.5760 it/s=397.3
[e4 b1156/2315] loss=1.6068 avg=1.5755 it/s=400.2
[e4 b1387/2315] loss=1.5427 avg=1.5768 it/s=402.0
[e4 b1618/2315] loss=1.5924 avg=1.5775 it/s=403.1
[e4 b1849/2315] loss=1.5744 avg=1.5766 it/s=400.5
[e4 b2080/2315] loss=1.5598 avg=1.5760 it/s=398.4
[e4 b2311/2315] loss=1.5969 avg=1.5761 it/s=396.1


Epoch 4/12 | loss=1.5761 | val_acc=0.2775 | val_f1=0.0869 | time=98.4s
[e5 b1/2315] loss=1.6475 avg=1.6475 it/s=304.5
[e5 b2/2315] loss=1.6320 avg=1.6398 it/s=361.8


[e5 b232/2315] loss=1.5366 avg=1.5768 it/s=396.1
[e5 b463/2315] loss=1.4937 avg=1.5767 it/s=404.1
[e5 b694/2315] loss=1.6736 avg=1.5779 it/s=407.1
[e5 b925/2315] loss=1.5456 avg=1.5754 it/s=399.4
[e5 b1156/2315] loss=1.5562 avg=1.5756 it/s=397.0
[e5 b1387/2315] loss=1.5684 avg=1.5765 it/s=398.1
[e5 b1618/2315] loss=1.6767 avg=1.5769 it/s=400.8
[e5 b1849/2315] loss=1.7046 avg=1.5768 it/s=403.3
[e5 b2080/2315] loss=1.5137 avg=1.5769 it/s=405.6
[e5 b2311/2315] loss=1.4953 avg=1.5759 it/s=406.8


[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


Early stopping at epoch 5


Run history:
epoch,▁▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▃▃▃▃▅▅▅▅▅▆▆▆▆▆▆▆████████
lr,█▆▅▃▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▂▂▂▂▂▂▃▃▁▃▃▁▂▂▂▂▅▂▂▂▂▂▃▃▇▁▁▂█▂▂▂▃▁▁▁▂
time/epoch_sec,▇▇█▅▁
train/avg_loss_so_far,█▄▂▁▁▃▃▇▇▇▇▇▇▇▅▇▇▇▇▇▇▇▇▇█▇▇▇▇▇▇██▇▇▇▇▇▇▇
train/epoch_loss,▁████
train/items_per_sec,▁▂▆▅▅▅▅▅▆▆▆▆▆▆▆▅▇▆▅▆▆▆▅▅▅█▆▆▆▆▆▆▆▆▄▆▆▆▆▆

Run summary:
best_checkpoint_path,checkpoints\best_mic...
best_val_f1,0.08688
epoch,5
lr,0.00018
params/ratio,0.23093
params/total,278813189
params/trainable,64385285
step,11571
time/epoch_sec,95.82265
train/avg_loss_so_far,1.57592


Best trial: 13. Best value: 0.873252:  75%|███████▌  | 15/20 [3:03:06<1:05:12, 782.51s/it]

[Trial 14] f1=0.0869 | unfreeze_k=9 lr=2.87e-04 wd=4.2e-06 suggested_bs=8
[I 2025-08-16 05:26:37,992] Trial 14 finished with value: 0.08687713959680486 and parameters: {'num_unfreeze_last_layers': 9, 'lr': 0.00028664415955693967, 'weight_decay': 4.169196508305699e-06, 'batch_size': 8}. Best is trial 13 with value: 0.8732520507590781.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 64,385,285 / 278,813,189 (23.09%) ; unfreeze_last_k=9
[e1 b1/2315] loss=1.5346 avg=1.5346 it/s=137.7
[e1 b2/2315] loss=1.5595 avg=1.5471 it/s=191.3


[e1 b232/2315] loss=1.5076 avg=1.5410 it/s=374.7
[e1 b463/2315] loss=1.5271 avg=1.5104 it/s=379.0
[e1 b694/2315] loss=1.6102 avg=1.5351 it/s=383.0
[e1 b925/2315] loss=1.7241 avg=1.5477 it/s=385.9
[e1 b1156/2315] loss=1.4760 avg=1.5557 it/s=384.8
[e1 b1387/2315] loss=1.5452 avg=1.5608 it/s=384.9
[e1 b1618/2315] loss=1.4913 avg=1.5647 it/s=387.3
[e1 b1849/2315] loss=1.6120 avg=1.5669 it/s=390.2
[e1 b2080/2315] loss=1.5931 avg=1.5684 it/s=391.8
[e1 b2311/2315] loss=1.7207 avg=1.5701 it/s=393.2


Epoch 1/12 | loss=1.5701 | val_acc=0.2775 | val_f1=0.0869 | time=99.0s
[e2 b1/2315] loss=1.6194 avg=1.6194 it/s=366.7
[e2 b2/2315] loss=1.5991 avg=1.6092 it/s=377.9


[e2 b232/2315] loss=1.6668 avg=1.5816 it/s=379.4
[e2 b463/2315] loss=1.5586 avg=1.5816 it/s=364.7
[e2 b694/2315] loss=1.6562 avg=1.5815 it/s=359.2
[e2 b925/2315] loss=1.5274 avg=1.5822 it/s=364.2
[e2 b1156/2315] loss=1.5360 avg=1.5815 it/s=366.5
[e2 b1387/2315] loss=1.4732 avg=1.5801 it/s=370.3
[e2 b1618/2315] loss=1.5180 avg=1.5803 it/s=372.7
[e2 b1849/2315] loss=1.4818 avg=1.5797 it/s=374.3
[e2 b2080/2315] loss=1.4829 avg=1.5798 it/s=376.9
[e2 b2311/2315] loss=1.5224 avg=1.5792 it/s=379.1


Epoch 2/12 | loss=1.5793 | val_acc=0.2775 | val_f1=0.0869 | time=102.5s
[e3 b1/2315] loss=1.5624 avg=1.5624 it/s=412.6
[e3 b2/2315] loss=1.5599 avg=1.5611 it/s=424.3


[e3 b232/2315] loss=1.7709 avg=1.5758 it/s=403.7
[e3 b463/2315] loss=1.5236 avg=1.5800 it/s=400.5
[e3 b694/2315] loss=1.6480 avg=1.5804 it/s=386.0
[e3 b925/2315] loss=1.6711 avg=1.5793 it/s=382.5
[e3 b1156/2315] loss=1.5469 avg=1.5795 it/s=384.4
[e3 b1387/2315] loss=1.6139 avg=1.5800 it/s=384.3
[e3 b1618/2315] loss=1.5051 avg=1.5787 it/s=384.2
[e3 b1849/2315] loss=1.5730 avg=1.5790 it/s=385.0
[e3 b2080/2315] loss=1.6501 avg=1.5783 it/s=387.0
[e3 b2311/2315] loss=1.5629 avg=1.5783 it/s=387.1


Epoch 3/12 | loss=1.5784 | val_acc=0.2775 | val_f1=0.0869 | time=100.4s
[e4 b1/2315] loss=1.5756 avg=1.5756 it/s=386.2
[e4 b2/2315] loss=1.6278 avg=1.6017 it/s=407.1


[e4 b232/2315] loss=1.5228 avg=1.5798 it/s=394.0
[e4 b463/2315] loss=1.5855 avg=1.5767 it/s=400.5
[e4 b694/2315] loss=1.5701 avg=1.5772 it/s=402.3
[e4 b925/2315] loss=1.5775 avg=1.5792 it/s=405.2
[e4 b1156/2315] loss=1.5559 avg=1.5789 it/s=402.2
[e4 b1387/2315] loss=1.5649 avg=1.5789 it/s=395.5
[e4 b1618/2315] loss=1.6140 avg=1.5786 it/s=391.4
[e4 b1849/2315] loss=1.6234 avg=1.5788 it/s=386.7
[e4 b2080/2315] loss=1.6995 avg=1.5775 it/s=386.5
[e4 b2311/2315] loss=1.6051 avg=1.5777 it/s=387.3


Epoch 4/12 | loss=1.5776 | val_acc=0.2775 | val_f1=0.0869 | time=100.5s
[e5 b1/2315] loss=1.5206 avg=1.5206 it/s=278.2
[e5 b2/2315] loss=1.6115 avg=1.5661 it/s=308.9


[e5 b232/2315] loss=1.4688 avg=1.5657 it/s=379.5
[e5 b463/2315] loss=1.5617 avg=1.5723 it/s=384.9
[e5 b694/2315] loss=1.5938 avg=1.5731 it/s=386.8
[e5 b925/2315] loss=1.5695 avg=1.5748 it/s=393.8
[e5 b1156/2315] loss=1.6238 avg=1.5753 it/s=397.5
[e5 b1387/2315] loss=1.6349 avg=1.5754 it/s=399.3
[e5 b1618/2315] loss=1.6670 avg=1.5758 it/s=398.5
[e5 b1849/2315] loss=1.6098 avg=1.5768 it/s=391.3
[e5 b2080/2315] loss=1.4929 avg=1.5771 it/s=386.9
[e5 b2311/2315] loss=1.6209 avg=1.5772 it/s=385.6


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),
wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Early stopping at epoch 5


metric,history
epoch,▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▅▅▅▅▅▆▆▆▆▆▆██████
lr,█▆▄▃▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▂▂▃▃▃▃▂▂▂▄▂▃▅▁▁▁▂▂▇▂▃▃▁▁▂▂█▂▂▂▂▃▁▁▂▂▃
time/epoch_sec,▁█▄▄▅
train/avg_loss_so_far,▃▃▃▁▄▄▅▅▅█▆▆▆▅▅▅▄▄▅▅▅▅▅▅▅▅▅▅▅▅▅▂▅▅▅▅▅▅▅▅
train/epoch_loss,▁█▇▇▆
train/items_per_sec,▁▇▇▇▇▇▇▇▇▆▆▆▆▆▆▇██▇▇▇▇▇▇▇▇▇▇▇▇▇▇▄▅▇▇▇▇▇▇

metric,value
best_checkpoint_path,checkpoints\best_mic...
best_val_f1,0.08688
epoch,5
lr,0.00034
params/ratio,0.23093
params/total,278813189
params/trainable,64385285
step,11571
time/epoch_sec,100.89559
train/avg_loss_so_far,1.57719


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),
Best trial: 13. Best value: 0.873252:  80%|████████  | 16/20 [3:11:40<46:46, 701.68s/it]  

[Trial 15] f1=0.0869 | unfreeze_k=9 lr=5.41e-04 wd=8.3e-06 suggested_bs=8
[I 2025-08-16 05:35:11,967] Trial 15 finished with value: 0.08687713959680486 and parameters: {'num_unfreeze_last_layers': 9, 'lr': 0.0005412386442534567, 'weight_decay': 8.343157852100824e-06, 'batch_size': 8}. Best is trial 13 with value: 0.8732520507590781.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  scaler = GradScaler(enabled=(DEVICE == "cuda" and USE_AMP))


Trainable params: 64,385,285 / 278,813,189 (23.09%) ; unfreeze_last_k=9
[e1 b1/2315] loss=1.5989 avg=1.5989 it/s=292.1
[e1 b2/2315] loss=1.7152 avg=1.6570 it/s=343.6


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e1 b232/2315] loss=1.6101 avg=1.5693 it/s=379.1
[e1 b463/2315] loss=0.9628 avg=1.4449 it/s=384.7
[e1 b694/2315] loss=0.9013 avg=1.3227 it/s=389.7
[e1 b925/2315] loss=0.9742 avg=1.2395 it/s=393.2
[e1 b1156/2315] loss=0.7871 avg=1.1903 it/s=397.5
[e1 b1387/2315] loss=0.3955 avg=1.1538 it/s=401.9
[e1 b1618/2315] loss=1.8110 avg=1.1318 it/s=404.4
[e1 b1849/2315] loss=1.6801 avg=1.1289 it/s=404.1
[e1 b2080/2315] loss=1.6100 avg=1.1681 it/s=398.6
[e1 b2311/2315] loss=1.5448 avg=1.2098 it/s=393.9


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


Epoch 1/12 | loss=1.2105 | val_acc=0.2775 | val_f1=0.0869 | time=99.1s
[e2 b1/2315] loss=1.7106 avg=1.7106 it/s=443.1
[e2 b2/2315] loss=1.6065 avg=1.6585 it/s=446.6


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e2 b232/2315] loss=1.6524 avg=1.5785 it/s=400.1
[e2 b463/2315] loss=1.6319 avg=1.5789 it/s=407.2
[e2 b694/2315] loss=1.6252 avg=1.5773 it/s=404.1
[e2 b925/2315] loss=1.4666 avg=1.5776 it/s=400.4
[e2 b1156/2315] loss=1.5610 avg=1.5782 it/s=394.9
[e2 b1387/2315] loss=1.6402 avg=1.5783 it/s=388.1
[e2 b1618/2315] loss=1.6044 avg=1.5790 it/s=391.5
[e2 b1849/2315] loss=1.6924 avg=1.5791 it/s=394.9
[e2 b2080/2315] loss=1.5308 avg=1.5794 it/s=397.7
[e2 b2311/2315] loss=1.6010 avg=1.5792 it/s=398.7


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


Epoch 2/12 | loss=1.5791 | val_acc=0.2775 | val_f1=0.0869 | time=98.0s
[e3 b1/2315] loss=1.6315 avg=1.6315 it/s=285.9
[e3 b2/2315] loss=1.5511 avg=1.5913 it/s=314.2


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e3 b232/2315] loss=1.5250 avg=1.5762 it/s=347.1
[e3 b463/2315] loss=1.5635 avg=1.5762 it/s=360.6
[e3 b694/2315] loss=1.5632 avg=1.5798 it/s=366.1
[e3 b925/2315] loss=1.3925 avg=1.5785 it/s=366.2
[e3 b1156/2315] loss=1.4867 avg=1.5773 it/s=369.1
[e3 b1387/2315] loss=1.5945 avg=1.5779 it/s=372.4
[e3 b1618/2315] loss=1.6295 avg=1.5775 it/s=374.4
[e3 b1849/2315] loss=1.5891 avg=1.5774 it/s=375.2
[e3 b2080/2315] loss=1.5631 avg=1.5774 it/s=377.6
[e3 b2311/2315] loss=1.4891 avg=1.5771 it/s=379.7


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


Epoch 3/12 | loss=1.5772 | val_acc=0.2775 | val_f1=0.0869 | time=102.2s
[e4 b1/2315] loss=1.5638 avg=1.5638 it/s=414.2
[e4 b2/2315] loss=1.5372 avg=1.5505 it/s=439.2


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e4 b232/2315] loss=1.5291 avg=1.5760 it/s=396.9
[e4 b463/2315] loss=1.6895 avg=1.5769 it/s=376.5
[e4 b694/2315] loss=1.5361 avg=1.5753 it/s=367.5
[e4 b925/2315] loss=1.5203 avg=1.5754 it/s=362.7
[e4 b1156/2315] loss=1.4371 avg=1.5747 it/s=368.6
[e4 b1387/2315] loss=1.5221 avg=1.5749 it/s=370.4
[e4 b1618/2315] loss=1.5479 avg=1.5760 it/s=374.9
[e4 b1849/2315] loss=1.6011 avg=1.5763 it/s=376.4
[e4 b2080/2315] loss=1.6386 avg=1.5766 it/s=379.6
[e4 b2311/2315] loss=1.5849 avg=1.5765 it/s=380.9


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


Epoch 4/12 | loss=1.5764 | val_acc=0.2775 | val_f1=0.0869 | time=101.9s
[e5 b1/2315] loss=1.5676 avg=1.5676 it/s=689.2
[e5 b2/2315] loss=1.6071 avg=1.5873 it/s=429.5


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e5 b232/2315] loss=1.5603 avg=1.5734 it/s=402.8
[e5 b463/2315] loss=1.6199 avg=1.5744 it/s=398.2
[e5 b694/2315] loss=1.5838 avg=1.5771 it/s=397.2
[e5 b925/2315] loss=1.4901 avg=1.5784 it/s=395.8
[e5 b1156/2315] loss=1.6344 avg=1.5767 it/s=385.0
[e5 b1387/2315] loss=1.5528 avg=1.5773 it/s=381.3
[e5 b1618/2315] loss=1.6099 avg=1.5773 it/s=381.6
[e5 b1849/2315] loss=1.5161 avg=1.5768 it/s=381.6
[e5 b2080/2315] loss=1.6195 avg=1.5769 it/s=380.4
[e5 b2311/2315] loss=1.5403 avg=1.5760 it/s=380.9


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),
wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Early stopping at epoch 5


metric,history
epoch,▁▁▁▁▁▁▃▃▃▃▃▃▃▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆▆▆▆▆████████
lr,█▆▄▃▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▁▂▂▂▂▁▁▁▁▂▂▂▂▂▂▁▁▂▂▂▂▅▂▂▂▂▁▁▁▁▁▂▂▂▂▂█
time/epoch_sec,▃▁█▇█
train/avg_loss_so_far,▇█▇▅▄▂▁▁▁▂█▇▇▇▇▇▇█▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
train/epoch_loss,▁████
train/items_per_sec,▅▅▆▆▆▆▅█▆▆▅▅▅▅▆▁▃▄▄▄▄▄▄▅▆▅▄▄▄▄▄▄▇▆▆▅▅▅▅▅

metric,value
best_checkpoint_path,checkpoints\best_mic...
best_val_f1,0.08688
epoch,5
lr,0.00013
params/ratio,0.23093
params/total,278813189
params/trainable,64385285
step,11571
time/epoch_sec,102.09456
train/avg_loss_so_far,1.57601


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),
Best trial: 13. Best value: 0.873252:  85%|████████▌ | 17/20 [3:20:13<32:14, 644.90s/it]

[Trial 16] f1=0.0869 | unfreeze_k=9 lr=2.14e-04 wd=5.2e-05 suggested_bs=8
[I 2025-08-16 05:43:44,825] Trial 16 finished with value: 0.08687713959680486 and parameters: {'num_unfreeze_last_layers': 9, 'lr': 0.00021406994305558442, 'weight_decay': 5.182521308036029e-05, 'batch_size': 8}. Best is trial 13 with value: 0.8732520507590781.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  scaler = GradScaler(enabled=(DEVICE == "cuda" and USE_AMP))


Trainable params: 64,385,285 / 278,813,189 (23.09%) ; unfreeze_last_k=9
[e1 b1/2315] loss=1.6203 avg=1.6203 it/s=124.6
[e1 b2/2315] loss=1.6982 avg=1.6593 it/s=177.1


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e1 b232/2315] loss=1.6761 avg=1.5913 it/s=386.3
[e1 b463/2315] loss=1.5770 avg=1.5957 it/s=393.2
[e1 b694/2315] loss=1.5673 avg=1.5933 it/s=394.3
[e1 b925/2315] loss=1.6086 avg=1.5914 it/s=394.2
[e1 b1156/2315] loss=1.5075 avg=1.5901 it/s=390.6
[e1 b1387/2315] loss=1.5861 avg=1.5900 it/s=384.3
[e1 b1618/2315] loss=1.5267 avg=1.5888 it/s=384.1
[e1 b1849/2315] loss=1.6928 avg=1.5871 it/s=385.0
[e1 b2080/2315] loss=1.4814 avg=1.5866 it/s=382.2
[e1 b2311/2315] loss=1.5306 avg=1.5857 it/s=379.3


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


Epoch 1/12 | loss=1.5857 | val_acc=0.2775 | val_f1=0.0869 | time=102.5s
[e2 b1/2315] loss=1.6595 avg=1.6595 it/s=316.6
[e2 b2/2315] loss=1.5931 avg=1.6263 it/s=330.2


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e2 b232/2315] loss=1.6060 avg=1.5800 it/s=390.2
[e2 b463/2315] loss=1.6108 avg=1.5809 it/s=385.2
[e2 b694/2315] loss=1.5600 avg=1.5817 it/s=392.3
[e2 b925/2315] loss=1.5686 avg=1.5803 it/s=396.3
[e2 b1156/2315] loss=1.5358 avg=1.5798 it/s=397.9
[e2 b1387/2315] loss=1.5457 avg=1.5805 it/s=396.5
[e2 b1618/2315] loss=1.5262 avg=1.5791 it/s=389.1
[e2 b1849/2315] loss=1.5435 avg=1.5788 it/s=385.0
[e2 b2080/2315] loss=1.7093 avg=1.5781 it/s=384.1
[e2 b2311/2315] loss=1.5414 avg=1.5769 it/s=384.7


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


Epoch 2/12 | loss=1.5770 | val_acc=0.2775 | val_f1=0.0869 | time=101.2s
[e3 b1/2315] loss=1.5864 avg=1.5864 it/s=416.3
[e3 b2/2315] loss=1.4722 avg=1.5293 it/s=449.1


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e3 b232/2315] loss=1.5167 avg=1.5704 it/s=394.8
[e3 b463/2315] loss=1.6761 avg=1.5754 it/s=400.7
[e3 b694/2315] loss=1.4544 avg=1.5741 it/s=401.5
[e3 b925/2315] loss=1.5622 avg=1.5744 it/s=402.2
[e3 b1156/2315] loss=1.5375 avg=1.5744 it/s=402.6
[e3 b1387/2315] loss=1.7032 avg=1.5740 it/s=405.9
[e3 b1618/2315] loss=1.5970 avg=1.5745 it/s=408.3
[e3 b1849/2315] loss=1.5934 avg=1.5757 it/s=408.6
[e3 b2080/2315] loss=1.5098 avg=1.5751 it/s=404.5
[e3 b2311/2315] loss=1.5430 avg=1.5754 it/s=400.2


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


Epoch 3/12 | loss=1.5752 | val_acc=0.2775 | val_f1=0.0869 | time=97.5s
[e4 b1/2315] loss=1.5548 avg=1.5548 it/s=479.9
[e4 b2/2315] loss=1.5427 avg=1.5488 it/s=401.6


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e4 b232/2315] loss=1.6602 avg=1.5711 it/s=378.7
[e4 b463/2315] loss=1.4953 avg=1.5733 it/s=393.7
[e4 b694/2315] loss=1.4728 avg=1.5752 it/s=391.4
[e4 b925/2315] loss=1.6954 avg=1.5743 it/s=391.4
[e4 b1156/2315] loss=1.6849 avg=1.5755 it/s=387.0
[e4 b1387/2315] loss=1.5441 avg=1.5750 it/s=388.5
[e4 b1618/2315] loss=1.5284 avg=1.5750 it/s=390.8
[e4 b1849/2315] loss=1.4916 avg=1.5748 it/s=394.3
[e4 b2080/2315] loss=1.5810 avg=1.5752 it/s=397.3
[e4 b2311/2315] loss=1.5450 avg=1.5756 it/s=399.7


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


Epoch 4/12 | loss=1.5757 | val_acc=0.2775 | val_f1=0.0869 | time=97.5s
[e5 b1/2315] loss=1.4974 avg=1.4974 it/s=327.3
[e5 b2/2315] loss=1.5242 avg=1.5108 it/s=398.7


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e5 b232/2315] loss=1.5675 avg=1.5732 it/s=374.2
[e5 b463/2315] loss=1.5198 avg=1.5744 it/s=373.9
[e5 b694/2315] loss=1.6139 avg=1.5741 it/s=370.2
[e5 b925/2315] loss=1.7067 avg=1.5733 it/s=371.5
[e5 b1156/2315] loss=1.7079 avg=1.5744 it/s=366.8
[e5 b1387/2315] loss=1.6235 avg=1.5747 it/s=365.5
[e5 b1618/2315] loss=1.5591 avg=1.5750 it/s=368.2
[e5 b1849/2315] loss=1.5382 avg=1.5747 it/s=370.8
[e5 b2080/2315] loss=1.6275 avg=1.5748 it/s=372.9
[e5 b2311/2315] loss=1.5980 avg=1.5752 it/s=376.4


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),
wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Early stopping at epoch 5


metric,history
epoch,▁▁▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆▆██████
lr,█▆▄▃▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▂▂▂▃▁▂▂▂▂▂▂▃▃▅▃▁▁▁▇▂▂▇▃█▁▁▂▂▂▃▃▃▁▂▂▃▃▃
time/epoch_sec,▇▆▁▁█
train/avg_loss_so_far,▆▅▅▅▅▅▅▅█▇▅▅▅▅▄▅▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▁▂▄▄▄▄▄▄▄
train/epoch_loss,█▂▁▁▁
train/items_per_sec,▁▂▇▇▇▇▆▅▇▇▇▇▇▇▇█▇▇▇▇▇▇▇▇▇▇▆▇▇▇▇▇▇▇▅▆▆▆▆▆

metric,value
best_checkpoint_path,checkpoints\best_mic...
best_val_f1,0.08688
epoch,5
lr,0.00113
params/ratio,0.23093
params/total,278813189
params/trainable,64385285
step,11571
time/epoch_sec,103.1572
train/avg_loss_so_far,1.57524


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),
Best trial: 13. Best value: 0.873252:  90%|█████████ | 18/20 [3:28:46<20:10, 605.05s/it]

[Trial 17] f1=0.0869 | unfreeze_k=9 lr=1.82e-03 wd=1.9e-05 suggested_bs=8
[I 2025-08-16 05:52:17,111] Trial 17 finished with value: 0.08687713959680486 and parameters: {'num_unfreeze_last_layers': 9, 'lr': 0.0018193251310341153, 'weight_decay': 1.8921760002592513e-05, 'batch_size': 8}. Best is trial 13 with value: 0.8732520507590781.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  scaler = GradScaler(enabled=(DEVICE == "cuda" and USE_AMP))


Trainable params: 57,297,413 / 278,813,189 (20.55%) ; unfreeze_last_k=8
[e1 b1/2315] loss=1.6091 avg=1.6091 it/s=294.9
[e1 b2/2315] loss=1.6361 avg=1.6226 it/s=340.5


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e1 b232/2315] loss=1.7800 avg=1.6097 it/s=412.4
[e1 b463/2315] loss=1.5634 avg=1.6074 it/s=406.6
[e1 b694/2315] loss=1.6055 avg=1.6009 it/s=398.2
[e1 b925/2315] loss=1.7620 avg=1.6009 it/s=394.7
[e1 b1156/2315] loss=1.4718 avg=1.6035 it/s=400.6
[e1 b1387/2315] loss=1.5404 avg=1.5998 it/s=403.3
[e1 b1618/2315] loss=1.6599 avg=1.5955 it/s=406.1
[e1 b1849/2315] loss=1.6199 avg=1.5925 it/s=406.7
[e1 b2080/2315] loss=1.6979 avg=1.5908 it/s=406.9
[e1 b2311/2315] loss=1.6889 avg=1.5894 it/s=405.8


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


Epoch 1/12 | loss=1.5893 | val_acc=0.2775 | val_f1=0.0869 | time=96.1s
[e2 b1/2315] loss=1.4611 avg=1.4611 it/s=353.5
[e2 b2/2315] loss=1.5447 avg=1.5029 it/s=416.9


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e2 b232/2315] loss=1.5334 avg=1.5752 it/s=439.2
[e2 b463/2315] loss=1.5187 avg=1.5803 it/s=439.7
[e2 b694/2315] loss=1.5387 avg=1.5792 it/s=442.3
[e2 b925/2315] loss=1.6701 avg=1.5786 it/s=435.0
[e2 b1156/2315] loss=1.5515 avg=1.5784 it/s=419.5
[e2 b1387/2315] loss=1.6537 avg=1.5766 it/s=414.9
[e2 b1618/2315] loss=1.4913 avg=1.5767 it/s=416.7
[e2 b1849/2315] loss=1.5390 avg=1.5760 it/s=420.3
[e2 b2080/2315] loss=1.5108 avg=1.5759 it/s=421.9
[e2 b2311/2315] loss=1.5044 avg=1.5761 it/s=422.5


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


Epoch 2/12 | loss=1.5761 | val_acc=0.2775 | val_f1=0.0869 | time=92.4s
[e3 b1/2315] loss=1.5542 avg=1.5542 it/s=626.1
[e3 b2/2315] loss=1.5984 avg=1.5763 it/s=363.5


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e3 b232/2315] loss=1.5297 avg=1.5759 it/s=421.7
[e3 b463/2315] loss=1.4875 avg=1.5737 it/s=422.3
[e3 b694/2315] loss=1.5618 avg=1.5750 it/s=424.6
[e3 b925/2315] loss=1.5809 avg=1.5749 it/s=428.1
[e3 b1156/2315] loss=1.6322 avg=1.5763 it/s=431.0
[e3 b1387/2315] loss=1.6224 avg=1.5765 it/s=432.9
[e3 b1618/2315] loss=1.5372 avg=1.5767 it/s=430.9
[e3 b1849/2315] loss=1.6010 avg=1.5766 it/s=424.0
[e3 b2080/2315] loss=1.5068 avg=1.5772 it/s=420.1
[e3 b2311/2315] loss=1.6376 avg=1.5763 it/s=418.2


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


Epoch 3/12 | loss=1.5762 | val_acc=0.2775 | val_f1=0.0869 | time=93.4s
[e4 b1/2315] loss=1.5334 avg=1.5334 it/s=411.0
[e4 b2/2315] loss=1.7078 avg=1.6206 it/s=443.7


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e4 b232/2315] loss=1.6972 avg=1.5719 it/s=406.0
[e4 b463/2315] loss=1.5466 avg=1.5753 it/s=405.9
[e4 b694/2315] loss=1.6076 avg=1.5763 it/s=406.9
[e4 b925/2315] loss=1.6387 avg=1.5747 it/s=407.8
[e4 b1156/2315] loss=1.5913 avg=1.5734 it/s=406.6
[e4 b1387/2315] loss=1.5058 avg=1.5740 it/s=408.4
[e4 b1618/2315] loss=1.6621 avg=1.5737 it/s=412.2
[e4 b1849/2315] loss=1.5889 avg=1.5749 it/s=415.1
[e4 b2080/2315] loss=1.4853 avg=1.5752 it/s=416.7
[e4 b2311/2315] loss=1.5928 avg=1.5758 it/s=411.6


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


Epoch 4/12 | loss=1.5758 | val_acc=0.2775 | val_f1=0.0869 | time=95.2s
[e5 b1/2315] loss=1.4977 avg=1.4977 it/s=371.9
[e5 b2/2315] loss=1.5720 avg=1.5348 it/s=427.9


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e5 b232/2315] loss=1.5121 avg=1.5688 it/s=392.3
[e5 b463/2315] loss=1.4893 avg=1.5723 it/s=397.9
[e5 b694/2315] loss=1.5403 avg=1.5724 it/s=404.6
[e5 b925/2315] loss=1.5469 avg=1.5730 it/s=405.8
[e5 b1156/2315] loss=1.5044 avg=1.5747 it/s=406.2
[e5 b1387/2315] loss=1.6834 avg=1.5754 it/s=409.3
[e5 b1618/2315] loss=1.5578 avg=1.5758 it/s=413.1
[e5 b1849/2315] loss=1.5486 avg=1.5772 it/s=413.6
[e5 b2080/2315] loss=1.5842 avg=1.5772 it/s=415.8
[e5 b2311/2315] loss=1.5270 avg=1.5771 it/s=417.7


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),
wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Early stopping at epoch 5


metric,history
epoch,▁▁▁▁▃▃▃▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆▆████████
lr,█▆▅▃▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▁▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▁▁▁▂▆▂▇▁▁▂▂▂█
time/epoch_sec,█▁▃▆▃
train/avg_loss_so_far,█▇▇▇▇▁▃▆▆▆▆▆▆▆▆▅▆▆▆▆▆▆▆▆█▆▆▆▆▆▆▆▃▄▆▆▆▆▆▆
train/epoch_loss,█▁▁▁▂
train/items_per_sec,▁▆▆▆▆▆▆▆▆▄██▇▇▇▇▇▇▇▇▇▇▇▆█▆▆▆▆▆▇▇▆▅▇▆▆▆▇▇

metric,value
best_checkpoint_path,checkpoints\best_mic...
best_val_f1,0.08688
epoch,5
lr,0.0056
params/ratio,0.2055
params/total,278813189
params/trainable,57297413
step,11571
time/epoch_sec,93.60675
train/avg_loss_so_far,1.57706


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),
Best trial: 13. Best value: 0.873252:  95%|█████████▌| 19/20 [3:36:46<09:27, 567.59s/it]

[Trial 18] f1=0.0869 | unfreeze_k=8 lr=9.02e-03 wd=7.0e-06 suggested_bs=64
[I 2025-08-16 06:00:17,417] Trial 18 finished with value: 0.08687713959680486 and parameters: {'num_unfreeze_last_layers': 8, 'lr': 0.009024939590437785, 'weight_decay': 7.007463850573143e-06, 'batch_size': 64}. Best is trial 13 with value: 0.8732520507590781.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  scaler = GradScaler(enabled=(DEVICE == "cuda" and USE_AMP))


Trainable params: 64,385,285 / 278,813,189 (23.09%) ; unfreeze_last_k=9
[e1 b1/2315] loss=1.5885 avg=1.5885 it/s=104.2


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e1 b2/2315] loss=1.6242 avg=1.6063 it/s=152.5
[e1 b232/2315] loss=1.4133 avg=1.5542 it/s=365.1
[e1 b463/2315] loss=0.9630 avg=1.4388 it/s=357.6
[e1 b694/2315] loss=1.3027 avg=1.3151 it/s=358.2
[e1 b925/2315] loss=1.5001 avg=1.2336 it/s=369.1
[e1 b1156/2315] loss=1.3670 avg=1.1842 it/s=375.7
[e1 b1387/2315] loss=1.0184 avg=1.1482 it/s=378.7
[e1 b1618/2315] loss=0.7770 avg=1.1379 it/s=382.0
[e1 b1849/2315] loss=1.3047 avg=1.1313 it/s=381.3
[e1 b2080/2315] loss=1.6167 avg=1.1463 it/s=380.4
[e1 b2311/2315] loss=1.7675 avg=1.1893 it/s=383.5


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


Epoch 1/12 | loss=1.1901 | val_acc=0.2410 | val_f1=0.0777 | time=101.3s
[e2 b1/2315] loss=1.7647 avg=1.7647 it/s=424.9
[e2 b2/2315] loss=1.5540 avg=1.6593 it/s=416.0


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e2 b232/2315] loss=1.7031 avg=1.5835 it/s=402.6
[e2 b463/2315] loss=1.4999 avg=1.5803 it/s=395.3
[e2 b694/2315] loss=1.5068 avg=1.5820 it/s=385.7
[e2 b925/2315] loss=1.4707 avg=1.5807 it/s=385.2
[e2 b1156/2315] loss=1.5956 avg=1.5807 it/s=382.1
[e2 b1387/2315] loss=1.6491 avg=1.5797 it/s=376.2
[e2 b1618/2315] loss=1.6159 avg=1.5801 it/s=373.9
[e2 b1849/2315] loss=1.6397 avg=1.5800 it/s=373.4
[e2 b2080/2315] loss=1.5258 avg=1.5794 it/s=375.2
[e2 b2311/2315] loss=1.6676 avg=1.5787 it/s=377.0


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


Epoch 2/12 | loss=1.5787 | val_acc=0.2775 | val_f1=0.0869 | time=103.0s
[e3 b1/2315] loss=1.4797 avg=1.4797 it/s=419.3
[e3 b2/2315] loss=1.5734 avg=1.5266 it/s=355.2


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e3 b232/2315] loss=1.5181 avg=1.5735 it/s=387.1
[e3 b463/2315] loss=1.3943 avg=1.5773 it/s=397.1
[e3 b694/2315] loss=1.5834 avg=1.5784 it/s=399.2
[e3 b925/2315] loss=1.5910 avg=1.5777 it/s=392.7
[e3 b1156/2315] loss=1.5814 avg=1.5783 it/s=387.3
[e3 b1387/2315] loss=1.5577 avg=1.5790 it/s=383.8
[e3 b1618/2315] loss=1.5417 avg=1.5789 it/s=383.0
[e3 b1849/2315] loss=1.5345 avg=1.5786 it/s=384.4
[e3 b2080/2315] loss=1.5718 avg=1.5789 it/s=385.7
[e3 b2311/2315] loss=1.5160 avg=1.5777 it/s=385.6


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


Epoch 3/12 | loss=1.5776 | val_acc=0.2775 | val_f1=0.0869 | time=101.2s
[e4 b1/2315] loss=1.4977 avg=1.4977 it/s=368.2
[e4 b2/2315] loss=1.6063 avg=1.5520 it/s=365.4


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e4 b232/2315] loss=1.5655 avg=1.5744 it/s=360.7
[e4 b463/2315] loss=1.5943 avg=1.5766 it/s=373.9
[e4 b694/2315] loss=1.5698 avg=1.5772 it/s=384.4
[e4 b925/2315] loss=1.4477 avg=1.5760 it/s=389.1
[e4 b1156/2315] loss=1.5392 avg=1.5762 it/s=392.8
[e4 b1387/2315] loss=1.6798 avg=1.5764 it/s=390.9
[e4 b1618/2315] loss=1.5141 avg=1.5758 it/s=385.2
[e4 b1849/2315] loss=1.5716 avg=1.5762 it/s=379.9
[e4 b2080/2315] loss=1.5255 avg=1.5767 it/s=381.6
[e4 b2311/2315] loss=1.5878 avg=1.5769 it/s=383.4


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


Epoch 4/12 | loss=1.5770 | val_acc=0.2775 | val_f1=0.0869 | time=101.3s
[e5 b1/2315] loss=1.5030 avg=1.5030 it/s=348.3
[e5 b2/2315] loss=1.6374 avg=1.5702 it/s=420.7


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e5 b232/2315] loss=1.7917 avg=1.5735 it/s=391.4
[e5 b463/2315] loss=1.5473 avg=1.5747 it/s=393.4
[e5 b694/2315] loss=1.5807 avg=1.5758 it/s=396.2
[e5 b925/2315] loss=1.5103 avg=1.5757 it/s=397.7
[e5 b1156/2315] loss=1.5807 avg=1.5750 it/s=401.5
[e5 b1387/2315] loss=1.4670 avg=1.5760 it/s=405.0
[e5 b1618/2315] loss=1.5665 avg=1.5761 it/s=407.0
[e5 b1849/2315] loss=1.6256 avg=1.5767 it/s=406.1
[e5 b2080/2315] loss=1.5934 avg=1.5771 it/s=401.3
[e5 b2311/2315] loss=1.5977 avg=1.5766 it/s=398.5


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


Epoch 5/12 | loss=1.5767 | val_acc=0.2775 | val_f1=0.0869 | time=98.1s
[e6 b1/2315] loss=1.5630 avg=1.5630 it/s=408.5
[e6 b2/2315] loss=1.5874 avg=1.5752 it/s=398.3


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e6 b232/2315] loss=1.6675 avg=1.5692 it/s=401.1
[e6 b463/2315] loss=1.5862 avg=1.5727 it/s=391.5
[e6 b694/2315] loss=1.4707 avg=1.5766 it/s=387.7
[e6 b925/2315] loss=1.6371 avg=1.5762 it/s=391.4
[e6 b1156/2315] loss=1.6673 avg=1.5755 it/s=391.5
[e6 b1387/2315] loss=1.5551 avg=1.5755 it/s=390.2
[e6 b1618/2315] loss=1.4295 avg=1.5762 it/s=393.1
[e6 b1849/2315] loss=1.5656 avg=1.5760 it/s=395.9
[e6 b2080/2315] loss=1.5262 avg=1.5763 it/s=397.4
[e6 b2311/2315] loss=1.5687 avg=1.5764 it/s=397.8


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),
wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Early stopping at epoch 6


metric,history
epoch,▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▄▄▄▄▄▄▅▅▅▅▅▇▇▇▇▇▇▇███████
lr,█▇▅▄▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▂▂▂▂▁▁▁▂▂▁▁▁▁▁▄▂▂▂▂▂▁▁▂▂▂▁▂▂▂▇▂▂▂▂▂▂█
time/epoch_sec,▆█▅▆▁▁
train/avg_loss_so_far,▆▆▄▃▂▁▁▁▂█▆▆▆▆▆▆▆▆▆▆▅▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆
train/epoch_loss,▁█████
train/items_per_sec,▁▆▆▇▇▇▇▇██▇▇▇▇▇▆▇▇▇▇▇▆▆▇▇▇▆█▇▇██▇█▇▇▇▇▇▇

metric,value
best_checkpoint_path,checkpoints\best_mic...
best_val_f1,0.08688
epoch,6
lr,0.0001
params/ratio,0.23093
params/total,278813189
params/trainable,64385285
step,13886
time/epoch_sec,98.2173
train/avg_loss_so_far,1.57637


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),
Best trial: 13. Best value: 0.873252: 100%|██████████| 20/20 [3:47:06<00:00, 681.32s/it]

[Trial 19] f1=0.0869 | unfreeze_k=9 lr=1.91e-04 wd=2.5e-06 suggested_bs=8
[I 2025-08-16 06:10:37,417] Trial 19 finished with value: 0.08687713959680486 and parameters: {'num_unfreeze_last_layers': 9, 'lr': 0.00019148701280271666, 'weight_decay': 2.4790993754320025e-06, 'batch_size': 8}. Best is trial 13 with value: 0.8732520507590781.
Best trial: 13 F1: 0.8732520507590781





# 📊 Discussion of Results

Looking at the full Optuna study, we can clearly split trials into two groups:

- ✅ **Good trials**: ~6 runs trained for the full 12 epochs and reached **val/F1 > 0.86**  
- ❌ **Bad trials**: the rest were early-stopped after 5–6 epochs at **F1 ≈ 0.08**, with validation accuracy stuck at 0.2775, meaning some hyperparameter combinations are not stable for fine-tuning mDeBERTa  
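
The F1 ≈ 0.08 value is what a model that has collapsed to predicting a single class would produce. A minimal check, assuming `val_f1` is macro-averaged over the 5 classes and that the predicted class makes up 27.75% of the validation set (the constant `val_acc=0.2775` in the logs). Both are assumptions, since the metric code is not shown in this section:

```python
# Macro-F1 of a classifier that always predicts the same class.
# Assumptions (not confirmed by the notebook): val_f1 is macro-averaged,
# and the predicted class covers 27.75% of the validation set.
num_classes = 5
class_share = 0.2775   # equals val_acc when only this class is ever predicted

precision = class_share    # every sample is predicted as this class
recall = 1.0               # all true members of the class are found
f1_predicted_class = 2 * precision * recall / (precision + recall)

# The other four classes are never predicted, so their F1 is 0.
macro_f1 = f1_predicted_class / num_classes
print(f"{macro_f1:.4f}")   # 0.0869
```

The result matches the logged `val_f1=0.0869`, which suggests these trials did not crash but degenerated into a constant predictor.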

---

## 🔍 Patterns we noticed
- **Learning rate (LR):**  
  - Best runs are always in the **1e-5 – 1e-4 range**  
  - Larger LRs → quick divergence and failure  
- **Weight decay:**  
  - Most runs with **WD > 1e-5** performed worse  
  - Small WD values (≈ `1e-5` or less) generally worked better, although the best run below used `9.4e-5`, so this trend is not strict  
- **Unfreezing layers:**  
  - Strong runs came from **8–12 layers unfrozen**  
  - The **absolute best run** unfroze **all 12 layers**, suggesting that deeper fine-tuning pays off for this dataset  
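
The learning-rate pattern can be checked against the last five trials logged above. This is a small sketch: the tuples are copied from the Optuna log lines for trials 15–19, while `COLLAPSE_F1` is a hypothetical cutoff we picked to flag collapsed runs.

```python
# (trial number, learning rate, validation F1) copied from the Optuna log
# lines for trials 15-19 of this study.
trials = [
    (15, 0.0005412386442534567, 0.08687713959680486),
    (16, 0.00021406994305558442, 0.08687713959680486),
    (17, 0.0018193251310341153, 0.08687713959680486),
    (18, 0.009024939590437785, 0.08687713959680486),
    (19, 0.00019148701280271666, 0.08687713959680486),
]
COLLAPSE_F1 = 0.10   # hypothetical cutoff separating collapsed runs
LR_UPPER = 1e-4      # upper end of the LR range reported as working well

collapsed = [(n, lr) for n, lr, f1 in trials if f1 < COLLAPSE_F1]
min_collapsed_lr = min(lr for _, lr in collapsed)

print("collapsed trials:", [n for n, _ in collapsed])
print(f"smallest LR among them: {min_collapsed_lr:.2e}")
print("all above 1e-4:", all(lr > LR_UPPER for _, lr in collapsed))
```

All five collapsed trials sit above `1e-4`. Five trials only illustrate the trend; a full check would iterate over `study.trials`.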

---

## 🏆 Best run
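The top score came from **Trial 2 of the first Optuna study** (the study logged above is the second one; its best was Trial 13 with val/F1 = 0.8733) with:  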
- **val/F1 = 0.88022**  
- LR = `3.5e-5`  
- Batch size = `8`  
- Weight decay = `9.4e-5`  
- Unfrozen layers = **12** (full encoder)  
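
The "full encoder" setting can be cross-checked against the trainable-parameter counts printed in the training logs for k = 8, 9 and 12. The sketch below assumes every unfrozen encoder layer contributes the same number of parameters; the variable names are ours.

```python
# Trainable-parameter counts printed by the training logs for different k.
trainable = {8: 57_297_413, 9: 64_385_285, 12: 85_648_901}
total = 278_813_189

per_layer = trainable[9] - trainable[8]           # cost of one extra encoder layer
always_trainable = trainable[8] - 8 * per_layer   # part that does not depend on k
predicted_k12 = always_trainable + 12 * per_layer

print(per_layer)                        # 7087872
print(predicted_k12 == trainable[12])   # True
print(f"{trainable[12] / total:.2%}")   # 30.72%
```

Even with all 12 layers unfrozen, only about 31% of the parameters are trainable. The remainder stays frozen, presumably the embedding table, which is an assumption because the freezing code is not shown in this section.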

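💡 This score is numerically close to the ~88.2 often quoted for mDeBERTa on English. As far as we know, that figure is XNLI accuracy (a different task and metric), so it is only a rough sanity check and not a direct benchmark for this 5-class sentiment setup.  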

---

## 🚀 Next steps
- Zoom in on the **LR sweet spot (1e-5 – 5e-5)**  
- Stick with **low weight decay (<1e-5)**  
- Keep **full unfreezing (12 layers)**, or at least ≥9  
- Retrain in this refined space to get a more stable checkpoint → then move on to the **clean vs noisy test comparison**  
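
The refined search space could be sketched as follows. The parameter names match those in the Optuna log (`lr`, `weight_decay`, `num_unfreeze_last_layers`, `batch_size`). The lower weight-decay bound of `1e-7` and the fixed batch size of 8 are our assumptions, and `MidpointTrial` is a hypothetical stand-in that only allows a dry run without Optuna installed.

```python
import math

def suggest_refined(trial):
    """Sample one configuration from the narrowed space listed above."""
    return {
        "lr": trial.suggest_float("lr", 1e-5, 5e-5, log=True),
        # Lower bound 1e-7 is an assumption; the text only says "< 1e-5".
        "weight_decay": trial.suggest_float("weight_decay", 1e-7, 1e-5, log=True),
        "num_unfreeze_last_layers": trial.suggest_int("num_unfreeze_last_layers", 9, 12),
        # Fixing the batch size to 8 is an assumption based on the best run.
        "batch_size": trial.suggest_categorical("batch_size", [8]),
    }

class MidpointTrial:
    """Hypothetical stub mimicking the trial methods used above."""
    def suggest_float(self, name, low, high, log=False):
        return math.sqrt(low * high) if log else (low + high) / 2
    def suggest_int(self, name, low, high):
        return (low + high) // 2
    def suggest_categorical(self, name, choices):
        return choices[0]

params = suggest_refined(MidpointTrial())
print(params)
```

In a real study, `suggest_refined(trial)` would be called inside the objective function and its result passed to `train_one_run`, in the same way as the original search.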


# 💾 Saving and Reloading Best Hyperparameters

After finishing the Optuna search, we don’t want to lose the best hyperparameters.  
To make the process reproducible, we **save them into a JSON file** so we can load them again later.  


In [7]:
import json, os
os.makedirs("checkpoints", exist_ok=True)

# best hparams from Optuna (only suggested ones live here)
best_hparams = study.best_trial.params

# train_one_run also expects epochs/patience; they were fixed (not suggested by
# Optuna), so add them explicitly:
best_hparams_complete = {
    **best_hparams,
    "epochs": FIXED_EPOCHS,       # fixed epoch cap used during the search
    "patience": FIXED_PATIENCE,   # fixed early-stopping patience used during the search
}
# Written to a separate file so the first study's best_hparams_optuna.json is kept.
hp_path = os.path.join("checkpoints", "best_hparams_optuna_2.json")
with open(hp_path, "w") as f:
    json.dump(best_hparams_complete, f, indent=2)
print("Saved best hparams to:", hp_path)
print("Best trial number:", study.best_trial.number, " value:", study.best_value)


Saved best hparams to: checkpoints\best_hparams_optuna_2.json
Best trial number: 13  value: 0.8732520507590781


## 🚀 Load & Train Best Hyperparameters

Now we load the saved hyperparameters from the best Optuna run. Note that this is `best_hparams_optuna.json` from the first study, not the `best_hparams_optuna_2.json` file written above.  
The best trial there reached **val/F1 = 0.88022** (trial 2), with LR ≈ 3.5e-5, weight decay ≈ 9.4e-5, batch size 8, and all 12 layers unfrozen.  
We will use these settings to retrain the model in a clean run and later evaluate it on the test set.


# Load Best Model EX4

In [7]:
import json, os
BASE_RUN_NAME = "microsoft/mdeberta-v3-base_full_ex_4"
with open(os.path.join("checkpoints", "best_hparams_optuna.json")) as f:
    best_hparams = json.load(f)

# give a distinct name for the final run
best_hparams["run_name"] = f"{BASE_RUN_NAME}_best_optuna_retrain"

# (optional) bump epochs here; see guidance below
best_hparams["epochs"] = 10      # epoch cap for the retrain (the search logged above ran with 12)
best_hparams["patience"] = 4     # unchanged

best_hparams


{'num_unfreeze_last_layers': 12,
 'lr': 3.496909962515421e-05,
 'weight_decay': 9.403805231949854e-05,
 'batch_size': 8,
 'epochs': 10,
 'patience': 4,
 'run_name': 'microsoft/mdeberta-v3-base_full_ex_4_best_optuna_retrain'}

In [9]:

# Retrain best config to get a clean checkpoint
best_ckpt, _ = train_one_run(best_hparams)
best_path = best_ckpt

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  scaler = GradScaler(enabled=(DEVICE == "cuda" and USE_AMP))
wandb: ERROR Failed to detect the name of this notebook. You can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: adishalit1 (adishalit1-tel-aviv-university) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


Trainable params: 85,648,901 / 278,813,189 (30.72%) ; unfreeze_last_k=12


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e1 b1/2315] loss=1.5977 avg=1.5977 it/s=30.4
[e1 b2/2315] loss=1.6135 avg=1.6056 it/s=44.5
[e1 b232/2315] loss=1.5696 avg=1.5877 it/s=315.9
[e1 b463/2315] loss=1.3081 avg=1.5419 it/s=332.9
[e1 b694/2315] loss=1.2160 avg=1.4361 it/s=341.6
[e1 b925/2315] loss=1.0303 avg=1.3292 it/s=345.2
[e1 b1156/2315] loss=1.4365 avg=1.2427 it/s=346.7
[e1 b1387/2315] loss=1.0453 avg=1.1761 it/s=349.4
[e1 b1618/2315] loss=0.6425 avg=1.1249 it/s=350.5
[e1 b1849/2315] loss=0.3707 avg=1.0793 it/s=349.3
[e1 b2080/2315] loss=0.5823 avg=1.0368 it/s=347.1
[e1 b2311/2315] loss=0.6791 avg=1.0018 it/s=346.2


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


Epoch 1/10 | loss=1.0015 | val_acc=0.7629 | val_f1=0.7729 | time=111.6s
[e2 b1/2315] loss=0.6484 avg=0.6484 it/s=285.2
[e2 b2/2315] loss=0.4790 avg=0.5637 it/s=292.5


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e2 b232/2315] loss=0.5898 avg=0.5615 it/s=356.4
[e2 b463/2315] loss=0.5841 avg=0.5753 it/s=359.0
[e2 b694/2315] loss=0.4356 avg=0.5742 it/s=355.0
[e2 b925/2315] loss=0.6761 avg=0.5715 it/s=352.6
[e2 b1156/2315] loss=0.4915 avg=0.5667 it/s=348.1
[e2 b1387/2315] loss=0.3386 avg=0.5631 it/s=350.5
[e2 b1618/2315] loss=0.8995 avg=0.5565 it/s=352.7
[e2 b1849/2315] loss=0.5608 avg=0.5487 it/s=353.2
[e2 b2080/2315] loss=0.6693 avg=0.5434 it/s=352.9
[e2 b2311/2315] loss=0.4592 avg=0.5396 it/s=352.4




Epoch 2/10 | loss=0.5397 | val_acc=0.8117 | val_f1=0.8161 | time=109.8s
[e3 b1/2315] loss=0.3494 avg=0.3494 it/s=318.7
[e3 b2/2315] loss=0.6386 avg=0.4940 it/s=325.4




[e3 b232/2315] loss=0.6826 avg=0.4223 it/s=344.5
[e3 b463/2315] loss=0.7476 avg=0.4230 it/s=345.5
[e3 b694/2315] loss=0.3135 avg=0.4291 it/s=345.5
[e3 b925/2315] loss=0.5662 avg=0.4208 it/s=348.9
[e3 b1156/2315] loss=0.6292 avg=0.4194 it/s=351.1
[e3 b1387/2315] loss=0.5126 avg=0.4187 it/s=352.3
[e3 b1618/2315] loss=0.1508 avg=0.4158 it/s=353.9
[e3 b1849/2315] loss=0.4077 avg=0.4167 it/s=355.1
[e3 b2080/2315] loss=0.3348 avg=0.4157 it/s=356.2
[e3 b2311/2315] loss=0.4474 avg=0.4147 it/s=354.6




Epoch 3/10 | loss=0.4144 | val_acc=0.8406 | val_f1=0.8455 | time=109.3s
[e4 b1/2315] loss=0.1458 avg=0.1458 it/s=331.7
[e4 b2/2315] loss=0.1385 avg=0.1422 it/s=361.0




[e4 b232/2315] loss=0.7823 avg=0.3397 it/s=347.7
[e4 b463/2315] loss=0.1889 avg=0.3351 it/s=345.1
[e4 b694/2315] loss=0.6018 avg=0.3302 it/s=348.9
[e4 b925/2315] loss=0.6548 avg=0.3313 it/s=348.9
[e4 b1156/2315] loss=0.7217 avg=0.3349 it/s=349.0
[e4 b1387/2315] loss=0.1390 avg=0.3356 it/s=351.1
[e4 b1618/2315] loss=0.3413 avg=0.3347 it/s=352.9
[e4 b1849/2315] loss=0.5281 avg=0.3331 it/s=352.8
[e4 b2080/2315] loss=0.6994 avg=0.3313 it/s=352.3
[e4 b2311/2315] loss=0.2343 avg=0.3319 it/s=351.6




Epoch 4/10 | loss=0.3318 | val_acc=0.8571 | val_f1=0.8608 | time=110.4s
[e5 b1/2315] loss=0.0911 avg=0.0911 it/s=314.9
[e5 b2/2315] loss=0.0329 avg=0.0620 it/s=318.3




[e5 b232/2315] loss=0.2877 avg=0.2698 it/s=308.2
[e5 b463/2315] loss=0.0988 avg=0.2716 it/s=313.3
[e5 b694/2315] loss=0.0654 avg=0.2780 it/s=319.9
[e5 b925/2315] loss=0.0156 avg=0.2767 it/s=323.2
[e5 b1156/2315] loss=0.3445 avg=0.2793 it/s=324.2
[e5 b1387/2315] loss=0.5519 avg=0.2792 it/s=326.5
[e5 b1618/2315] loss=0.1366 avg=0.2754 it/s=329.6
[e5 b1849/2315] loss=0.0335 avg=0.2742 it/s=331.4
[e5 b2080/2315] loss=0.1646 avg=0.2733 it/s=331.4
[e5 b2311/2315] loss=0.5911 avg=0.2717 it/s=330.9




Epoch 5/10 | loss=0.2718 | val_acc=0.8564 | val_f1=0.8597 | time=116.5s
[e6 b1/2315] loss=0.3264 avg=0.3264 it/s=362.1
[e6 b2/2315] loss=0.1784 avg=0.2524 it/s=365.8




[e6 b232/2315] loss=0.2758 avg=0.2333 it/s=354.6
[e6 b463/2315] loss=0.5765 avg=0.2383 it/s=351.0
[e6 b694/2315] loss=0.3511 avg=0.2383 it/s=350.5
[e6 b925/2315] loss=0.1891 avg=0.2292 it/s=348.4
[e6 b1156/2315] loss=0.0097 avg=0.2320 it/s=347.1
[e6 b1387/2315] loss=0.1764 avg=0.2315 it/s=349.5
[e6 b1618/2315] loss=0.9665 avg=0.2298 it/s=351.0
[e6 b1849/2315] loss=0.7269 avg=0.2280 it/s=351.9
[e6 b2080/2315] loss=0.0307 avg=0.2273 it/s=353.5
[e6 b2311/2315] loss=0.3955 avg=0.2269 it/s=354.6




Epoch 6/10 | loss=0.2271 | val_acc=0.8435 | val_f1=0.8471 | time=109.1s
[e7 b1/2315] loss=0.1288 avg=0.1288 it/s=319.8
[e7 b2/2315] loss=0.0141 avg=0.0715 it/s=370.1




[e7 b232/2315] loss=0.3033 avg=0.1611 it/s=354.6
[e7 b463/2315] loss=0.0198 avg=0.1728 it/s=349.7
[e7 b694/2315] loss=0.1329 avg=0.1760 it/s=348.5
[e7 b925/2315] loss=0.1271 avg=0.1807 it/s=351.7
[e7 b1156/2315] loss=0.2476 avg=0.1851 it/s=352.9
[e7 b1387/2315] loss=0.0093 avg=0.1865 it/s=354.5
[e7 b1618/2315] loss=0.3550 avg=0.1870 it/s=355.3
[e7 b1849/2315] loss=0.0303 avg=0.1844 it/s=356.0
[e7 b2080/2315] loss=0.0439 avg=0.1852 it/s=357.1
[e7 b2311/2315] loss=0.0064 avg=0.1845 it/s=358.3




Epoch 7/10 | loss=0.1844 | val_acc=0.8591 | val_f1=0.8640 | time=107.9s
[e8 b1/2315] loss=0.0114 avg=0.0114 it/s=287.5
[e8 b2/2315] loss=0.0111 avg=0.0113 it/s=331.4




[e8 b232/2315] loss=0.0040 avg=0.1512 it/s=363.5
[e8 b463/2315] loss=0.0079 avg=0.1689 it/s=357.3
[e8 b694/2315] loss=0.0034 avg=0.1681 it/s=349.5
[e8 b925/2315] loss=0.0100 avg=0.1653 it/s=342.9
[e8 b1156/2315] loss=0.3326 avg=0.1646 it/s=337.1
[e8 b1387/2315] loss=0.0347 avg=0.1602 it/s=334.6
[e8 b1618/2315] loss=0.0741 avg=0.1577 it/s=336.1
[e8 b1849/2315] loss=0.4028 avg=0.1563 it/s=335.6
[e8 b2080/2315] loss=0.8694 avg=0.1554 it/s=333.7
[e8 b2311/2315] loss=0.0094 avg=0.1546 it/s=334.3




Epoch 8/10 | loss=0.1545 | val_acc=0.8586 | val_f1=0.8628 | time=115.6s
[e9 b1/2315] loss=0.0048 avg=0.0048 it/s=480.7
[e9 b2/2315] loss=0.0041 avg=0.0045 it/s=402.2




[e9 b232/2315] loss=0.0029 avg=0.1017 it/s=358.5
[e9 b463/2315] loss=0.0048 avg=0.1248 it/s=360.8
[e9 b694/2315] loss=0.0027 avg=0.1183 it/s=347.4
[e9 b925/2315] loss=0.6540 avg=0.1184 it/s=342.8
[e9 b1156/2315] loss=0.0092 avg=0.1184 it/s=338.7
[e9 b1387/2315] loss=0.1919 avg=0.1209 it/s=338.4
[e9 b1618/2315] loss=0.2756 avg=0.1202 it/s=337.5
[e9 b1849/2315] loss=0.0063 avg=0.1228 it/s=339.7
[e9 b2080/2315] loss=0.0092 avg=0.1222 it/s=338.2
[e9 b2311/2315] loss=0.8849 avg=0.1234 it/s=336.8




Epoch 9/10 | loss=0.1233 | val_acc=0.8676 | val_f1=0.8713 | time=114.9s
[e10 b1/2315] loss=0.0022 avg=0.0022 it/s=298.2
[e10 b2/2315] loss=0.7794 avg=0.3908 it/s=287.6




[e10 b232/2315] loss=0.0031 avg=0.1081 it/s=352.8
[e10 b463/2315] loss=0.0052 avg=0.1049 it/s=358.1
[e10 b694/2315] loss=0.0366 avg=0.1007 it/s=352.3
[e10 b925/2315] loss=0.0026 avg=0.1015 it/s=351.9
[e10 b1156/2315] loss=0.0023 avg=0.1016 it/s=352.2
[e10 b1387/2315] loss=0.0039 avg=0.1013 it/s=348.6
[e10 b1618/2315] loss=0.4062 avg=0.1009 it/s=347.3
[e10 b1849/2315] loss=0.0101 avg=0.1008 it/s=348.6
[e10 b2080/2315] loss=0.1211 avg=0.1012 it/s=349.4
[e10 b2311/2315] loss=0.5247 avg=0.0987 it/s=350.6


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Epoch 10/10 | loss=0.0987 | val_acc=0.8635 | val_f1=0.8672 | time=110.3s


metric,history
epoch,▁▁▁▁▁▁▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▄▅▅▆▆▆▆▆▆▆▆▆▆▆▇▇███
lr,█▇▆▆▅▄▃▃▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▂▂▂▁▁▁▂▂▂▁▁▁▁▂▁▁▂▅▁▁▂▂▁▂▇▁▂▂█▁▂▂▁▂▂▁▁▂
time/epoch_sec,▄▃▂▃█▂▁▇▇▃
train/avg_loss_so_far,██▇▇▆▄▃▃▃▃▃▃▃▃▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▃▁▁▁
train/epoch_loss,█▄▃▃▂▂▂▁▁▁
train/items_per_sec,▁▁▆▆▆▆▆▆▆▅▆▆▆▆▆▆▅▆▆▆▆▆▆▆▆▆▆▆▆▆▆█▇▆▆▆▆▆▆▆

metric,value
best_checkpoint_path,checkpoints\best_mic...
best_val_f1,0.87125
epoch,10
lr,0
params/ratio,0.30719
params/total,278813189
params/trainable,85648901
step,23146
time/epoch_sec,110.2989
train/avg_loss_so_far,0.09875




## 💾 Save Best Model & Test

After loading the best hyperparameters, we retrained the model from the pretrained checkpoint to get a **clean checkpoint**.  
This ensures that the final model is trained only with the best settings found by Optuna (val/F1 ≈ 0.88022 during the search).  

For evaluation, we used the **clean translated test set**, since the model was also trained on **cleaned and translated data**.  
This way, the evaluation setup matches the training conditions.  

### 📊 Final Results (Validation)
- **Best validation F1:** 0.87125 (epoch 9)  
- **Validation Accuracy (last epoch):** 0.86346  
- **Validation Precision:** 0.86194  
- **Validation Recall:** 0.87403  
- **Validation F1 (last epoch):** 0.86723  

The best checkpoint was saved at:  
`checkpoints/best_microsoft_mdeberta-v3-base_full_ex_4_best_optuna_retrain.pt`  

### Training Stats
- Trainable parameters: **85.6M / 278.8M** (~30.7%)  
- Average training loss (final epoch): **0.0987**  
- Throughput: ~**350 items/sec**  
- Time per epoch: ~**110 sec**  

These results are in the same range as the ≈88.2 reported for **mDeBERTa-v3-base on English**, though that figure was measured on a different task and is only a rough reference point. Overall, the setup provides a strong baseline.
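
To illustrate how the best epoch is picked, here is a minimal sketch of patience-based early stopping replayed over the validation F1 values from the log above. The function name `run_early_stopping` and the two patience values are illustrative assumptions, not the notebook's actual training code.

```python
from typing import List, Tuple

# Validation macro-F1 per epoch, copied from the training log above.
VAL_F1 = [0.7729, 0.8161, 0.8455, 0.8608, 0.8597,
          0.8471, 0.8640, 0.8628, 0.8713, 0.8672]


def run_early_stopping(scores: List[float], patience: int) -> Tuple[int, float, int]:
    """Replay early stopping over per-epoch scores.

    Returns (best_epoch, best_score, last_epoch_run); epochs are 1-based.
    """
    best_epoch, best_score = 0, float("-inf")
    bad_epochs, last_epoch = 0, 0
    for epoch, score in enumerate(scores, start=1):
        last_epoch = epoch
        if score > best_score:
            # New best: this is where a checkpoint would be saved.
            best_epoch, best_score, bad_epochs = epoch, score, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_score, last_epoch


if __name__ == "__main__":
    for patience in (2, 4):  # hypothetical patience values
        print(patience, run_early_stopping(VAL_F1, patience))
```

With a patience of 4 the replay runs all 10 epochs and keeps epoch 9, which matches the logged best F1. With a patience of 2 it would stop after epoch 6 and keep epoch 4, missing the later improvement.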
___

## Test

In [11]:
# Optionally retrain the best config to get a clean checkpoint:
# best_ckpt, _ = train_one_run(best_params)
# best_path = best_ckpt
best_params = best_hparams  # reuse the best hyperparameters loaded earlier
# -------------------------
# Final evaluation on TEST (+ W&B logging)
# -------------------------
model = build_model(best_params["num_unfreeze_last_layers"])
model.load_state_dict(torch.load(best_path, map_location=DEVICE))
model.eval()

all_preds, all_labels = [], []
with torch.no_grad():
    for batch in test_loader:
        batch = {k: v.to(DEVICE, non_blocking=True) for k, v in batch.items()}
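        # torch.autocast with an explicit device_type avoids the FutureWarning
        # that the older torch.cuda.amp.autocast entry point emits.
        with torch.autocast(device_type="cuda",
                            dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),
                            enabled=(DEVICE == "cuda" and USE_AMP)):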
            logits = model(**batch).logits
        all_preds.extend(torch.argmax(logits, dim=-1).detach().cpu().tolist())
        all_labels.extend(batch["labels"].detach().cpu().tolist())

acc = accuracy_score(all_labels, all_preds)
p, r, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="macro", zero_division=0)
print(f"\nTEST | acc={acc:.4f} | f1_macro={f1:.4f} | precision_macro={p:.4f} | recall_macro={r:.4f}\n")

print("Per-class report (ids map to labels):")
print(ID2LABEL)
report = classification_report(
    all_labels, all_preds,
    target_names=[ID2LABEL[i] for i in range(len(ORDER))],
    zero_division=0, output_dict=True
)
print(classification_report(
    all_labels, all_preds,
    target_names=[ID2LABEL[i] for i in range(len(ORDER))],
    zero_division=0
))

# # ---- W&B: log test metrics, per-class scores, and confusion matrix ----
# test_run = wandb.init(project=PROJECT, name=f"{BASE_RUN_NAME}_test", resume="allow", reinit=True)
# log_payload = {
#     "test/acc": acc,
#     "test/precision_macro": p,
#     "test/recall_macro": r,
#     "test/f1_macro": f1,
# }
# for cls_name in ORDER:
#     if cls_name in report:
#         log_payload[f"test/{cls_name}/precision"] = report[cls_name]["precision"]
#         log_payload[f"test/{cls_name}/recall"]    = report[cls_name]["recall"]
#         log_payload[f"test/{cls_name}/f1"]        = report[cls_name]["f1-score"]
#
# wandb.log(log_payload)
#
# cm = confusion_matrix(all_labels, all_preds, labels=list(range(len(ORDER))))
# wandb.log({
#     "test/confusion_matrix": wandb.plot.confusion_matrix(
#         y_true=all_labels,
#         preds=all_preds,
#         class_names=[ID2LABEL[i] for i in range(len(ID2LABEL))]
#     )
# })
# test_run.summary["best_checkpoint_path"] = best_path
# test_run.summary["test_f1_macro"] = f1
# wandb.finish()


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



TEST | acc=0.8662 | f1_macro=0.8685 | precision_macro=0.8680 | recall_macro=0.8697

Per-class report (ids map to labels):
{0: 'extremely negative', 1: 'negative', 2: 'neutral', 3: 'positive', 4: 'extremely positive'}
                    precision    recall  f1-score   support

extremely negative       0.89      0.89      0.89       592
          negative       0.85      0.88      0.87      1041
           neutral       0.85      0.87      0.86       619
          positive       0.88      0.81      0.84       947
extremely positive       0.87      0.89      0.88       599

          accuracy                           0.87      3798
         macro avg       0.87      0.87      0.87      3798
      weighted avg       0.87      0.87      0.87      3798



# ✅ Test Results – Clean Translated Set

When evaluating on the **clean translated test set** (same distribution as training data), the model reached:

- 🎯 **Accuracy:** 0.8662  
- 📊 **Macro F1:** 0.8685  
- 🧮 **Macro Precision:** 0.8680  
- 🔄 **Macro Recall:** 0.8697  

---

### 🔎 Per-class breakdown
- 😡 **Extremely Negative:** F1 = **0.89** → very strong & consistent  
- 🙁 **Negative:** F1 = **0.87**  
- 😐 **Neutral:** F1 = **0.86**  
- 🙂 **Positive:** F1 = **0.84** → slightly weaker, driven by lower recall (0.81)  
- 🤩 **Extremely Positive:** F1 = **0.88** → very strong & consistent  

---

### 📌 Takeaway
The model performs **best on the extreme sentiment classes** (extremely negative / extremely positive), where the signal is clearer.  
The **positive** class is the weakest, mainly because of its lower recall. This makes sense, since milder classes such as positive and neutral are often harder to separate in real tweets.  
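
As a quick sanity check on the numbers above, the sketch below recomputes the macro and weighted F1 from the per-class rows of the classification report. The helper names are illustrative. Because the inputs are the two-decimal values printed in the report, the recomputed figures agree with the full-precision metrics only up to rounding.

```python
from typing import Dict, Tuple

# (precision, recall, f1, support) per class, copied from the report above.
REPORT: Dict[str, Tuple[float, float, float, int]] = {
    "extremely negative": (0.89, 0.89, 0.89, 592),
    "negative":           (0.85, 0.88, 0.87, 1041),
    "neutral":            (0.85, 0.87, 0.86, 619),
    "positive":           (0.88, 0.81, 0.84, 947),
    "extremely positive": (0.87, 0.89, 0.88, 599),
}


def f1_from_pr(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(report: Dict[str, Tuple[float, float, float, int]]) -> float:
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(row[2] for row in report.values()) / len(report)


def weighted_f1(report: Dict[str, Tuple[float, float, float, int]]) -> float:
    """Support-weighted mean of per-class F1: larger classes count more."""
    total = sum(row[3] for row in report.values())
    return sum(row[2] * row[3] for row in report.values()) / total


if __name__ == "__main__":
    print(f"positive F1 from P/R: {f1_from_pr(0.88, 0.81):.4f}")
    print(f"macro F1:    {macro_f1(REPORT):.4f}")
    print(f"weighted F1: {weighted_f1(REPORT):.4f}")
```

The positive class shows how a recall of 0.81 pulls F1 down to about 0.84 even though precision is 0.88. Macro averaging is the metric used throughout this notebook because it does not let the larger classes (negative, positive) dominate the score.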


## Export: Model, Tokenizer, Metrics & Artifacts (Exercise 4)

We export the **retrained best model** and all artifacts so the run is fully reproducible and easy to reload elsewhere.

**What gets saved:**
- 🤗 **HF model + tokenizer** (config + weights + tokenizer files)
- 🧠 **Raw best state_dict** snapshot (`best_state_dict.pt`) from early stopping
- 📊 **Test metrics** (`test_metrics.json`)
- 📈 **Classification report** CSV and **confusion matrix** CSV
- 🏷️ **Label mapping** (`labels.json`) with `order`, `label2id`, `id2label`
- ⚙️ **Best hyperparameters** (`best_hparams_ex4.json`)
- 📄 **README** with reload instructions

**Output directory:**
`ex_4_model/microsoft__mdeberta-v3-base_full_ex_4_YYYYMMDD_HHMMSS`

This makes it simple to reload the exact model and tokenizer later, or to load the raw `state_dict` into a fresh model with the same head and label mapping.
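
One detail matters when reading `labels.json` back: JSON object keys are always strings, so the integer keys of `id2label` come back as `"0"`, `"1"`, and so on. The sketch below shows the round trip and a small loader that restores integer keys. The function names `dump_labels` and `load_labels` are illustrative and not part of the export cell.

```python
import json
from typing import Dict, List, Tuple

ORDER: List[str] = ["extremely negative", "negative", "neutral",
                    "positive", "extremely positive"]
LABEL2ID: Dict[str, int] = {lab: i for i, lab in enumerate(ORDER)}
ID2LABEL: Dict[int, str] = {i: lab for lab, i in LABEL2ID.items()}


def dump_labels() -> str:
    """Serialize the label mapping the same way the export cell does."""
    return json.dumps(
        {"order": ORDER, "label2id": LABEL2ID, "id2label": ID2LABEL},
        indent=2,
    )


def load_labels(text: str) -> Tuple[List[str], Dict[str, int], Dict[int, str]]:
    """Parse the mapping and convert id2label keys back to int."""
    raw = json.loads(text)
    id2label = {int(k): v for k, v in raw["id2label"].items()}
    return raw["order"], raw["label2id"], id2label


if __name__ == "__main__":
    text = dump_labels()
    print(list(json.loads(text)["id2label"].keys()))  # string keys after parsing
    order, label2id, id2label = load_labels(text)
    print(id2label[3])
```

Without the `int(k)` conversion on load, a lookup such as `id2label[3]` on the parsed dictionary would raise a `KeyError`.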


In [12]:
# === Save full EX.4 export (model, tokenizer, metrics, hparams) ===
import os, json, time, shutil
from pathlib import Path
import pandas as pd
from sklearn.metrics import confusion_matrix

# Root folder + unique subdir (so you can export multiple times safely)
timestamp = time.strftime("%Y%m%d_%H%M%S")
run_stub  = (BASE_RUN_NAME if 'BASE_RUN_NAME' in globals() else 'ex4').replace("/", "__").replace("\\", "__")
export_dir = os.path.join("ex_4_model", f"{run_stub}_{timestamp}")
os.makedirs(export_dir, exist_ok=True)

# 1) Save the model in Hugging Face format + tokenizer
# (Your current `model` already has id2label/label2id in its config from build_model)
model.save_pretrained(export_dir)          # writes config.json + model weights (safetensors or .bin, depending on the transformers version)
tokenizer.save_pretrained(export_dir)      # writes tokenizer files into same folder

# Also keep the raw state_dict snapshot that early stopping wrote
try:
    shutil.copy2(best_path, os.path.join(export_dir, "best_state_dict.pt"))
except Exception as e:
    print(f"[export] Warning: couldn't copy raw state_dict from {best_path}: {e}")

# 2) Save test artifacts (report CSV, confusion matrix CSV, metrics JSON)
rep_df = pd.DataFrame(report).transpose()
cm     = confusion_matrix(all_labels, all_preds, labels=list(range(len(ORDER))))
cm_df  = pd.DataFrame(cm, index=[f"true_{c}" for c in ORDER], columns=[f"pred_{c}" for c in ORDER])

rep_path   = os.path.join(export_dir, "classification_report_test.csv")
cm_path    = os.path.join(export_dir, "confusion_matrix_test.csv")
metrics_js = os.path.join(export_dir, "test_metrics.json")

rep_df.to_csv(rep_path, index=True)
cm_df.to_csv(cm_path, index=True)
with open(metrics_js, "w", encoding="utf-8") as f:
    json.dump(
        {
            "test_accuracy": float(acc),
            "test_precision_macro": float(p),
            "test_recall_macro": float(r),
            "test_f1_macro": float(f1),
        },
        f,
        indent=2,
    )

# 3) Save label mapping + order (useful when reloading elsewhere)
labels_js = os.path.join(export_dir, "labels.json")
with open(labels_js, "w", encoding="utf-8") as f:
    json.dump(
        {
            "order": ORDER,
            "label2id": LABEL2ID,
            "id2label": {int(k): v for k, v in ID2LABEL.items()},
        },
        f,
        indent=2,
    )

# 4) Save best hyperparameters that produced the checkpoint
hparams = {}
if 'best_params'   in globals(): hparams = best_params
elif 'best_hparams' in globals(): hparams = best_hparams

hparams_js = os.path.join(export_dir, "best_hparams_ex4.json")
with open(hparams_js, "w", encoding="utf-8") as f:
    json.dump(hparams, f, indent=2)

# 5) Tiny README with reload instructions
readme_txt = os.path.join(export_dir, "README.txt")
with open(readme_txt, "w", encoding="utf-8") as f:
    f.write(
        "Exercise-4 export\n"
        f"Model: {MODEL_NAME}\n"
        f"Exported at: {timestamp}\n"
        f"Run stub: {run_stub}\n"
        f"Labels (order): {ORDER}\n"
        f"Best hparams: {hparams}\n"
        "\n"
        "How to reload:\n"
        "----------------\n"
        "from transformers import AutoModelForSequenceClassification, AutoTokenizer\n"
        f"model = AutoModelForSequenceClassification.from_pretrained(r\"{export_dir}\")\n"
        f"tokenizer = AutoTokenizer.from_pretrained(r\"{export_dir}\")\n"
        "\n"
        "You can also load the raw state_dict (best_state_dict.pt) into a fresh model created\n"
        "with the same head and id2label/label2id mapping.\n"
    )

print("\n=== EX.4 EXPORT DONE ===")
print("Saved to:", export_dir)
print(" • HF model + tokenizer")
print(" • Original best state_dict snapshot →", os.path.join(export_dir, "best_state_dict.pt"))
print(" • Test metrics JSON →", metrics_js)
print(" • Classification report CSV →", rep_path)
print(" • Confusion matrix CSV →", cm_path)
print(" • Label mapping JSON →", labels_js)
print(" • Best hparams JSON →", hparams_js)



=== EX.4 EXPORT DONE ===
Saved to: ex_4_model\microsoft__mdeberta-v3-base_full_ex_4_20250817_133620
 • HF model + tokenizer
 • Original best state_dict snapshot → ex_4_model\microsoft__mdeberta-v3-base_full_ex_4_20250817_133620\best_state_dict.pt
 • Test metrics JSON → ex_4_model\microsoft__mdeberta-v3-base_full_ex_4_20250817_133620\test_metrics.json
 • Classification report CSV → ex_4_model\microsoft__mdeberta-v3-base_full_ex_4_20250817_133620\classification_report_test.csv
 • Confusion matrix CSV → ex_4_model\microsoft__mdeberta-v3-base_full_ex_4_20250817_133620\confusion_matrix_test.csv
 • Label mapping JSON → ex_4_model\microsoft__mdeberta-v3-base_full_ex_4_20250817_133620\labels.json
 • Best hparams JSON → ex_4_model\microsoft__mdeberta-v3-base_full_ex_4_20250817_133620\best_hparams_ex4.json


In [10]:
# ---- W&B: log test metrics, per-class scores, and confusion matrix ----
test_run = wandb.init(project=PROJECT, name=f"{BASE_RUN_NAME}_test", resume="allow", reinit=True)
log_payload = {
    "test/acc": acc,
    "test/precision_macro": p,
    "test/recall_macro": r,
    "test/f1_macro": f1,
}
for cls_name in ORDER:
    if cls_name in report:
        log_payload[f"test/{cls_name}/precision"] = report[cls_name]["precision"]
        log_payload[f"test/{cls_name}/recall"]    = report[cls_name]["recall"]
        log_payload[f"test/{cls_name}/f1"]        = report[cls_name]["f1-score"]

wandb.log(log_payload)

cm = confusion_matrix(all_labels, all_preds, labels=list(range(len(ORDER))))
wandb.log({
    "test/confusion_matrix": wandb.plot.confusion_matrix(
        y_true=all_labels,
        preds=all_preds,
        class_names=[ID2LABEL[i] for i in range(len(ID2LABEL))]
    )
})
test_run.summary["best_checkpoint_path"] = best_path
test_run.summary["test_f1_macro"] = f1
wandb.finish()


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


metric,history
test/acc,▁
test/extremely negative/f1,▁
test/extremely negative/precision,▁
test/extremely negative/recall,▁
test/extremely positive/f1,▁
test/extremely positive/precision,▁
test/extremely positive/recall,▁
test/f1_macro,▁
test/negative/f1,▁
test/negative/precision,▁

metric,value
best_checkpoint_path,checkpoints\best_mic...
test/acc,0.85703
test/extremely negative/f1,0.86469
test/extremely negative/precision,0.82192
test/extremely negative/recall,0.91216
test/extremely positive/f1,0.89069
test/extremely positive/precision,0.86478
test/extremely positive/recall,0.9182
test/f1_macro,0.85991
test/negative/f1,0.84676


# 🚀 Exercise 5 – Training with HuggingFace Libraries

In this part we re-implement our training pipeline using the **HuggingFace `transformers` library**.  
Instead of writing a custom training loop, we use the built-in **`Trainer` API**, which makes fine-tuning large models simpler and more reproducible.  

### 🔑 What’s different here
- Use of **`TrainingArguments`** and **`Trainer`** for training, evaluation, and logging.  
- Automatic support for **mixed precision (fp16/bf16)**, **gradient accumulation**, and **checkpointing**.  
- Direct integration with **evaluation metrics (accuracy, F1, precision, recall)**.  
- Cleaner and more modular workflow compared to Exercise 4.  

### 🎯 Goal
Train the **mDeBERTa-v3-base** model on our **clean translated dataset**, with the best hyperparameters found earlier, and validate that the HuggingFace training pipeline gives results consistent with our custom Optuna-based loop.  
___

# 🚀 Ex.5 – First HF Trainer Run (Best Hyperparameters from Before)

In this section we move from our **custom training loop** to the **HuggingFace `Trainer` API**,  
using the *best hyperparameters we found earlier with Optuna*:

- Model: **mDeBERTa-v3-base**  
- Batch size = 16  
- LR = 3e-5  
- Weight decay = 0.05  
- Epochs = 12  
- Unfreeze last **10 layers**  

We also keep all the nice features:
- ✅ `wandb` logging  
- ✅ Early stopping via patience  
- ✅ Mixed precision (fp16/bf16)  
- ✅ Print-friendly callback (for live tracking)  

👉 The goal here is to reproduce the strong setup we had before, but now inside HuggingFace’s official training loop.  
Later, we will **discuss why the results weren’t as good as expected** and what might explain the gap compared to our previous loop.  
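
To make the hyperparameters above concrete, the sketch below works out the step counts and the learning-rate curve they imply. It assumes the 37,039 training examples and the warmup ratio of 0.06 used in the code cells of this section, a linear warmup followed by linear decay (to our knowledge the `Trainer` default), and warmup steps rounded up. The function `linear_schedule_with_warmup` is a plain-Python stand-in written for illustration, not the library implementation.

```python
import math

TRAIN_EXAMPLES = 37_039   # size of the training split in this notebook
BATCH_SIZE = 16
EPOCHS = 12
LR = 3e-5
WARMUP_RATIO = 0.06       # assumption: taken from the constants in the code cell

# Optimizer steps per epoch (no gradient accumulation, last batch kept).
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(WARMUP_RATIO * total_steps)


def linear_schedule_with_warmup(step: int, base_lr: float,
                                warmup: int, total: int) -> float:
    """Linear ramp from 0 to base_lr, then linear decay back to 0."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    remaining = max(0, total - step)
    return base_lr * remaining / max(1, total - warmup)


if __name__ == "__main__":
    print(f"steps/epoch={steps_per_epoch} total={total_steps} warmup={warmup_steps}")
    for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
        lr = linear_schedule_with_warmup(step, LR, warmup_steps, total_steps)
        print(f"step {step:>6}: lr={lr:.2e}")
```

Under these assumptions the peak learning rate of 3e-5 is reached well inside the first epoch, and the rate decays for the remaining epochs. If early stopping ends the run before epoch 12, the schedule never reaches zero.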



# Training EX5

In [None]:
# --- Ex.5: HF Trainer version with W&B + prints (Windows/RTX 4090 friendly) ---

import os, math, random, time, json
from typing import Dict, Any

import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report, confusion_matrix

import wandb
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    DataCollatorWithPadding, Trainer, TrainingArguments
)
from transformers.trainer_callback import TrainerCallback, TrainerState, TrainerControl

# ---- Fast CUDA defaults (4090) ----
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
if torch.cuda.is_available():
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    try:
        torch.set_float32_matmul_precision("high")
    except Exception:
        pass

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
SEED = 42
def set_seed(s=42):
    random.seed(s); np.random.seed(s)
    torch.manual_seed(s); torch.cuda.manual_seed_all(s)
    torch.backends.cudnn.deterministic = False
    torch.backends.cudnn.benchmark = True
set_seed(SEED)

# ---- W&B defaults ----
os.environ.setdefault("WANDB_MODE", "online")
os.environ.setdefault("WANDB_PROJECT", "adv-dl-p2")
os.environ.setdefault("WANDB_NOTEBOOK_NAME", "ex5_trainer.ipynb")
WANDB_PROJECT = os.environ["WANDB_PROJECT"]

# ---- Constants ----
MODEL_NAME = "microsoft/mdeberta-v3-base"
BASE_RUN_NAME = MODEL_NAME.replace("/", "__") + "_ex5_trainer"
MAX_LEN = 512
BATCH_SIZE = 16
EPOCHS = 12
PATIENCE = 4
LR = 3e-5
WEIGHT_DECAY = 0.05
WARMUP_RATIO = 0.06
GRAD_ACCUM = 1
NUM_WORKERS = 0  # Windows-safe; raise to 2 if stable

# ---- Label mapping (5-way) ----
CANON = {
    "extremely negative": "extremely negative",
    "negative": "negative",
    "neutral": "neutral",
    "positive": "positive",
    "extremely positive": "extremely positive",
}
ORDER = ["extremely negative","negative","neutral","positive","extremely positive"]
LABEL2ID = {lab: i for i, lab in enumerate(ORDER)}
ID2LABEL = {i: lab for lab, i in LABEL2ID.items()}

def normalize_label(s: str) -> str:
    s = str(s).strip().lower()
    s = s.replace("very negative", "extremely negative")
    s = s.replace("very positive", "extremely positive")
    s = s.replace("extreme negative", "extremely negative")
    s = s.replace("extreme positive", "extremely positive")
    return CANON.get(s, s)

# ---- Prep dataframes from df_train/df_test already in memory ----
assert "OriginalTweet" in df_train.columns and "Sentiment" in df_train.columns
assert "OriginalTweet" in df_test.columns and "Sentiment" in df_test.columns

def prep_df(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna(subset=["OriginalTweet", "Sentiment"]).copy()
    df["text"] = df["OriginalTweet"].astype(str).str.strip()
    df["label_name"] = df["Sentiment"].apply(normalize_label)
    df = df[df["label_name"].isin(ORDER)].reset_index(drop=True)
    df["labels"] = df["label_name"].map(LABEL2ID)
    return df[["text", "labels", "label_name"]]

dftrain_ = prep_df(df_train)
dftest_  = prep_df(df_test)

train_df, val_df = train_test_split(
    dftrain_, test_size=0.1, stratify=dftrain_["labels"], random_state=SEED
)

print(f"Train/Val/Test sizes: {len(train_df)}/{len(val_df)}/{len(dftest_)}")


In [14]:
from datasets import Dataset
from transformers import DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

def tok_fn(batch):
    return tokenizer(batch["text"], padding=False, truncation=True, max_length=MAX_LEN)

def keep_only(ds, cols):
    # Use select_columns if available, else remove the others
    if hasattr(ds, "select_columns"):
        return ds.select_columns(cols)
    drop = [c for c in ds.column_names if c not in cols]
    return ds.remove_columns(drop)

# Build datasets (avoid adding __index_level_0__ with preserve_index=False)
train_ds = Dataset.from_pandas(train_df,  preserve_index=False)
val_ds   = Dataset.from_pandas(val_df,    preserve_index=False)
test_ds  = Dataset.from_pandas(dftest_,   preserve_index=False)

# Keep just text + labels before tokenization
cols_keep = ["text", "labels"]
train_ds = keep_only(train_ds, cols_keep)
val_ds   = keep_only(val_ds,   cols_keep)
test_ds  = keep_only(test_ds,  cols_keep)

# Tokenize
train_ds = train_ds.map(tok_fn, batched=True)
val_ds   = val_ds.map(tok_fn,   batched=True)
test_ds  = test_ds.map(tok_fn,  batched=True)

# Remove raw text after tokenization, keep labels + token IDs
train_ds = train_ds.remove_columns(["text"])
val_ds   = val_ds.remove_columns(["text"])
test_ds  = test_ds.remove_columns(["text"])

# Ensure final columns are exactly what Trainer expects
final_cols = ["input_ids", "attention_mask", "labels"]
train_ds = keep_only(train_ds, final_cols)
val_ds   = keep_only(val_ds,   final_cols)
test_ds  = keep_only(test_ds,  final_cols)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, pad_to_multiple_of=8)


Map: 100%|██████████| 37039/37039 [00:01<00:00, 28196.86 examples/s]
Map: 100%|██████████| 4116/4116 [00:00<00:00, 37462.23 examples/s]
Map: 100%|██████████| 3798/3798 [00:00<00:00, 36229.75 examples/s]
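
The three dataset sizes in the progress bars can be checked by hand. The sketch below assumes that a fractional `test_size` is rounded up to get the validation size and that the training split receives the remainder, which is our understanding of how `train_test_split` behaves. The total of 41,155 labelled training rows is inferred from the two split sizes shown above, and the helper names are illustrative.

```python
import math
from typing import Tuple


def split_sizes(n_samples: int, test_size: float) -> Tuple[int, int]:
    """Return (n_train, n_val) for a fractional hold-out split."""
    n_val = math.ceil(test_size * n_samples)  # hold-out size is rounded up
    return n_samples - n_val, n_val


def batches_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(n_samples / batch_size)


if __name__ == "__main__":
    n_train, n_val = split_sizes(41_155, 0.1)
    print(f"train={n_train} val={n_val}")
    print(f"train batches/epoch @16: {batches_per_epoch(n_train, 16)}")
    print(f"val batches @16:         {batches_per_epoch(n_val, 16)}")
```

With a batch size of 16, the training split yields the same 2,315 batches per epoch that appear in the Exercise 4 training log.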


In [15]:
from transformers import AutoModelForSequenceClassification

UNFREEZE_LAST_K = 10  # 1..12 are sensible for DeBERTa-v3-base

def build_model(unfreeze_last_k=UNFREEZE_LAST_K):
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=len(ORDER),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
        torch_dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else None),
        use_safetensors=True,
    )
    base = getattr(model, "deberta", None) or getattr(model, "roberta", None) or getattr(model, "bert", None)
    if base is not None and hasattr(base, "encoder") and hasattr(base.encoder, "layer"):
        # freeze all
        for p in base.parameters(): p.requires_grad = False
        # unfreeze last k transformer blocks
        for layer in base.encoder.layer[-int(unfreeze_last_k):]:
            for p in layer.parameters(): p.requires_grad = True
    # classifier head always trainable
    for p in model.classifier.parameters(): p.requires_grad = True
    return model.to(DEVICE)

model = build_model()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable params: {trainable:,} / {total:,} ({100*trainable/total:.2f}%) ; unfreeze_last_k={UNFREEZE_LAST_K}")


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 71,473,157 / 278,813,189 (25.63%) ; unfreeze_last_k=10
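
The trainable-parameter count above can be reproduced with simple arithmetic. The sketch below assumes a base-size layout (hidden size 768, feed-forward size 3072), four biased linear projections plus a LayerNorm in each attention block, and a head made of the pooler dense layer plus the 5-way classifier. These layout details are assumptions on our part; they are included because they reproduce the logged counts exactly.

```python
HIDDEN = 768          # assumed hidden size of the base model
INTERMEDIATE = 3072   # assumed feed-forward size
NUM_LABELS = 5
TOTAL_PARAMS = 278_813_189  # copied from the log above


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def layer_norm_params(n: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


def encoder_layer_params() -> int:
    # Attention: query, key, value and output projections + LayerNorm.
    attention = 4 * linear_params(HIDDEN, HIDDEN) + layer_norm_params(HIDDEN)
    # Feed-forward: up-projection, down-projection + LayerNorm.
    feed_forward = (linear_params(HIDDEN, INTERMEDIATE)
                    + linear_params(INTERMEDIATE, HIDDEN)
                    + layer_norm_params(HIDDEN))
    return attention + feed_forward


def head_params() -> int:
    # Pooler dense layer + classifier. The pooler sits outside the frozen
    # base encoder, so it stays trainable (assumption consistent with the log).
    return linear_params(HIDDEN, HIDDEN) + linear_params(HIDDEN, NUM_LABELS)


def trainable_params(unfreeze_last_k: int) -> int:
    return unfreeze_last_k * encoder_layer_params() + head_params()


if __name__ == "__main__":
    for k in (10, 12):
        n = trainable_params(k)
        print(f"k={k}: {n:,} trainable ({100 * n / TOTAL_PARAMS:.2f}%)")
```

Each unfrozen layer adds about 7.1M trainable parameters, so going from 10 to 12 layers accounts for the difference between the 25.63% here and the 30.72% logged in the Exercise 4 run. Most of the remaining, always-frozen parameters presumably sit in the embedding matrix.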


In [16]:
import evaluate
metric_acc = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    acc = metric_acc.compute(predictions=preds, references=labels)["accuracy"]
    p, r, f1, _ = precision_recall_fscore_support(labels, preds, average="macro", zero_division=0)
    return {"accuracy": acc, "precision": p, "recall": r, "f1": f1}

class PrintAndWBCallback(TrainerCallback):
    def __init__(self, print_every=20):
        self.print_every = print_every
        self.steps_per_epoch = None
        self.last_print_step = -1

    def on_train_begin(self, args, state: TrainerState, control: TrainerControl, **kwargs):
        # steps/epoch is not known yet; it is derived later in on_step_begin
        print(f"[Run] epochs={args.num_train_epochs} bs={args.per_device_train_batch_size} "
              f"lr={args.learning_rate:.2e} wd={args.weight_decay:.1e} "
              f"warmup_ratio={args.warmup_ratio} grad_accum={args.gradient_accumulation_steps}")

    def on_step_begin(self, args, state: TrainerState, control: TrainerControl, **kwargs):
        # set steps_per_epoch once we have dataloader info exposed in state
        if self.steps_per_epoch is None and state.max_steps is not None and state.num_train_epochs:
            approx_steps_total = state.max_steps
            self.steps_per_epoch = max(1, int(approx_steps_total / math.ceil(state.num_train_epochs)))

    def on_log(self, args, state: TrainerState, control: TrainerControl, logs=None, **kwargs):
        if logs is None: return
        # mimic: [e1 b123/...] loss=... lr=...
        # on_log already fires every `logging_steps`, so gate on the global step;
        # gating on the in-epoch step would silence epochs whose starting step
        # is not a multiple of `print_every`.
        if self.steps_per_epoch:
            step_in_epoch = (state.global_step % self.steps_per_epoch) or self.steps_per_epoch
            if state.global_step % self.print_every == 0 and state.global_step != self.last_print_step:
                epoch_idx = max(1, math.ceil(state.epoch or 0))  # 1-based epoch in progress
                loss = logs.get("loss", logs.get("train_loss", None))
                lr = logs.get("learning_rate", None)
                sps = logs.get("train_samples_per_second", None) or logs.get("samples_per_second", None)
                txt = f"[e{epoch_idx} b{step_in_epoch}/{self.steps_per_epoch}]"
                if loss is not None: txt += f" loss={loss:.4f}"
                if lr   is not None: txt += f" lr={lr:.2e}"
                if sps  is not None: txt += f" samp/s={sps:.1f}"
                print(txt, flush=True)
                self.last_print_step = state.global_step

    def on_evaluate(self, args, state: TrainerState, control: TrainerControl, metrics, **kwargs):
        if metrics:
            msg = (f"[val @ epoch {int(state.epoch or 0)}] "
                   f"acc={metrics.get('eval_accuracy',0):.4f} "
                   f"f1={metrics.get('eval_f1',0):.4f} "
                   f"p={metrics.get('eval_precision',0):.4f} "
                   f"r={metrics.get('eval_recall',0):.4f}")
            print(msg, flush=True)

safe_name = BASE_RUN_NAME
out_dir = os.path.join("hf_ckpts", safe_name)
os.makedirs(out_dir, exist_ok=True)

bf16_ok = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

training_args = TrainingArguments(
    output_dir=out_dir,
    run_name=safe_name,
    report_to=["wandb"],
    learning_rate=LR,
    weight_decay=WEIGHT_DECAY,
    warmup_ratio=WARMUP_RATIO,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUM,
    dataloader_num_workers=NUM_WORKERS,
    fp16=not bf16_ok and torch.cuda.is_available(),
    bf16=bf16_ok,
    logging_strategy="steps",
    logging_steps=20,
    eval_strategy="epoch",            # (new name replacing evaluation_strategy)
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    save_total_limit=2,
)

wandb_run = wandb.init(
    project=WANDB_PROJECT,
    name=safe_name,
    config={
        "model": MODEL_NAME,
        "max_len": MAX_LEN,
        "batch_size": BATCH_SIZE,
        "epochs": EPOCHS,
        "lr": LR,
        "weight_decay": WEIGHT_DECAY,
        "warmup_ratio": WARMUP_RATIO,
        "grad_accum": GRAD_ACCUM,
        "unfreeze_last_k": UNFREEZE_LAST_K,
    },
    reinit=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=tokenizer,  # recent transformers releases deprecate this in favor of processing_class=
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[PrintAndWBCallback(print_every=20)],
)

print(f"[Trainer] Starting fine-tune → output_dir={out_dir}")
train_out = trainer.train()
print(train_out)

# best checkpoint path on disk
best_path = trainer.state.best_model_checkpoint
print("Best checkpoint dir:", best_path)
wandb_run.summary["best_checkpoint_dir"] = best_path
wandb.finish()




[Trainer] Starting fine-tune → output_dir=hf_ckpts\microsoft__mdeberta-v3-base_ex5_trainer
[Run] epochs=12 bs=16 lr=3.00e-05 wd=5.0e-02 warmup_ratio=0.06 grad_accum=1


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,1.2132,1.232292,0.48275,0.500943,0.50731,0.496841
2,1.0806,1.058841,0.561467,0.58452,0.592781,0.577479
3,0.9622,0.983838,0.59621,0.615364,0.619935,0.611747
4,0.895,0.967778,0.605199,0.623989,0.630168,0.619882
5,0.954,0.9493,0.618319,0.632191,0.64067,0.63215
6,0.9398,0.952293,0.613703,0.634493,0.636548,0.628234
7,0.9532,0.948852,0.611516,0.631063,0.634211,0.626338
8,0.9995,0.942999,0.613703,0.63022,0.638575,0.628589
9,0.9022,0.946464,0.616861,0.636011,0.63837,0.630863
10,0.9534,0.946236,0.615889,0.633949,0.638359,0.630118


[e0 b20/2315] loss=1.6226 lr=3.42e-07
[e0 b40/2315] loss=1.6173 lr=7.02e-07
[e0 b60/2315] loss=1.6075 lr=1.06e-06
[e0 b80/2315] loss=1.6150 lr=1.42e-06
[e0 b100/2315] loss=1.6179 lr=1.78e-06
[e0 b120/2315] loss=1.6180 lr=2.14e-06
[e0 b140/2315] loss=1.6063 lr=2.50e-06
[e0 b160/2315] loss=1.6032 lr=2.86e-06
[e0 b180/2315] loss=1.6073 lr=3.22e-06
[e0 b200/2315] loss=1.6209 lr=3.58e-06
[e0 b220/2315] loss=1.6040 lr=3.94e-06
[e0 b240/2315] loss=1.6119 lr=4.30e-06
[e0 b260/2315] loss=1.6084 lr=4.66e-06
[e0 b280/2315] loss=1.5997 lr=5.02e-06
[e0 b300/2315] loss=1.6078 lr=5.38e-06
[e0 b320/2315] loss=1.6032 lr=5.74e-06
[e0 b340/2315] loss=1.6153 lr=6.10e-06
[e0 b360/2315] loss=1.6092 lr=6.46e-06
[e0 b380/2315] loss=1.6029 lr=6.82e-06
[e0 b400/2315] loss=1.6028 lr=7.18e-06
[e0 b420/2315] loss=1.5912 lr=7.54e-06
[e0 b440/2315] loss=1.5979 lr=7.90e-06
[e0 b460/2315] loss=1.6232 lr=8.26e-06
[e0 b480/2315] loss=1.5998 lr=8.62e-06
[e0 b500/2315] loss=1.6050 lr=8.98e-06
[e0 b520/2315] loss=1.5995 lr

[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


TrainOutput(global_step=27780, training_loss=1.0007773465952314, metrics={'train_runtime': 1182.5271, 'train_samples_per_second': 375.863, 'train_steps_per_second': 23.492, 'total_flos': 1.887617915188579e+16, 'train_loss': 1.0007773465952314, 'epoch': 12.0})
Best checkpoint dir: hf_ckpts\microsoft__mdeberta-v3-base_ex5_trainer\checkpoint-11575


Run history:
eval/accuracy,▁▅▇▇████████
eval/f1,▁▅▇▇████████
eval/loss,█▄▂▂▁▁▁▁▁▁▁▁
eval/precision,▁▅▇▇████████
eval/recall,▁▅▇▇████████
eval/runtime,▂▁▅▅▇▅▄▅▂▇█▂
eval/samples_per_second,▇█▄▄▂▄▅▃▇▂▁▆
eval/steps_per_second,▇█▄▄▂▄▅▃▇▂▁▆
train/epoch,▁▁▁▁▁▃▃▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇███
train/global_step,▁▁▁▁▁▁▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▄▅▅▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇█

Run summary:
best_checkpoint_dir,hf_ckpts\microsoft__...
eval/accuracy,0.6154
eval/f1,0.62948
eval/loss,0.94591
eval/precision,0.63426
eval/recall,0.63706
eval/runtime,4.0429
eval/samples_per_second,1018.092
eval/steps_per_second,63.816
total_flos,1.887617915188579e+16
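
The step counts in the log above follow from the dataset size and the batch size. Here is a small worked check using the numbers printed in this notebook (37,039 training examples, batch size 16, 12 epochs, warmup ratio 0.06). The rounding rules are assumptions about how the Trainer counts steps, not an excerpt of its code.

```python
import math

# Values copied from this notebook's logs.
n_train = 37039        # training examples after the 90/10 split
batch_size = 16
epochs = 12
warmup_ratio = 0.06

# Assumption: the last partial batch counts as a full step.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

# Assumption: the warmup ratio is turned into steps by rounding up.
warmup_steps = math.ceil(warmup_ratio * total_steps)

# The best checkpoint directory ends in "checkpoint-11575".
best_epoch = 11575 / steps_per_epoch

print(steps_per_epoch, total_steps, warmup_steps, best_epoch)
```

The result agrees with `b.../2315` in the step log and with `global_step=27780` in `TrainOutput`. It also places the best checkpoint at the end of epoch 5, which is the epoch with the highest validation F1 (0.63215) in the table.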


## HP Tuning Optuna

# 🧪 Ex.5 — Hyperparameter Tuning with HuggingFace `Trainer` (+ Optuna)

Now that the basic HF `Trainer` pipeline works, we’re doing a **focused HP search** directly on top of it.  
Goal: keep the Trainer setup clean and stable while letting **Optuna** explore a *small, sensible* space around our earlier best settings.

---

## 🔧 What we’re tuning (student-style plan)
- **Learning rate**: `1e-5 → 1e-4` (log scale)  
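  *Why*: this is the range that worked best in our earlier runs; we confirm it under the Trainer loop.
- **Weight decay**: `1e-6 → 1e-4` (log scale)  
  *Why*: larger values hurt before, so we restrict the search to the “safe” zone. This range lies well below the baseline `WEIGHT_DECAY = 0.05` used in the single Trainer run.
- **Unfreezing depth**: the last **8–12** encoder layers  
  *Why*: deeper unfreezing won in Ex.4; we test whether unfreezing all 12 layers still dominates here.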
- **Batch size**: **{4, 8, 16, 32}**  
  
---
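
As a side note on the "log scale" ranges listed above: sampling on a log scale means the exponent is drawn uniformly, so each decade of the range is equally likely. The sketch below illustrates that idea with the standard library only. It is not Optuna's implementation, and `sample_log_uniform` is a hypothetical helper name.

```python
import math
import random

def sample_log_uniform(rng: random.Random, low: float, high: float) -> float:
    """Draw a value whose log10 is uniform on [log10(low), log10(high)]."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10.0 ** exponent

rng = random.Random(42)  # seed chosen only for this illustration
draws = [sample_log_uniform(rng, 1e-5, 1e-4) for _ in range(10_000)]

# Under log-uniform sampling about half of the draws fall below the geometric
# mean sqrt(1e-5 * 1e-4) ~ 3.16e-5, not below the arithmetic midpoint 5.5e-5.
geo_mean = math.sqrt(1e-5 * 1e-4)
share_below = sum(d < geo_mean for d in draws) / len(draws)
print(f"geometric mean = {geo_mean:.2e}, share below = {share_below:.2f}")
```

For the learning rate this means values near `1e-5` are explored as often as values near `1e-4`, rather than most draws landing in the upper part of the range.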

## 🏗️ How we run it
- **Trainer factory per trial** → builds a fresh model and `TrainingArguments` for the sampled HPs.
- **Callbacks**:  
  - Custom print logger (lighter than full tqdm).  
  - **EarlyStoppingCallback** with patience = **4**.  
- **bf16/fp16** auto-detection for speed, dynamic padding (each batch padded to a multiple of 8), and fused AdamW where possible.
- **W&B**: each trial is a separate run, grouped under a single **study** for easy comparison.

---
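
The patience rule mentioned above can be sketched in plain Python. This is an illustration of the idea and not the `EarlyStoppingCallback` source; `early_stop_epoch` is a hypothetical helper. The F1 values are the validation scores for epochs 1 to 10 of the single Trainer run logged earlier in this notebook, which itself ran without early stopping.

```python
def early_stop_epoch(scores, patience, threshold=0.0):
    """Return (stop_epoch, best_epoch), both 1-based.

    stop_epoch is None if the patience budget is never exhausted.
    """
    best, best_epoch, bad_evals = None, None, 0
    for epoch, score in enumerate(scores, start=1):
        if best is None or score - best > threshold:
            # New best score: remember it and reset the counter.
            best, best_epoch, bad_evals = score, epoch, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Validation F1 per epoch, copied from the training table above.
f1_by_epoch = [0.496841, 0.577479, 0.611747, 0.619882, 0.63215,
               0.628234, 0.626338, 0.628589, 0.630863, 0.630118]

stop_epoch, best_epoch = early_stop_epoch(f1_by_epoch, patience=4)
print(stop_epoch, best_epoch)  # 9 5
```

Under this rule, a patience of 4 would have ended that run after epoch 9 and kept the epoch 5 checkpoint.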

## 🎯 What we expect
- Best trials should cluster again around **LR ≈ few×1e-5**, **low weight decay**, and **deep unfreezing** (≥9, often 12).  
- Smaller batches (**8–16**) typically remain more stable than very large ones.

---

## 📦 After the study
We **save the best params to JSON** (`hf_ckpts/best_params_optuna{i}.json`) so we can retrain cleanly later  
(with possibly more epochs) without depending on the Optuna object.
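
A minimal sketch of that round trip, assuming the same keys as the `best` dictionary written at the end of the study cell. The values below are hypothetical placeholders, and a temporary directory stands in for `hf_ckpts/`.

```python
import json
import os
import tempfile

# Hypothetical values for illustration; the real ones come from study.best_trial.params.
best = {
    "lr": 3e-5,
    "weight_decay": 1e-5,
    "num_unfreeze_last_layers": 10,
    "batch_size": 16,
    "epochs": 12,
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "best_params_optuna1.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(best, f, indent=2)

    # Later, for example in a fresh kernel: reload without any Optuna object.
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# Map the stored keys onto the argument names used when retraining.
retrain_kwargs = {
    "learning_rate": float(loaded["lr"]),
    "weight_decay": float(loaded["weight_decay"]),
    "per_device_train_batch_size": int(loaded["batch_size"]),
    "num_train_epochs": int(loaded["epochs"]),
}
unfreeze_last_k = int(loaded["num_unfreeze_last_layers"])
print(retrain_kwargs, unfreeze_last_k)
```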

---

## W&B Link
https://wandb.ai/adishalit1-tel-aviv-university/adv-dl-p2-ex-5-deberta-16-08-try

In [2]:
from sklearn.model_selection import train_test_split
import pandas as pd
# Load CSVs (your files have columns: ['UserName','ScreenName','Location','TweetAt','OriginalTweet','Sentiment'])
TRAIN_CSV = "Corona_NLP_train_cleaned_translated.csv"   # or "Corona_NLP_train.csv"
TEST_CSV  = "Corona_NLP_test_cleaned_translated.csv"    # or "Corona_NLP_test.csv"


df_train = pd.read_csv(TRAIN_CSV, encoding="utf-8", engine="python")
df_test  = pd.read_csv(TEST_CSV,  encoding="utf-8", engine="python")

In [4]:
# --- Ex.5: HF Trainer version with W&B + prints (Windows/RTX 4090 friendly) ---

import os, math, random, time, json
from typing import Dict, Any

import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report, confusion_matrix

import wandb
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    DataCollatorWithPadding, Trainer, TrainingArguments
)
from transformers.trainer_callback import TrainerCallback, TrainerState, TrainerControl

# ---- Fast CUDA defaults (4090) ----
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
if torch.cuda.is_available():
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    try:
        torch.set_float32_matmul_precision("high")
    except Exception:
        pass

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
SEED = 42
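def set_seed(s=42):
    random.seed(s)
    np.random.seed(s)
    torch.manual_seed(s)
    torch.cuda.manual_seed_all(s)
    # Speed over bit-exact reproducibility: cuDNN autotuning stays on,
    # so repeated runs with the same seed can still differ slightly.
    torch.backends.cudnn.deterministic = False
    torch.backends.cudnn.benchmark = True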
set_seed(SEED)

# ---- W&B defaults ----
os.environ.setdefault("WANDB_MODE", "online")
os.environ.setdefault("WANDB_PROJECT", "adv-dl-p2-ex-5-deberta-16-08-try")
os.environ.setdefault("WANDB_NOTEBOOK_NAME", "ex5_trainer_new.ipynb")
WANDB_PROJECT = os.environ["WANDB_PROJECT"]

# ---- Constants ----
MODEL_NAME = "microsoft/mdeberta-v3-base"
BASE_RUN_NAME = MODEL_NAME.replace("/", "__") + "_ex5_trainer-try"
MAX_LEN = 512
BATCH_SIZE = 16
EPOCHS = 12
PATIENCE = 4
LR = 3e-5
WEIGHT_DECAY = 0.05
WARMUP_RATIO = 0.06
GRAD_ACCUM = 1
NUM_WORKERS = 0  # Windows-safe; raise to 2 if stable

# ---- Label mapping (5-way) ----
CANON = {
    "extremely negative": "extremely negative",
    "negative": "negative",
    "neutral": "neutral",
    "positive": "positive",
    "extremely positive": "extremely positive",
}
ORDER = ["extremely negative","negative","neutral","positive","extremely positive"]
LABEL2ID = {lab: i for i, lab in enumerate(ORDER)}
ID2LABEL = {i: lab for lab, i in LABEL2ID.items()}

def normalize_label(s: str) -> str:
    s = str(s).strip().lower()
    s = s.replace("very negative", "extremely negative")
    s = s.replace("very positive", "extremely positive")
    s = s.replace("extreme negative", "extremely negative")
    s = s.replace("extreme positive", "extremely positive")
    return CANON.get(s, s)

# ---- Prep dataframes from df_train/df_test already in memory ----
assert "OriginalTweet" in df_train.columns and "Sentiment" in df_train.columns
assert "OriginalTweet" in df_test.columns and "Sentiment" in df_test.columns

def prep_df(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna(subset=["OriginalTweet", "Sentiment"]).copy()
    df["text"] = df["OriginalTweet"].astype(str).str.strip()
    df["label_name"] = df["Sentiment"].apply(normalize_label)
    df = df[df["label_name"].isin(ORDER)].reset_index(drop=True)
    df["labels"] = df["label_name"].map(LABEL2ID)
    return df[["text", "labels", "label_name"]]

dftrain_ = prep_df(df_train)
dftest_  = prep_df(df_test)

train_df, val_df = train_test_split(
    dftrain_, test_size=0.1, stratify=dftrain_["labels"], random_state=SEED
)

print(f"Train/Val/Test sizes: {len(train_df)}/{len(val_df)}/{len(dftest_)}")

from datasets import Dataset
from transformers import DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

def tok_fn(batch):
    return tokenizer(batch["text"], padding=False, truncation=True, max_length=MAX_LEN)

def keep_only(ds, cols):
    # Use select_columns if available, else remove the others
    if hasattr(ds, "select_columns"):
        return ds.select_columns(cols)
    drop = [c for c in ds.column_names if c not in cols]
    return ds.remove_columns(drop)

# Build datasets (avoid adding __index_level_0__ with preserve_index=False)
train_ds = Dataset.from_pandas(train_df,  preserve_index=False)
val_ds   = Dataset.from_pandas(val_df,    preserve_index=False)
test_ds  = Dataset.from_pandas(dftest_,   preserve_index=False)

# Keep just text + labels before tokenization
cols_keep = ["text", "labels"]
train_ds = keep_only(train_ds, cols_keep)
val_ds   = keep_only(val_ds,   cols_keep)
test_ds  = keep_only(test_ds,  cols_keep)

# Tokenize
train_ds = train_ds.map(tok_fn, batched=True)
val_ds   = val_ds.map(tok_fn,   batched=True)
test_ds  = test_ds.map(tok_fn,  batched=True)

# Remove raw text after tokenization, keep labels + token IDs
train_ds = train_ds.remove_columns(["text"])
val_ds   = val_ds.remove_columns(["text"])
test_ds  = test_ds.remove_columns(["text"])

# Ensure final columns are exactly what Trainer expects
final_cols = ["input_ids", "attention_mask", "labels"]
train_ds = keep_only(train_ds, final_cols)
val_ds   = keep_only(val_ds,   final_cols)
test_ds  = keep_only(test_ds,  final_cols)

# # --- Speed tip: length stats + length column for bucketing ---
# def _len_map(batch):
#     return {"input_length": [len(x) for x in batch["input_ids"]]}
#
# train_ds = train_ds.map(_len_map, batched=True)
# val_ds   = val_ds.map(_len_map,   batched=True)
# test_ds  = test_ds.map(_len_map,  batched=True)

# Optional: quick distribution print so you can decide a smaller MAX_LEN later
import numpy as np
# lens = np.array(train_ds["input_length"])
# print(f"[len] p50={np.percentile(lens,50)} p90={np.percentile(lens,90)} "
#       f"p95={np.percentile(lens,95)} max={lens.max()}")

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, pad_to_multiple_of=8)

from transformers import AutoModelForSequenceClassification

UNFREEZE_LAST_K = 10  # 1..12 are sensible for DeBERTa-v3-base

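def build_model(unfreeze_last_k=UNFREEZE_LAST_K):
    # NOTE: torch_dtype=bfloat16 stores the weights themselves in bf16, so optimizer
    # updates are also applied in bf16; pass torch_dtype=None to keep fp32 weights
    # and rely on the bf16/fp16 autocast from TrainingArguments instead.
    bf16_weights = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=len(ORDER),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
        torch_dtype=(torch.bfloat16 if bf16_weights else None),
        use_safetensors=True,
    )
    base = getattr(model, "deberta", None) or getattr(model, "roberta", None) or getattr(model, "bert", None)
    if base is not None and hasattr(base, "encoder") and hasattr(base.encoder, "layer"):
        # freeze the whole backbone (embeddings included)
        for p in base.parameters():
            p.requires_grad = False
        # unfreeze the last k transformer blocks; guard k <= 0 because layer[-0:] would select every block
        k = max(0, int(unfreeze_last_k))
        for layer in (base.encoder.layer[-k:] if k > 0 else []):
            for p in layer.parameters():
                p.requires_grad = True
    # classifier head always trainable (for DeBERTa the pooler sits outside `base`, so it is never frozen)
    for p in model.classifier.parameters():
        p.requires_grad = True
    return model.to(DEVICE)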

model = build_model()

# # after model = build_model()
# if torch.cuda.is_available():
#     model.gradient_checkpointing_enable()          # compute–memory tradeoff → larger batch


trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable params: {trainable:,} / {total:,} ({100*trainable/total:.2f}%) ; unfreeze_last_k={UNFREEZE_LAST_K}")
import evaluate
metric_acc = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    acc = metric_acc.compute(predictions=preds, references=labels)["accuracy"]
    p, r, f1, _ = precision_recall_fscore_support(labels, preds, average="macro", zero_division=0)
    return {"accuracy": acc, "precision": p, "recall": r, "f1": f1}

class PrintAndWBCallback(TrainerCallback):
    def __init__(self, print_every=100):
        self.print_every = print_every
        self.steps_per_epoch = None
        self.last_print_step = -1

    def on_train_begin(self, args, state: TrainerState, control: TrainerControl, **kwargs):
        # steps_per_epoch cannot be computed here; it is derived in on_step_begin
        print(f"[Run] epochs={args.num_train_epochs} bs={args.per_device_train_batch_size} "
              f"lr={args.learning_rate:.2e} wd={args.weight_decay:.1e} "
              f"warmup_ratio={args.warmup_ratio} grad_accum={args.gradient_accumulation_steps}")

    def on_step_begin(self, args, state: TrainerState, control: TrainerControl, **kwargs):
        # set steps_per_epoch once we have dataloader info exposed in state
        if self.steps_per_epoch is None and state.max_steps is not None and state.num_train_epochs:
            approx_steps_total = state.max_steps
            self.steps_per_epoch = max(1, int(approx_steps_total / math.ceil(state.num_train_epochs)))

    def on_log(self, args, state: TrainerState, control: TrainerControl, logs=None, **kwargs):
        if logs is None: return
        # prints e.g. "[e0 b100/4630] loss=... lr=..." (the epoch index is zero-based)
        if self.steps_per_epoch:
            step_in_epoch = (state.global_step % self.steps_per_epoch) or self.steps_per_epoch
            if step_in_epoch % self.print_every == 0 and state.global_step != self.last_print_step:
                loss = logs.get("loss", logs.get("train_loss", None))
                lr = logs.get("learning_rate", None)
                sps = logs.get("train_samples_per_second", None) or logs.get("samples_per_second", None)
                txt = f"[e{int(state.epoch or 0)} b{step_in_epoch}/{self.steps_per_epoch}]"
                if loss is not None: txt += f" loss={loss:.4f}"
                if lr   is not None: txt += f" lr={lr:.2e}"
                if sps  is not None: txt += f" samp/s={sps:.1f}"
                print(txt, flush=True)
                self.last_print_step = state.global_step

    def on_evaluate(self, args, state: TrainerState, control: TrainerControl, metrics, **kwargs):
        if metrics:
            msg = (f"[val @ epoch {int(state.epoch or 0)}] "
                   f"acc={metrics.get('eval_accuracy',0):.4f} "
                   f"f1={metrics.get('eval_f1',0):.4f} "
                   f"p={metrics.get('eval_precision',0):.4f} "
                   f"r={metrics.get('eval_recall',0):.4f}")
            print(msg, flush=True)





Train/Val/Test sizes: 37039/4116/3798


Map: 100%|██████████| 37039/37039 [00:01<00:00, 28774.57 examples/s]
Map: 100%|██████████| 4116/4116 [00:00<00:00, 25847.50 examples/s]
Map: 100%|██████████| 3798/3798 [00:00<00:00, 24327.51 examples/s]
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 71,473,157 / 278,813,189 (25.63%) ; unfreeze_last_k=10
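
The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the usual base-size dimensions (hidden size 768, feed-forward width 3072) and the block layout described in the comments. Both are assumptions for this worked check and are not read from the model config.

```python
HIDDEN = 768       # hidden size (assumed base-model config)
FFN = 3072         # feed-forward width (assumed base-model config)
NUM_LABELS = 5

def linear(n_in: int, n_out: int) -> int:
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out

layer_norm = 2 * HIDDEN  # scale and shift vectors

per_layer = (
    3 * linear(HIDDEN, HIDDEN)               # query, key, value projections
    + linear(HIDDEN, HIDDEN) + layer_norm    # attention output + LayerNorm
    + linear(HIDDEN, FFN)                    # feed-forward up-projection
    + linear(FFN, HIDDEN) + layer_norm       # feed-forward down-projection + LayerNorm
)

# Pooler dense layer plus the 5-way classifier, both outside the frozen backbone.
head = linear(HIDDEN, HIDDEN) + linear(HIDDEN, NUM_LABELS)

def trainable(k: int) -> int:
    """Trainable parameters when the last k encoder blocks are unfrozen."""
    return k * per_layer + head

print(per_layer, head, trainable(10))
```

With these assumptions `trainable(10)` gives 71,473,157, the count printed for `unfreeze_last_k=10`. The remaining parameters are mostly the frozen embedding table, which is why even `k=12` trains only about 31% of the model.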


In [3]:
UNFREEZE_LAST_K = 10

In [4]:
# --- Ex.5: Optuna study (N_TRIALS=10 trials × REFINE_EPOCHS=8 epochs) around the current best HPs ---

import optuna
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

STUDY_GROUP   = BASE_RUN_NAME + "_study"
REFINE_EPOCHS = 8         # epochs per trial
N_TRIALS      = 10        # number of Optuna trials
LOG_STEPS     = 100

# Convenience helper: candidate batch sizes around the current BATCH_SIZE, safe on a 4090
# (currently unused; the objective below samples from a fixed list instead)
def bs_candidates(bs):
    cands = {bs}
    if bs <= 16:
        cands.add(min(32, bs * 2))
    else:
        cands.add(max(8, bs // 2))
    # keep them reasonable
    return sorted({b for b in cands if 4 <= b <= 64})

# Trainer factory for a single trial
def build_trainer_for_trial(lr, wd, k_unfreeze, bs, run_name, epochs):
    model = build_model(unfreeze_last_k=int(k_unfreeze))
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable params: {trainable:,} / {total:,} ({100*trainable/total:.2f}%) ; unfreeze_last_k={k_unfreeze}")

    # one checkpoint directory per trial
    out_dir = os.path.join("hf_ckpts", run_name)
    os.makedirs(out_dir, exist_ok=True)

    bf16_ok = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
    args = TrainingArguments(
        output_dir=out_dir,
        run_name=run_name,
        report_to=["wandb"],
        seed=SEED,
        learning_rate=float(lr),
        weight_decay=float(wd),
        warmup_ratio=WARMUP_RATIO,
        num_train_epochs=int(epochs),
        per_device_train_batch_size=int(bs),
        per_device_eval_batch_size=int(bs),
        gradient_accumulation_steps=GRAD_ACCUM,
        dataloader_num_workers=NUM_WORKERS,
        dataloader_pin_memory=True,
        dataloader_prefetch_factor=2 if NUM_WORKERS > 0 else None,
        fp16=(torch.cuda.is_available() and not bf16_ok),
        bf16=bf16_ok,
        optim="adamw_torch_fused" if torch.cuda.is_available() else "adamw_torch",
        save_safetensors=True,
        disable_tqdm=True,             # we print ourselves
        logging_strategy="steps",
        logging_steps=LOG_STEPS,
        eval_strategy="epoch",
        save_strategy="epoch",
        save_total_limit=1,
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        greater_is_better=True,
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        eval_dataset=val_ds,
        tokenizer=tokenizer,  # recent transformers releases deprecate this in favor of processing_class=
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        callbacks=[
            PrintAndWBCallback(print_every=LOG_STEPS),
            EarlyStoppingCallback(early_stopping_patience=PATIENCE),
        ],
    )
    return trainer

# Objective: search narrowly around your current best HPs
def objective(trial: optuna.trial.Trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-4, log=True)
    wd = trial.suggest_float("weight_decay",1e-6, 1e-4, log=True)
    k  = trial.suggest_int("unfreeze_last_k", 8,12)
    bs = trial.suggest_categorical("batch_size", [4, 8, 16, 32])

    run_name = f"{BASE_RUN_NAME}__t{trial.number}"

    # W&B run per trial (grouped under the study for easy comparison)
    wb = wandb.init(
        project=WANDB_PROJECT,
        name=run_name,
        group=STUDY_GROUP,
        job_type="optuna-trial",
        tags=["optuna","trainer","ex5"],
        config={
            "trial": trial.number, "model": MODEL_NAME,
            "lr": lr, "weight_decay": wd, "unfreeze_last_k": k,
            "batch_size": bs, "epochs": REFINE_EPOCHS,
            "warmup_ratio": WARMUP_RATIO, "max_len": MAX_LEN,
        },
        settings=wandb.Settings(start_method="thread"),
    )

    print(f"[TUNE] trial={trial.number} | epochs={REFINE_EPOCHS} bs={bs} lr={lr:.2e} wd={wd:.1e} k={k}")

    trainer = build_trainer_for_trial(lr, wd, k, bs, run_name, REFINE_EPOCHS)

    # Train + epoch evals (printed by callback; logged to W&B automatically)
    trainer.train()

    # Final eval for the objective score
    metrics = trainer.evaluate()
    f1 = float(metrics.get("eval_f1", 0.0))
    acc = float(metrics.get("eval_accuracy", 0.0))

    # Extra trial-level logging
    wandb.log({"trial/f1": f1, "trial/accuracy": acc})
    wb.summary["best_model_ckpt"] = trainer.state.best_model_checkpoint
    wb.summary["best_eval_f1"] = f1
    wb.summary["params"] = dict(lr=lr, weight_decay=wd, unfreeze_last_k=k, batch_size=bs, epochs=REFINE_EPOCHS)
    wandb.finish()

    print(f"[TUNE-END] trial={trial.number} f1={f1:.4f}")
    return f1

# Run the study
print(f"[Study] Starting Optuna: trials={N_TRIALS}, epochs/trial={REFINE_EPOCHS} ; centered at "
      f"lr={LR:.2e}, wd={WEIGHT_DECAY:.1e}, k={UNFREEZE_LAST_K}, bs={BATCH_SIZE}")
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=N_TRIALS, show_progress_bar=True)

# Persist best params
best = {
    "lr": float(study.best_trial.params["lr"]),
    "weight_decay": float(study.best_trial.params["weight_decay"]),
    "num_unfreeze_last_layers": int(study.best_trial.params["unfreeze_last_k"]),
    "batch_size": int(study.best_trial.params["batch_size"]),
    "epochs": int(EPOCHS),   # you can bump this later for final train if you want
}
os.makedirs("hf_ckpts", exist_ok=True)
with open("hf_ckpts/best_params_optuna1.json", "w") as f:
    json.dump(best, f, indent=2)

print("[Study best]", best)
print("Saved best params → hf_ckpts/best_params_optuna1.json")


[I 2025-08-16 21:41:10,602] A new study created in memory with name: no-name-3b13a031-3415-41ed-8e90-8e6b61be55de


[Study] Starting Optuna: trials=10, epochs/trial=8 ; centered at lr=3.00e-05, wd=5.0e-02, k=10, bs=16


[34m[1mwandb[0m: Currently logged in as: [33madishalit1[0m ([33madishalit1-tel-aviv-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


[TUNE] trial=0 | epochs=8 bs=8 lr=1.20e-05 wd=6.1e-06 k=12


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 85,648,901 / 278,813,189 (30.72%) ; unfreeze_last_k=12




[Run] epochs=8 bs=8 lr=1.20e-05 wd=6.1e-06 warmup_ratio=0.06 grad_accum=1
[e0 b100/4630] loss=1.6069 lr=5.34e-07
{'loss': 1.6069, 'grad_norm': 3.875, 'learning_rate': 5.336356965817732e-07, 'epoch': 0.02159827213822894}
[e0 b200/4630] loss=1.5959 lr=1.07e-06
{'loss': 1.5959, 'grad_norm': 4.78125, 'learning_rate': 1.0726616527249786e-06, 'epoch': 0.04319654427645788}
[e0 b300/4630] loss=1.6043 lr=1.61e-06
{'loss': 1.6043, 'grad_norm': 3.28125, 'learning_rate': 1.6116876088681837e-06, 'epoch': 0.06479481641468683}
[e0 b400/4630] loss=1.6024 lr=2.15e-06
{'loss': 1.6024, 'grad_norm': 1.953125, 'learning_rate': 2.150713565011389e-06, 'epoch': 0.08639308855291576}
[e0 b500/4630] loss=1.6011 lr=2.69e-06
{'loss': 1.6011, 'grad_norm': 3.890625, 'learning_rate': 2.6897395211545945e-06, 'epoch': 0.1079913606911447}
[e0 b600/4630] loss=1.6014 lr=3.23e-06
{'loss': 1.6014, 'grad_norm': 3.046875, 'learning_rate': 3.2287654772977998e-06, 'epoch': 0.12958963282937366}
[e0 b700/4630] loss=1.5988 lr=3.77

[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


Run history:
eval/accuracy,▁▆▆█▇▇▇▇█
eval/f1,▁▇▆█▇█▇▇█
eval/loss,█▃▂▁▁▁▁▁▁
eval/precision,▁█▇▇███▇▇
eval/recall,▁▇▆█▇▇▇▇█
eval/runtime,▁▄█▁▁▃▃▁▄
eval/samples_per_second,█▄▁██▆▆█▅
eval/steps_per_second,█▄▁██▆▆█▅
train/epoch,▁▁▁▁▁▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▆▆▆▇▇▇█
train/global_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇███

Run summary:
best_eval_f1,0.2534
best_model_ckpt,hf_ckpts\microsoft__...
eval/accuracy,0.34208
eval/f1,0.2534
eval/loss,1.48863
eval/precision,0.42302
eval/recall,0.28175
eval/runtime,8.4055
eval/samples_per_second,489.679
eval/steps_per_second,61.269


Best trial: 0. Best value: 0.253403:  10%|█         | 1/10 [29:02<4:21:25, 1742.81s/it]

[TUNE-END] trial=0 f1=0.2534
[I 2025-08-16 22:10:13,407] Trial 0 finished with value: 0.25340324101829476 and parameters: {'lr': 1.1982547005063454e-05, 'weight_decay': 6.101718790027908e-06, 'unfreeze_last_k': 12, 'batch_size': 8}. Best is trial 0 with value: 0.25340324101829476.


[TUNE] trial=1 | epochs=8 bs=8 lr=2.86e-05 wd=1.4e-06 k=9


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 64,385,285 / 278,813,189 (23.09%) ; unfreeze_last_k=9
[Run] epochs=8 bs=8 lr=2.86e-05 wd=1.4e-06 warmup_ratio=0.06 grad_accum=1




[e0 b100/4630] loss=1.6398 lr=1.27e-06
{'loss': 1.6398, 'grad_norm': 4.0625, 'learning_rate': 1.273024361730603e-06, 'epoch': 0.02159827213822894}
[e0 b200/4630] loss=1.6456 lr=2.56e-06
{'loss': 1.6456, 'grad_norm': 5.09375, 'learning_rate': 2.558907555397879e-06, 'epoch': 0.04319654427645788}
[e0 b300/4630] loss=1.6495 lr=3.84e-06
{'loss': 1.6495, 'grad_norm': 3.546875, 'learning_rate': 3.844790749065155e-06, 'epoch': 0.06479481641468683}
[e0 b400/4630] loss=1.6410 lr=5.13e-06
{'loss': 1.641, 'grad_norm': 2.0625, 'learning_rate': 5.1306739427324315e-06, 'epoch': 0.08639308855291576}
[e0 b500/4630] loss=1.6399 lr=6.42e-06
{'loss': 1.6399, 'grad_norm': 4.375, 'learning_rate': 6.416557136399707e-06, 'epoch': 0.1079913606911447}
[e0 b600/4630] loss=1.6351 lr=7.70e-06
{'loss': 1.6351, 'grad_norm': 3.34375, 'learning_rate': 7.702440330066983e-06, 'epoch': 0.12958963282937366}
[e0 b700/4630] loss=1.6268 lr=8.99e-06
{'loss': 1.6268, 'grad_norm': 4.03125, 'learning_rate': 8.98832352373426e-06,

[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


Run history:
eval/accuracy,▁▄▇▆█████
eval/f1,▁▃▇▆█████
eval/loss,█▅▂▂▁▁▁▁▁
eval/precision,▁▄█▆█████
eval/recall,▁▅▇▆█████
eval/runtime,▁▁▃▂▇██▁▂
eval/samples_per_second,██▆▇▁▁▁█▇
eval/steps_per_second,██▆▇▁▁▁█▇
train/epoch,▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▄▄▄▅▅▅▅▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇██
train/global_step,▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇█

Run summary:
best_eval_f1,0.59566
best_model_ckpt,hf_ckpts\microsoft__...
eval/accuracy,0.58115
eval/f1,0.59566
eval/loss,1.01872
eval/precision,0.60371
eval/recall,0.60642
eval/runtime,8.0287
eval/samples_per_second,512.658
eval/steps_per_second,64.145


Best trial: 1. Best value: 0.595659:  20%|██        | 2/10 [56:20<3:44:06, 1680.75s/it]

[TUNE-END] trial=1 f1=0.5957
[I 2025-08-16 22:37:30,722] Trial 1 finished with value: 0.5956587069208666 and parameters: {'lr': 2.8585183395223546e-05, 'weight_decay': 1.4467147485609183e-06, 'unfreeze_last_k': 9, 'batch_size': 8}. Best is trial 1 with value: 0.5956587069208666.


[TUNE] trial=2 | epochs=8 bs=32 lr=9.03e-05 wd=2.3e-06 k=12


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 85,648,901 / 278,813,189 (30.72%) ; unfreeze_last_k=12
[Run] epochs=8 bs=32 lr=9.03e-05 wd=2.3e-06 warmup_ratio=0.06 grad_accum=1




[e0 b100/1158] loss=1.6382 lr=1.61e-05
{'loss': 1.6382, 'grad_norm': 1.5078125, 'learning_rate': 1.6086790006477026e-05, 'epoch': 0.08635578583765112}
[e0 b200/1158] loss=1.5846 lr=3.23e-05
{'loss': 1.5846, 'grad_norm': 4.09375, 'learning_rate': 3.233607284130231e-05, 'epoch': 0.17271157167530224}
[e0 b300/1158] loss=1.4865 lr=4.86e-05
{'loss': 1.4865, 'grad_norm': 7.0, 'learning_rate': 4.8585355676127585e-05, 'epoch': 0.25906735751295334}
[e0 b400/1158] loss=1.3073 lr=6.48e-05
{'loss': 1.3073, 'grad_norm': 12.3125, 'learning_rate': 6.483463851095286e-05, 'epoch': 0.3454231433506045}
[e0 b500/1158] loss=1.1409 lr=8.11e-05
{'loss': 1.1409, 'grad_norm': 17.375, 'learning_rate': 8.108392134577814e-05, 'epoch': 0.4317789291882556}
[e0 b600/1158] loss=0.9982 lr=8.99e-05
{'loss': 0.9982, 'grad_norm': 11.9375, 'learning_rate': 8.989988503060535e-05, 'epoch': 0.5181347150259067}
[e0 b700/1158] loss=0.9290 lr=8.89e-05
{'loss': 0.929, 'grad_norm': 13.375, 'learning_rate': 8.88623791445049e-05, '

[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


Run history:
eval/accuracy,▁▅▆▇█████
eval/f1,▁▅▆▇█████
eval/loss,█▄▂▁▁▁▂▁▁
eval/precision,▁▅▆▇█████
eval/recall,▁▅▆▇█████
eval/runtime,▁▃▆▄▄▃▁▇█
eval/samples_per_second,█▆▃▅▅▅█▂▁
eval/steps_per_second,█▆▃▅▅▅█▂▁
train/epoch,▁▁▂▂▂▂▃▃▃▃▃▃▃▃▃▄▄▄▄▄▄▄▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇███
train/global_step,▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▄▅▅▆▆▆▆▆▇▇▇▇▇▇████

0,1
best_eval_f1,0.85234
best_model_ckpt,hf_ckpts\microsoft__...
eval/accuracy,0.84791
eval/f1,0.85234
eval/loss,0.5006
eval/precision,0.84757
eval/recall,0.86184
eval/runtime,2.251
eval/samples_per_second,1828.56
eval/steps_per_second,57.309


Best trial: 2. Best value: 0.852341:  30%|███       | 3/10 [1:04:09<2:11:33, 1127.71s/it]

[TUNE-END] trial=2 f1=0.8523
[I 2025-08-16 22:45:20,312] Trial 2 finished with value: 0.8523413656247504 and parameters: {'lr': 9.034601256162855e-05, 'weight_decay': 2.3378864159216274e-06, 'unfreeze_last_k': 12, 'batch_size': 32}. Best is trial 2 with value: 0.8523413656247504.


[TUNE] trial=3 | epochs=8 bs=8 lr=1.33e-05 wd=3.5e-05 k=8


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Trainable params: 57,297,413 / 278,813,189 (20.55%) ; unfreeze_last_k=8
[Run] epochs=8 bs=8 lr=1.33e-05 wd=3.5e-05 warmup_ratio=0.06 grad_accum=1
[e0 b100/4630] loss=1.6396 lr=5.91e-07
{'loss': 1.6396, 'grad_norm': 4.0625, 'learning_rate': 5.910886702551045e-07, 'epoch': 0.02159827213822894}
[e0 b200/4630] loss=1.6463 lr=1.19e-06
{'loss': 1.6463, 'grad_norm': 5.125, 'learning_rate': 1.1881479331390485e-06, 'epoch': 0.04319654427645788}
[e0 b300/4630] loss=1.6510 lr=1.79e-06
{'loss': 1.651, 'grad_norm': 3.578125, 'learning_rate': 1.7852071960229926e-06, 'epoch': 0.06479481641468683}
[e0 b400/4630] loss=1.6430 lr=2.38e-06
{'loss': 1.643, 'grad_norm': 2.09375, 'learning_rate': 2.3822664589069365e-06, 'epoch': 0.08639308855291576}
[e0 b500/4630] loss=1.6438 lr=2.98e-06
{'loss': 1.6438, 'grad_norm': 4.4375, 'learning_rate': 2.979325721790881e-06, 'epoch': 0.1079913606911447}
[e0 b600/4630] loss=1.6423 lr=3.58e-06
{'loss': 1.6423, 'grad_norm': 3.4375, 'learning_rate': 3.5763849846748247e-06,

[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


0,1
eval/accuracy,▁▄▇▆██▇██
eval/f1,▁▆█▇██▇██
eval/loss,█▃▂▁▁▁▁▁▁
eval/precision,▁▄▇▄▆▇▆█▆
eval/recall,▁▅█▇██▇██
eval/runtime,▂▄▁▄▃█▄▃▆
eval/samples_per_second,▇▄█▅▆▁▅▆▃
eval/steps_per_second,▇▄█▅▆▁▅▆▃
train/epoch,▁▁▁▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇█████
train/global_step,▁▁▁▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇███

0,1
best_eval_f1,0.28886
best_model_ckpt,hf_ckpts\microsoft__...
eval/accuracy,0.35374
eval/f1,0.28886
eval/loss,1.46994
eval/precision,0.43183
eval/recall,0.30144
eval/runtime,7.8713
eval/samples_per_second,522.91
eval/steps_per_second,65.427


Best trial: 2. Best value: 0.852341:  40%|████      | 4/10 [1:27:31<2:03:36, 1236.02s/it]

[TUNE-END] trial=3 f1=0.2889
[I 2025-08-16 23:08:42,361] Trial 3 finished with value: 0.2888584099189607 and parameters: {'lr': 1.3272627413910076e-05, 'weight_decay': 3.5086905570042316e-05, 'unfreeze_last_k': 8, 'batch_size': 8}. Best is trial 2 with value: 0.8523413656247504.


[TUNE] trial=4 | epochs=8 bs=8 lr=4.68e-05 wd=1.3e-05 k=10


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 71,473,157 / 278,813,189 (25.63%) ; unfreeze_last_k=10


  trainer = Trainer(


[Run] epochs=8 bs=8 lr=4.68e-05 wd=1.3e-05 warmup_ratio=0.06 grad_accum=1
[e0 b100/4630] loss=1.6396 lr=2.08e-06
{'loss': 1.6396, 'grad_norm': 4.0625, 'learning_rate': 2.0841980876303363e-06, 'epoch': 0.02159827213822894}
[e0 b200/4630] loss=1.6452 lr=4.19e-06
{'loss': 1.6452, 'grad_norm': 5.0625, 'learning_rate': 4.189448681196333e-06, 'epoch': 0.04319654427645788}
[e0 b300/4630] loss=1.6473 lr=6.29e-06
{'loss': 1.6473, 'grad_norm': 3.796875, 'learning_rate': 6.294699274762329e-06, 'epoch': 0.06479481641468683}
[e0 b400/4630] loss=1.6355 lr=8.40e-06
{'loss': 1.6355, 'grad_norm': 2.09375, 'learning_rate': 8.399949868328326e-06, 'epoch': 0.08639308855291576}
[e0 b500/4630] loss=1.6321 lr=1.05e-05
{'loss': 1.6321, 'grad_norm': 4.34375, 'learning_rate': 1.0505200461894322e-05, 'epoch': 0.1079913606911447}
[e0 b600/4630] loss=1.6199 lr=1.26e-05
{'loss': 1.6199, 'grad_norm': 3.546875, 'learning_rate': 1.2610451055460319e-05, 'epoch': 0.12958963282937366}
[e0 b700/4630] loss=1.5924 lr=1.47e-

[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


0,1
eval/accuracy,▁▆▇▇█████
eval/f1,▁▆▇▇█████
eval/loss,█▃▂▂▂▁▁▁▂
eval/precision,▁▅▆▇█▇███
eval/recall,▁▇▇██████
eval/runtime,▄▄▄▃▆▁▃▅█
eval/samples_per_second,▅▅▅▆▃█▆▄▁
eval/steps_per_second,▅▅▅▆▃█▆▄▁
train/epoch,▁▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▆▆▆▆▆▇▇▇▇▇█████
train/global_step,▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇██

0,1
best_eval_f1,0.74459
best_model_ckpt,hf_ckpts\microsoft__...
eval/accuracy,0.73591
eval/f1,0.74459
eval/loss,0.72212
eval/precision,0.74655
eval/recall,0.75284
eval/runtime,8.2033
eval/samples_per_second,501.752
eval/steps_per_second,62.78


Best trial: 2. Best value: 0.852341:  50%|█████     | 5/10 [1:53:41<1:53:02, 1356.52s/it]

[TUNE-END] trial=4 f1=0.7446
[I 2025-08-16 23:34:52,535] Trial 4 finished with value: 0.7445851369147556 and parameters: {'lr': 4.67997206949721e-05, 'weight_decay': 1.253691919878875e-05, 'unfreeze_last_k': 10, 'batch_size': 8}. Best is trial 2 with value: 0.8523413656247504.


[TUNE] trial=5 | epochs=8 bs=16 lr=9.51e-05 wd=2.0e-05 k=11


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 78,561,029 / 278,813,189 (28.18%) ; unfreeze_last_k=11


  trainer = Trainer(


[Run] epochs=8 bs=16 lr=9.51e-05 wd=2.0e-05 warmup_ratio=0.06 grad_accum=1
[e0 b100/2315] loss=1.6455 lr=8.47e-06
{'loss': 1.6455, 'grad_norm': 2.359375, 'learning_rate': 8.46672601852361e-06, 'epoch': 0.04319654427645788}
[e0 b200/2315] loss=1.6320 lr=1.70e-05
{'loss': 1.632, 'grad_norm': 1.5390625, 'learning_rate': 1.7018974522082814e-05, 'epoch': 0.08639308855291576}
[e0 b300/2315] loss=1.5935 lr=2.56e-05
{'loss': 1.5935, 'grad_norm': 4.40625, 'learning_rate': 2.5571223025642015e-05, 'epoch': 0.12958963282937366}
[e0 b400/2315] loss=1.5452 lr=3.41e-05
{'loss': 1.5452, 'grad_norm': 6.0625, 'learning_rate': 3.4123471529201215e-05, 'epoch': 0.17278617710583152}
[e0 b500/2315] loss=1.5109 lr=4.27e-05
{'loss': 1.5109, 'grad_norm': 19.25, 'learning_rate': 4.267572003276042e-05, 'epoch': 0.2159827213822894}
[e0 b600/2315] loss=1.4298 lr=5.12e-05
{'loss': 1.4298, 'grad_norm': 13.9375, 'learning_rate': 5.122796853631962e-05, 'epoch': 0.2591792656587473}
[e0 b700/2315] loss=1.3100 lr=5.98e-05

[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


0,1
eval/accuracy,▁▆▆██████
eval/f1,▁▆▆██████
eval/loss,█▂▂▁▁▂▂▂▁
eval/precision,▁▇▆██████
eval/recall,▁▆▇██████
eval/runtime,▁▁█▁▁▁▁▁▁
eval/samples_per_second,▇█▁█▇▇▇█▇
eval/steps_per_second,▇█▁█▇▇▇█▇
train/epoch,▁▁▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇█
train/global_step,▁▁▁▁▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▆▆▆▇▇▇██

0,1
best_eval_f1,0.85213
best_model_ckpt,hf_ckpts\microsoft__...
eval/accuracy,0.84815
eval/f1,0.85213
eval/loss,0.49958
eval/precision,0.84803
eval/recall,0.85902
eval/runtime,4.2163
eval/samples_per_second,976.211
eval/steps_per_second,61.191


Best trial: 2. Best value: 0.852341:  60%|██████    | 6/10 [2:08:17<1:19:32, 1193.08s/it]

[TUNE-END] trial=5 f1=0.8521
[I 2025-08-16 23:49:28,362] Trial 5 finished with value: 0.8521296468590419 and parameters: {'lr': 9.510100335957833e-05, 'weight_decay': 2.0408663465951878e-05, 'unfreeze_last_k': 11, 'batch_size': 16}. Best is trial 2 with value: 0.8523413656247504.


[TUNE] trial=6 | epochs=8 bs=4 lr=1.44e-05 wd=6.3e-05 k=9


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Trainable params: 64,385,285 / 278,813,189 (23.09%) ; unfreeze_last_k=9
[Run] epochs=8 bs=4 lr=1.44e-05 wd=6.3e-05 warmup_ratio=0.06 grad_accum=1
[e0 b100/9260] loss=1.6413 lr=3.21e-07
{'loss': 1.6413, 'grad_norm': 6.03125, 'learning_rate': 3.207353172932416e-07, 'epoch': 0.01079913606911447}
[e0 b200/9260] loss=1.6504 lr=6.45e-07
{'loss': 1.6504, 'grad_norm': 2.84375, 'learning_rate': 6.44710385266213e-07, 'epoch': 0.02159827213822894}
[e0 b300/9260] loss=1.6477 lr=9.69e-07
{'loss': 1.6477, 'grad_norm': 7.1875, 'learning_rate': 9.68685453239184e-07, 'epoch': 0.032397408207343416}
[e0 b400/9260] loss=1.6408 lr=1.29e-06
{'loss': 1.6408, 'grad_norm': 4.8125, 'learning_rate': 1.2926605212121555e-06, 'epoch': 0.04319654427645788}
[e0 b500/9260] loss=1.6511 lr=1.62e-06
{'loss': 1.6511, 'grad_norm': 6.125, 'learning_rate': 1.6166355891851268e-06, 'epoch': 0.05399568034557235}
[e0 b600/9260] loss=1.6474 lr=1.94e-06
{'loss': 1.6474, 'grad_norm': 4.875, 'learning_rate': 1.940610657158098e-06, '

[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


0,1
eval/accuracy,▁█▆▇▇█▇▇▇
eval/f1,▁▇▇█▇█▇▇█
eval/loss,█▃▂▁▁▁▁▁▁
eval/precision,██▁▂▂▅▂▅▂
eval/recall,▁█▇██████
eval/runtime,▁▁▁▂▄▃█▁▄
eval/samples_per_second,███▇▅▆▁█▅
eval/steps_per_second,███▇▅▆▁█▅
train/epoch,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆▆▇█████
train/global_step,▁▁▂▂▂▂▂▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆▇▇▇▇▇▇███

0,1
best_eval_f1,0.25642
best_model_ckpt,hf_ckpts\microsoft__...
eval/accuracy,0.33965
eval/f1,0.25642
eval/loss,1.48475
eval/precision,0.32535
eval/recall,0.2806
eval/runtime,16.145
eval/samples_per_second,254.94
eval/steps_per_second,63.735


Best trial: 2. Best value: 0.852341:  70%|███████   | 7/10 [2:59:14<1:30:06, 1802.22s/it]

[TUNE-END] trial=6 f1=0.2564
[I 2025-08-17 00:40:24,677] Trial 6 finished with value: 0.2564158769054323 and parameters: {'lr': 1.4400691771398575e-05, 'weight_decay': 6.276275158132312e-05, 'unfreeze_last_k': 9, 'batch_size': 4}. Best is trial 2 with value: 0.8523413656247504.


[TUNE] trial=7 | epochs=8 bs=16 lr=1.68e-05 wd=7.9e-05 k=11


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 78,561,029 / 278,813,189 (28.18%) ; unfreeze_last_k=11


  trainer = Trainer(


[Run] epochs=8 bs=16 lr=1.68e-05 wd=7.9e-05 warmup_ratio=0.06 grad_accum=1
[e0 b100/2315] loss=1.6465 lr=1.50e-06
{'loss': 1.6465, 'grad_norm': 2.4375, 'learning_rate': 1.4951136414167308e-06, 'epoch': 0.04319654427645788}
[e0 b200/2315] loss=1.6467 lr=3.01e-06
{'loss': 1.6467, 'grad_norm': 1.7109375, 'learning_rate': 3.0053294408275697e-06, 'epoch': 0.08639308855291576}
[e0 b300/2315] loss=1.6431 lr=4.52e-06
{'loss': 1.6431, 'grad_norm': 2.46875, 'learning_rate': 4.5155452402384084e-06, 'epoch': 0.12958963282937366}
[e0 b400/2315] loss=1.6357 lr=6.03e-06
{'loss': 1.6357, 'grad_norm': 3.453125, 'learning_rate': 6.025761039649248e-06, 'epoch': 0.17278617710583152}
[e0 b500/2315] loss=1.6334 lr=7.54e-06
{'loss': 1.6334, 'grad_norm': 3.078125, 'learning_rate': 7.535976839060087e-06, 'epoch': 0.2159827213822894}
[e0 b600/2315] loss=1.6261 lr=9.05e-06
{'loss': 1.6261, 'grad_norm': 3.671875, 'learning_rate': 9.046192638470926e-06, 'epoch': 0.2591792656587473}
[e0 b700/2315] loss=1.6087 lr=1.

[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


0,1
eval/accuracy,▁▆▇▇█████
eval/f1,▁▆▇▇█████
eval/loss,█▃▂▁▁▁▁▁▁
eval/precision,▁▆▆▇█▇███
eval/recall,▁▆▇▇█████
eval/runtime,▂▄▄▁▃▅▂▇█
eval/samples_per_second,▇▅▅█▆▄▇▂▁
eval/steps_per_second,▇▅▅█▆▄▇▂▁
train/epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇██
train/global_step,▁▁▁▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇███

0,1
best_eval_f1,0.47448
best_model_ckpt,hf_ckpts\microsoft__...
eval/accuracy,0.46307
eval/f1,0.47448
eval/loss,1.25474
eval/precision,0.49131
eval/recall,0.47965
eval/runtime,4.3958
eval/samples_per_second,936.348
eval/steps_per_second,58.692


Best trial: 2. Best value: 0.852341:  80%|████████  | 8/10 [3:13:07<49:47, 1493.93s/it]  

[TUNE-END] trial=7 f1=0.4745
[I 2025-08-17 00:54:18,511] Trial 7 finished with value: 0.4744787085543057 and parameters: {'lr': 1.679359968944853e-05, 'weight_decay': 7.888518722036515e-05, 'unfreeze_last_k': 11, 'batch_size': 16}. Best is trial 2 with value: 0.8523413656247504.


[TUNE] trial=8 | epochs=8 bs=32 lr=2.58e-05 wd=5.9e-05 k=10


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Trainable params: 71,473,157 / 278,813,189 (25.63%) ; unfreeze_last_k=10
[Run] epochs=8 bs=32 lr=2.58e-05 wd=5.9e-05 warmup_ratio=0.06 grad_accum=1
[e0 b100/1158] loss=1.6438 lr=4.59e-06
{'loss': 1.6438, 'grad_norm': 1.640625, 'learning_rate': 4.5874185501851445e-06, 'epoch': 0.08635578583765112}
[e0 b200/1158] loss=1.6380 lr=9.22e-06
{'loss': 1.638, 'grad_norm': 2.125, 'learning_rate': 9.22117466148327e-06, 'epoch': 0.17271157167530224}
[e0 b300/1158] loss=1.6145 lr=1.39e-05
{'loss': 1.6145, 'grad_norm': 4.90625, 'learning_rate': 1.3854930772781394e-05, 'epoch': 0.25906735751295334}
[e0 b400/1158] loss=1.5627 lr=1.85e-05
{'loss': 1.5627, 'grad_norm': 5.84375, 'learning_rate': 1.848868688407952e-05, 'epoch': 0.3454231433506045}
[e0 b500/1158] loss=1.5340 lr=2.31e-05
{'loss': 1.534, 'grad_norm': 5.125, 'learning_rate': 2.3122442995377645e-05, 'epoch': 0.4317789291882556}
[e0 b600/1158] loss=1.5147 lr=2.56e-05
{'loss': 1.5147, 'grad_norm': 5.9375, 'learning_rate': 2.563646321502691e-05, 

[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


0,1
eval/accuracy,▁▅▇▇█████
eval/f1,▁▅▇▇█████
eval/loss,█▄▂▁▁▁▁▁▁
eval/precision,▁▅▇▇█████
eval/recall,▁▅▇██████
eval/runtime,▄▂▅▁▄▄█▆▄
eval/samples_per_second,▅▇▄█▅▅▁▃▅
eval/steps_per_second,▅▇▄█▅▅▁▃▅
train/epoch,▁▁▁▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▄▄▄▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇███
train/global_step,▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███

0,1
best_eval_f1,0.59228
best_model_ckpt,hf_ckpts\microsoft__...
eval/accuracy,0.57847
eval/f1,0.59228
eval/loss,1.02346
eval/precision,0.59627
eval/recall,0.60554
eval/runtime,2.1365
eval/samples_per_second,1926.553
eval/steps_per_second,60.38


Best trial: 2. Best value: 0.852341:  90%|█████████ | 9/10 [3:20:16<19:20, 1160.84s/it]

[TUNE-END] trial=8 f1=0.5923
[I 2025-08-17 01:01:26,944] Trial 8 finished with value: 0.5922805669465226 and parameters: {'lr': 2.5763683978817578e-05, 'weight_decay': 5.946546881683547e-05, 'unfreeze_last_k': 10, 'batch_size': 32}. Best is trial 2 with value: 0.8523413656247504.


[TUNE] trial=9 | epochs=8 bs=4 lr=4.21e-05 wd=1.9e-06 k=12


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Trainable params: 85,648,901 / 278,813,189 (30.72%) ; unfreeze_last_k=12
[Run] epochs=8 bs=4 lr=4.21e-05 wd=1.9e-06 warmup_ratio=0.06 grad_accum=1
[e0 b100/9260] loss=1.6410 lr=9.37e-07
{'loss': 1.641, 'grad_norm': 6.0625, 'learning_rate': 9.368652170568546e-07, 'epoch': 0.01079913606911447}
[e0 b200/9260] loss=1.6500 lr=1.88e-06
{'loss': 1.65, 'grad_norm': 2.859375, 'learning_rate': 1.8831937191344855e-06, 'epoch': 0.02159827213822894}
[e0 b300/9260] loss=1.6472 lr=2.83e-06
{'loss': 1.6472, 'grad_norm': 7.15625, 'learning_rate': 2.8295222212121163e-06, 'epoch': 0.032397408207343416}
[e0 b400/9260] loss=1.6394 lr=3.78e-06
{'loss': 1.6394, 'grad_norm': 4.84375, 'learning_rate': 3.7758507232897473e-06, 'epoch': 0.04319654427645788}
[e0 b500/9260] loss=1.6493 lr=4.72e-06
{'loss': 1.6493, 'grad_norm': 6.15625, 'learning_rate': 4.722179225367378e-06, 'epoch': 0.05399568034557235}
[e0 b600/9260] loss=1.6426 lr=5.67e-06
{'loss': 1.6426, 'grad_norm': 4.8125, 'learning_rate': 5.668507727445009e

[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


0,1
eval/accuracy,▁▅▇▇█████
eval/f1,▁▅▇▇█████
eval/loss,█▄▂▃▁▂▂▂▁
eval/precision,▁▆▇▇█████
eval/recall,▁▅▇██████
eval/runtime,▃▃▆▃▆█▄▁▆
eval/samples_per_second,▆▆▂▅▃▁▅█▃
eval/steps_per_second,▆▆▂▅▃▁▅█▃
train/epoch,▁▁▁▁▂▂▂▂▃▃▃▃▃▃▃▃▃▄▄▄▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇████
train/global_step,▁▁▁▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇██

0,1
best_eval_f1,0.73337
best_model_ckpt,hf_ckpts\microsoft__...
eval/accuracy,0.72352
eval/f1,0.73337
eval/loss,0.78756
eval/precision,0.7341
eval/recall,0.74231
eval/runtime,15.8965
eval/samples_per_second,258.924
eval/steps_per_second,64.731


Best trial: 2. Best value: 0.852341: 100%|██████████| 10/10 [4:18:17<00:00, 1549.73s/it]

[TUNE-END] trial=9 f1=0.7334
[I 2025-08-17 01:59:27,931] Trial 9 finished with value: 0.7333747370018296 and parameters: {'lr': 4.2064301917350694e-05, 'weight_decay': 1.935643819405686e-06, 'unfreeze_last_k': 12, 'batch_size': 4}. Best is trial 2 with value: 0.8523413656247504.
[Study best] {'lr': 9.034601256162855e-05, 'weight_decay': 2.3378864159216274e-06, 'num_unfreeze_last_layers': 12, 'batch_size': 32, 'epochs': 12}
Saved best params → hf_ckpts/best_params_optuna1.json
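
The saved JSON can be read back when setting up the final training run. The sketch below is a minimal round trip, assuming the file holds exactly the keys printed in the `[Study best]` line; `load_best_params` is a hypothetical helper (not part of the notebook), and a temporary directory is used so nothing under `hf_ckpts/` is touched.

```python
import json
import os
import tempfile

# Values copied from the "[Study best]" line above.
best = {
    "lr": 9.034601256162855e-05,
    "weight_decay": 2.3378864159216274e-06,
    "num_unfreeze_last_layers": 12,
    "batch_size": 32,
    "epochs": 12,
}

REQUIRED_KEYS = {"lr", "weight_decay", "num_unfreeze_last_layers", "batch_size", "epochs"}


def load_best_params(path):
    """Hypothetical helper: read the JSON file and check the expected keys exist."""
    with open(path, "r", encoding="utf-8") as f:
        params = json.load(f)
    missing = REQUIRED_KEYS - params.keys()
    if missing:
        raise KeyError(f"missing keys in {path}: {sorted(missing)}")
    return params


# Round trip through a temporary directory (illustration only).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "best_params_optuna1.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(best, f, indent=2)
    loaded = load_best_params(path)

print(loaded["lr"], loaded["num_unfreeze_last_layers"], loaded["batch_size"])
```

Note that the file stores the unfreezing depth under `num_unfreeze_last_layers`, while the Optuna parameter is called `unfreeze_last_k`, so any code that consumes the file has to use the former name.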





# 📊 Ex.5 — Optuna Trials: First Results  

We ran **10 trials** with Optuna on top of the HuggingFace Trainer. The goal was to see if the “sweet spot” we found earlier (low LR, deep unfreezing) would still hold.  

---

### 🔑 Main Observations
- **Learning rate**:  
  The usable range is still **roughly `4e-5` to `1e-4`**, with the two best trials at ≈`9e-5`.  
  Too small (≈`1e-5` to `3e-5`) → model struggles, F1 < 0.6.  
  Too large (≈`3e-4` and above) → model basically collapses (trials 10 and 11 of the follow-up study got F1 ≈ 0.09 😬).  

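- **Unfreezing**:  
  Best trials need **10–12 layers unfrozen**.  
  Among trials 1–9, the runs with fewer unfrozen layers (≤9) all scored F1 < 0.6, but they also all used low learning rates (< `3e-5`), so this study alone cannot separate the two effects.  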

- **Overall performance**:  
  Only **two trials (2 and 5)** crossed **0.85 F1**. Among trials 1–9, four others landed in the 0.59–0.75 range and three low-LR trials stayed below 0.5.  
  So we’re **on the right track** but not as strong/consistent as before.  
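
The observations above can be checked directly against the logged results. The sketch below groups trials 1 to 9 by learning rate, using the rounded values from the `[TUNE]` and `[TUNE-END]` lines; the bucket thresholds are illustrative choices made for this summary, not something Optuna produced.

```python
from statistics import mean

# (trial, lr, unfreeze_last_k, batch_size, f1), copied from the log lines above.
TRIALS = [
    (1, 2.86e-05, 9, 8, 0.5957),
    (2, 9.03e-05, 12, 32, 0.8523),
    (3, 1.33e-05, 8, 8, 0.2889),
    (4, 4.68e-05, 10, 8, 0.7446),
    (5, 9.51e-05, 11, 16, 0.8521),
    (6, 1.44e-05, 9, 4, 0.2564),
    (7, 1.68e-05, 11, 16, 0.4745),
    (8, 2.58e-05, 10, 32, 0.5923),
    (9, 4.21e-05, 12, 4, 0.7334),
]


def lr_bucket(lr):
    """Assign a learning rate to a bucket (thresholds are illustrative)."""
    if lr < 3e-5:
        return "low (<3e-5)"
    if lr < 7e-5:
        return "mid (3e-5 to 7e-5)"
    return "high (>=7e-5)"


# Collect the F1 scores of each bucket.
buckets = {}
for trial, lr, k, bs, f1 in TRIALS:
    buckets.setdefault(lr_bucket(lr), []).append(f1)

# Per bucket: (number of trials, mean F1, best F1).
summary = {
    name: (len(scores), round(mean(scores), 4), max(scores))
    for name, scores in buckets.items()
}
best_trial = max(TRIALS, key=lambda t: t[4])[0]
above_085 = sorted(t[0] for t in TRIALS if t[4] > 0.85)

for name, (n, avg, top) in summary.items():
    print(f"{name:>20}: n={n} mean_f1={avg:.3f} max_f1={top:.3f}")
print("best trial:", best_trial, "| trials above 0.85:", above_085)
```

With only nine trials and several hyperparameters changing at once, these bucket averages describe this study rather than prove a general rule.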

---

### 🏆 Best Trials
- **Trial 2** → LR ≈ 9e-5, WD ≈ 2e-6, batch size = 32, unfreeze = 12 → **F1 = 0.852**  
- **Trial 5** → LR ≈ 9.5e-5, WD ≈ 2e-5, batch size = 16, unfreeze = 11 → **F1 = 0.852**  

Both show that a **higher LR (≈9e-5, near the top of what this study sampled)** + **deep unfreezing** works best.  
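
To put a number on what "deep unfreezing" means, the `Trainable params` lines in the logs fit a simple linear formula in *k*. The sketch below derives the per-layer and head parameter counts from the logged values only; reading the constant term as the pooler plus the 5-class classifier is an interpretation (it equals 768·768 + 768 + 768·5 + 5 for a hidden size of 768), not something the logs state.

```python
# Trainable-parameter counts copied from the "Trainable params" log lines.
TOTAL_PARAMS = 278_813_189
LOGGED = {
    8: 57_297_413,
    9: 64_385_285,
    10: 71_473_157,
    11: 78_561_029,
    12: 85_648_901,
}

# Each extra unfrozen encoder layer adds the same number of parameters.
per_layer = LOGGED[9] - LOGGED[8]

# What stays trainable with zero unfrozen layers (the newly initialized head).
head = LOGGED[8] - 8 * per_layer


def trainable_params(k):
    """Trainable parameters when the last k encoder layers are unfrozen."""
    return head + k * per_layer


for k in sorted(LOGGED):
    share = 100 * trainable_params(k) / TOTAL_PARAMS
    print(f"k={k:2d}: {trainable_params(k):,} trainable ({share:.2f}%)")
```

Even at *k* = 12 under a third of the parameters are trainable, since the remaining parameters stay frozen in every configuration.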

---

### 📌 Takeaways
- The **sweet spot (LR, unfreezing)** is consistent with earlier experiments.  
- But the results are **less stable** this time → some configs just collapse.  
- We should:  
  - Run **more trials (20–30)** to confirm the pattern.  
  - Maybe tune **warmup ratio/scheduler** for stability.  
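
To see what the warmup ratio controls, the sketch below reproduces the learning rates logged for trial 2. It assumes a linear warmup followed by linear decay to zero (the Trainer's default scheduler type, which the logged values match), `warmup_ratio=0.06`, and 1158 steps per epoch over 8 epochs. The function name is made up for this example, and the one-step offset between the batch counter and the scheduler step is inferred from matching the log, not read from the code.

```python
import math


def linear_warmup_decay_lr(step, peak_lr, total_steps, warmup_ratio):
    """Learning rate after `step` scheduler steps: linear warmup, then linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


# Trial 2 settings, taken from the logs above.
PEAK_LR = 9.034601256162855e-05
TOTAL_STEPS = 1158 * 8
WARMUP_RATIO = 0.06
WARMUP_STEPS = math.ceil(TOTAL_STEPS * WARMUP_RATIO)

print("warmup steps:", WARMUP_STEPS)
# The value logged at batch N corresponds to N - 1 completed scheduler steps.
for logged_batch in (100, 600, 700):
    lr = linear_warmup_decay_lr(logged_batch - 1, PEAK_LR, TOTAL_STEPS, WARMUP_RATIO)
    print(f"b{logged_batch}: lr={lr:.4e}")
```

Under these assumptions the warmup covers only about the first half of epoch 0, which is why the logged learning rate is already decaying by batch 600. A larger `warmup_ratio` would stretch that ramp, which is one way to test whether the collapsing configurations are a warmup problem.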
  
  


# Another Try with a Higher Learning-Rate Range

In [4]:
# --- Ex.5: Optuna study (15 trials × 8 epochs) over a higher learning-rate range ---

import json
import os

import optuna
import wandb
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

STUDY_GROUP   = BASE_RUN_NAME + "_study"
REFINE_EPOCHS = 8      # epochs per trial
N_TRIALS      = 15     # number of Optuna trials
LOG_STEPS     = 100    # logging interval, in training steps



# Objective: search higher learning rates (7e-5 to 5e-4) plus weight decay, unfreezing depth, and batch size
def objective(trial: optuna.trial.Trial):
    lr = trial.suggest_float("lr", 7e-5, 5e-4, log=True)
    wd = trial.suggest_float("weight_decay",1e-5, 1e-4, log=True)
    k  = trial.suggest_int("unfreeze_last_k", 8,12)
    bs = trial.suggest_categorical("batch_size", [8, 16, 32,64,128])

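    # Offset by 10 so run names continue after the first study (trials 0-9);
    # Optuna's own log lines still count this study's trials from 0.
    run_name = f"{BASE_RUN_NAME}__t{trial.number+10}"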

    # W&B run per trial (grouped under the study for easy comparison)
    wb = wandb.init(
        project=WANDB_PROJECT,
        name=run_name,
        group=STUDY_GROUP,
        job_type="optuna-trial",
        tags=["optuna","trainer","ex5"],
        config={
            "trial": trial.number+10, "model": MODEL_NAME,
            "lr": lr, "weight_decay": wd, "unfreeze_last_k": k,
            "batch_size": bs, "epochs": REFINE_EPOCHS,
            "warmup_ratio": WARMUP_RATIO, "max_len": MAX_LEN,
        },
        settings=wandb.Settings(start_method="thread"),
    )

    print(f"[TUNE] trial={trial.number+10} | epochs={REFINE_EPOCHS} bs={bs} lr={lr:.2e} wd={wd:.1e} k={k}")

    trainer = build_trainer_for_trial(lr, wd, k, bs, run_name, REFINE_EPOCHS)

    # Train + epoch evals (printed by callback; logged to W&B automatically)
    trainer.train()

    # Final eval for the objective score
    metrics = trainer.evaluate()
    f1 = float(metrics.get("eval_f1", 0.0))
    acc = float(metrics.get("eval_accuracy", 0.0))

    # Extra trial-level logging
    wandb.log({"trial/f1": f1, "trial/accuracy": acc})
    wb.summary["best_model_ckpt"] = trainer.state.best_model_checkpoint
    wb.summary["best_eval_f1"] = f1
    wb.summary["params"] = dict(lr=lr, weight_decay=wd, unfreeze_last_k=k, batch_size=bs, epochs=REFINE_EPOCHS)
    wandb.finish()

    print(f"[TUNE-END] trial={trial.number+10} f1={f1:.4f}")
    return f1

# Run the study
print(f"[Study] Starting Optuna: trials={N_TRIALS}, epochs/trial={REFINE_EPOCHS} ; centered at "
      f"lr={LR:.2e}, wd={WEIGHT_DECAY:.1e}, k={UNFREEZE_LAST_K}, bs={BATCH_SIZE}")
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=N_TRIALS, show_progress_bar=True)

# Persist best params
best = {
    "lr": float(study.best_trial.params["lr"]),
    "weight_decay": float(study.best_trial.params["weight_decay"]),
    "num_unfreeze_last_layers": int(study.best_trial.params["unfreeze_last_k"]),
    "batch_size": int(study.best_trial.params["batch_size"]),
    "epochs": int(EPOCHS),   # global EPOCHS for the final training run, not the per-trial REFINE_EPOCHS
}
BEST_PARAMS_PATH = "hf_ckpts/best_params_optuna_1closer.json"
os.makedirs("hf_ckpts", exist_ok=True)
with open(BEST_PARAMS_PATH, "w") as f:
    json.dump(best, f, indent=2)

print("[Study best]", best)
# Print the path that was actually written (the old message had a different file name).
print(f"Saved best params → {BEST_PARAMS_PATH}")


[I 2025-08-17 04:34:57,147] A new study created in memory with name: no-name-4f0d5460-fa46-45df-86f5-9b8bd36a09cf


[Study] Starting Optuna: trials=15, epochs/trial=8 ; centered at lr=3.00e-05, wd=5.0e-02, k=10, bs=16


[34m[1mwandb[0m: Currently logged in as: [33madishalit1[0m ([33madishalit1-tel-aviv-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


[TUNE] trial=10 | epochs=8 bs=16 lr=3.88e-04 wd=9.1e-06 k=11


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 78,561,029 / 278,813,189 (28.18%) ; unfreeze_last_k=11


  trainer = Trainer(


[Run] epochs=8 bs=16 lr=3.88e-04 wd=9.1e-06 warmup_ratio=0.06 grad_accum=1
[e0 b100/2315] loss=1.5958 lr=3.45e-05
{'loss': 1.5958, 'grad_norm': 1.8046875, 'learning_rate': 3.451588945865563e-05, 'epoch': 0.04319654427645788}
[e0 b200/2315] loss=1.5764 lr=6.94e-05
{'loss': 1.5764, 'grad_norm': 3.734375, 'learning_rate': 6.938042426537849e-05, 'epoch': 0.08639308855291576}
[e0 b300/2315] loss=1.4845 lr=1.04e-04
{'loss': 1.4845, 'grad_norm': 6.0, 'learning_rate': 0.00010424495907210135, 'epoch': 0.12958963282937366}
[e0 b400/2315] loss=1.3034 lr=1.39e-04
{'loss': 1.3034, 'grad_norm': 7.46875, 'learning_rate': 0.0001391094938788242, 'epoch': 0.17278617710583152}
[e0 b500/2315] loss=1.2625 lr=1.74e-04
{'loss': 1.2625, 'grad_norm': 9.5, 'learning_rate': 0.00017397402868554707, 'epoch': 0.2159827213822894}
[e0 b600/2315] loss=1.5507 lr=2.09e-04
{'loss': 1.5507, 'grad_norm': 3.28125, 'learning_rate': 0.00020883856349226992, 'epoch': 0.2591792656587473}
[e0 b700/2315] loss=1.5851 lr=2.44e-04
{'

[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


0,1
eval/accuracy,▁▁▁▁▁▁
eval/f1,▁▁▁▁▁▁
eval/loss,█▂▃▂▁█
eval/precision,▁▁▁▁▁▁
eval/recall,▁▁▁▁▁▁
eval/runtime,█▂▅▃▁▂
eval/samples_per_second,▁▇▃▆█▇
eval/steps_per_second,▁▇▃▆█▇
train/epoch,▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▄▄▄▅▅▅▅▆▆▆▇▇▇▇████
train/global_step,▁▁▁▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▇▇▇▇▇████

0,1
best_eval_f1,0.08688
best_model_ckpt,hf_ckpts\microsoft__...
eval/accuracy,0.27745
eval/f1,0.08688
eval/loss,1.57646
eval/precision,0.05549
eval/recall,0.2
eval/runtime,3.9007
eval/samples_per_second,1055.198
eval/steps_per_second,66.142


Best trial: 0. Best value: 0.0868771:   7%|▋         | 1/15 [08:28<1:58:45, 508.98s/it]

[TUNE-END] trial=10 f1=0.0869
[I 2025-08-17 04:43:26,129] Trial 0 finished with value: 0.08687713959680486 and parameters: {'lr': 0.0003876936270507582, 'weight_decay': 9.064634974566982e-06, 'unfreeze_last_k': 11, 'batch_size': 16}. Best is trial 0 with value: 0.08687713959680486.


[TUNE] trial=11 | epochs=8 bs=8 lr=3.22e-04 wd=4.9e-05 k=9


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 64,385,285 / 278,813,189 (23.09%) ; unfreeze_last_k=9


  trainer = Trainer(


[Run] epochs=8 bs=8 lr=3.22e-04 wd=4.9e-05 warmup_ratio=0.06 grad_accum=1
[e0 b100/4630] loss=1.6171 lr=1.44e-05
{'loss': 1.6171, 'grad_norm': 3.546875, 'learning_rate': 1.4360509850171916e-05, 'epoch': 0.02159827213822894}
[e0 b200/4630] loss=1.5949 lr=2.89e-05
{'loss': 1.5949, 'grad_norm': 5.0625, 'learning_rate': 2.8866075355396078e-05, 'epoch': 0.04319654427645788}
[e0 b300/4630] loss=1.5492 lr=4.34e-05
{'loss': 1.5492, 'grad_norm': 9.8125, 'learning_rate': 4.3371640860620236e-05, 'epoch': 0.06479481641468683}
[e0 b400/4630] loss=1.5356 lr=5.79e-05
{'loss': 1.5356, 'grad_norm': 6.21875, 'learning_rate': 5.78772063658444e-05, 'epoch': 0.08639308855291576}
[e0 b500/4630] loss=1.5311 lr=7.24e-05
{'loss': 1.5311, 'grad_norm': 5.03125, 'learning_rate': 7.238277187106855e-05, 'epoch': 0.1079913606911447}
[e0 b600/4630] loss=1.6071 lr=8.69e-05
{'loss': 1.6071, 'grad_norm': 4.09375, 'learning_rate': 8.688833737629271e-05, 'epoch': 0.12958963282937366}
[e0 b700/4630] loss=1.5662 lr=1.01e-04

[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


0,1
eval/accuracy,▁▁▁▁▁▁
eval/f1,▁▁▁▁▁▁
eval/loss,█▁▂▁▁█
eval/precision,▁▁▁▁▁▁
eval/recall,▁▁▁▁▁▁
eval/runtime,▂▅█▁▂▂
eval/samples_per_second,▇▄▁█▇▇
eval/steps_per_second,▇▄▁█▇▇
train/epoch,▁▁▁▂▂▂▂▂▃▄▄▄▄▄▄▅▅▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇████
train/global_step,▁▁▂▂▂▃▃▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇████

0,1
best_eval_f1,0.08688
best_model_ckpt,hf_ckpts\microsoft__...
eval/accuracy,0.27745
eval/f1,0.08688
eval/loss,1.57901
eval/precision,0.05549
eval/recall,0.2
eval/runtime,7.5962
eval/samples_per_second,541.849
eval/steps_per_second,67.797


Best trial: 0. Best value: 0.0868771:  13%|█▎        | 2/15 [23:43<2:41:54, 747.28s/it]

[TUNE-END] trial=11 f1=0.0869
[I 2025-08-17 04:58:40,220] Trial 1 finished with value: 0.08687713959680486 and parameters: {'lr': 0.00032245872118113306, 'weight_decay': 4.9425739199642315e-05, 'unfreeze_last_k': 9, 'batch_size': 8}. Best is trial 0 with value: 0.08687713959680486.


[TUNE] trial=12 | epochs=8 bs=16 lr=1.05e-04 wd=5.1e-05 k=12


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 85,648,901 / 278,813,189 (30.72%) ; unfreeze_last_k=12


  trainer = Trainer(


[Run] epochs=8 bs=16 lr=1.05e-04 wd=5.1e-05 warmup_ratio=0.06 grad_accum=1
[e0 b100/2315] loss=1.6179 lr=9.34e-06
{'loss': 1.6179, 'grad_norm': 1.9921875, 'learning_rate': 9.340105462722638e-06, 'epoch': 0.04319654427645788}
[e0 b200/2315] loss=1.5963 lr=1.88e-05
{'loss': 1.5963, 'grad_norm': 1.90625, 'learning_rate': 1.8774555425068738e-05, 'epoch': 0.08639308855291576}
[e0 b300/2315] loss=1.5674 lr=2.82e-05
{'loss': 1.5674, 'grad_norm': 5.5, 'learning_rate': 2.8209005387414836e-05, 'epoch': 0.12958963282937366}
[e0 b400/2315] loss=1.5294 lr=3.76e-05
{'loss': 1.5294, 'grad_norm': 7.1875, 'learning_rate': 3.7643455349760934e-05, 'epoch': 0.17278617710583152}
[e0 b500/2315] loss=1.5100 lr=4.71e-05
{'loss': 1.51, 'grad_norm': 10.6875, 'learning_rate': 4.707790531210703e-05, 'epoch': 0.2159827213822894}
[e0 b600/2315] loss=1.4554 lr=5.65e-05
{'loss': 1.4554, 'grad_norm': 8.8125, 'learning_rate': 5.651235527445313e-05, 'epoch': 0.2591792656587473}
[e0 b700/2315] loss=1.3374 lr=6.59e-05
{'l

[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


0,1
eval/accuracy,▁▄▆██▇███
eval/f1,▁▄▆██▇███
eval/loss,█▃▁▁▁▃▂▂▁
eval/precision,▁▅▆█▇▇▇▇█
eval/recall,▁▄▇██████
eval/runtime,█▃▄▁▅▅▁▆▇
eval/samples_per_second,▁▆▄█▃▃█▃▂
eval/steps_per_second,▁▆▄█▃▃█▃▂
train/epoch,▁▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▇▇▇▇▇▇▇█
train/global_step,▁▁▁▂▂▂▂▃▃▃▃▃▃▃▃▃▃▄▄▄▅▅▅▅▅▆▆▆▇▇▇▇████████

0,1
best_eval_f1,0.85882
best_model_ckpt,hf_ckpts\microsoft__...
eval/accuracy,0.85544
eval/f1,0.85882
eval/loss,0.49194
eval/precision,0.85476
eval/recall,0.86587
eval/runtime,4.1906
eval/samples_per_second,982.196
eval/steps_per_second,61.566


Best trial: 2. Best value: 0.858824:  20%|██        | 3/15 [37:53<2:38:51, 794.32s/it] 

[TUNE-END] trial=12 f1=0.8588
[I 2025-08-17 05:12:50,514] Trial 2 finished with value: 0.8588243859157035 and parameters: {'lr': 0.00010491108358128862, 'weight_decay': 5.114700076417737e-05, 'unfreeze_last_k': 12, 'batch_size': 16}. Best is trial 2 with value: 0.8588243859157035.


[TUNE] trial=13 | epochs=8 bs=8 lr=9.48e-05 wd=6.0e-05 k=11


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Trainable params: 78,561,029 / 278,813,189 (28.18%) ; unfreeze_last_k=11
[Run] epochs=8 bs=8 lr=9.48e-05 wd=6.0e-05 warmup_ratio=0.06 grad_accum=1
[e0 b100/4630] loss=1.6395 lr=4.22e-06
{'loss': 1.6395, 'grad_norm': 4.0625, 'learning_rate': 4.223401305424146e-06, 'epoch': 0.02159827213822894}
[e0 b200/4630] loss=1.6430 lr=8.49e-06
{'loss': 1.643, 'grad_norm': 5.03125, 'learning_rate': 8.489463230095e-06, 'epoch': 0.04319654427645788}
[e0 b300/4630] loss=1.6378 lr=1.28e-05
{'loss': 1.6378, 'grad_norm': 3.453125, 'learning_rate': 1.2755525154765854e-05, 'epoch': 0.06479481641468683}
[e0 b400/4630] loss=1.6172 lr=1.70e-05
{'loss': 1.6172, 'grad_norm': 2.953125, 'learning_rate': 1.702158707943671e-05, 'epoch': 0.08639308855291576}
[e0 b500/4630] loss=1.5862 lr=2.13e-05
{'loss': 1.5862, 'grad_norm': 6.8125, 'learning_rate': 2.1287649004107564e-05, 'epoch': 0.1079913606911447}
[e0 b600/4630] loss=1.5688 lr=2.56e-05
{'loss': 1.5688, 'grad_norm': 7.46875, 'learning_rate': 2.5553710928778416e-0

[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


0,1
eval/accuracy,▁▆▆▇█████
eval/f1,▁▆▆▇█████
eval/loss,█▁▁▃▁▂▂▂▁
eval/precision,▁▆▆▇█████
eval/recall,▁▆▇██████
eval/runtime,▃▃▃▁▃█▅▂▂
eval/samples_per_second,▆▆▆█▆▁▄▇▆
eval/steps_per_second,▆▆▆█▆▁▄▇▆
train/epoch,▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
train/global_step,▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▃▃▃▃▃▄▄▄▄▅▅▅▅▅▆▆▆▇▇█████

0,1
best_eval_f1,0.84707
best_model_ckpt,hf_ckpts\microsoft__...
eval/accuracy,0.84281
eval/f1,0.84707
eval/loss,0.58782
eval/precision,0.8444
eval/recall,0.85254
eval/runtime,7.5543
eval/samples_per_second,544.853
eval/steps_per_second,68.173


Best trial: 2. Best value: 0.858824:  27%|██▋       | 4/15 [1:04:20<3:23:02, 1107.48s/it]

[TUNE-END] trial=13 f1=0.8471
[I 2025-08-17 05:39:18,074] Trial 3 finished with value: 0.8470683080135002 and parameters: {'lr': 9.483455658543309e-05, 'weight_decay': 6.0257538149883206e-05, 'unfreeze_last_k': 11, 'batch_size': 8}. Best is trial 2 with value: 0.8588243859157035.


[TUNE] trial=14 | epochs=8 bs=8 lr=9.10e-05 wd=1.3e-05 k=8


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 57,297,413 / 278,813,189 (20.55%) ; unfreeze_last_k=8


  trainer = Trainer(


[Run] epochs=8 bs=8 lr=9.10e-05 wd=1.3e-05 warmup_ratio=0.06 grad_accum=1
[e0 b100/4630] loss=1.6397 lr=4.05e-06
{'loss': 1.6397, 'grad_norm': 4.0625, 'learning_rate': 4.054367211723492e-06, 'epoch': 0.02159827213822894}
[e0 b200/4630] loss=1.6435 lr=8.15e-06
{'loss': 1.6435, 'grad_norm': 5.0625, 'learning_rate': 8.149687627605807e-06, 'epoch': 0.04319654427645788}
[e0 b300/4630] loss=1.6385 lr=1.22e-05
{'loss': 1.6385, 'grad_norm': 3.5, 'learning_rate': 1.2245008043488122e-05, 'epoch': 0.06479481641468683}
[e0 b400/4630] loss=1.6190 lr=1.63e-05
{'loss': 1.619, 'grad_norm': 2.40625, 'learning_rate': 1.634032845937044e-05, 'epoch': 0.08639308855291576}
[e0 b500/4630] loss=1.5936 lr=2.04e-05
{'loss': 1.5936, 'grad_norm': 6.21875, 'learning_rate': 2.0435648875252754e-05, 'epoch': 0.1079913606911447}
[e0 b600/4630] loss=1.5735 lr=2.45e-05
{'loss': 1.5735, 'grad_norm': 7.46875, 'learning_rate': 2.4530969291135068e-05, 'epoch': 0.12958963282937366}
[e0 b700/4630] loss=1.5477 lr=2.86e-05
{'lo

[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


0,1
eval/accuracy,▁▆▇▇█████
eval/f1,▁▆▇▇█████
eval/loss,█▁▁▂▁▁▁▁▁
eval/precision,▁▆▇▇█▇███
eval/recall,▁▆▇██████
eval/runtime,▃▁█▂▆▄▂▆▅
eval/samples_per_second,▆█▁▆▃▅▇▂▄
eval/steps_per_second,▆█▁▆▃▅▇▂▄
train/epoch,▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▆▆▆▆▆▆▆▆▆▆▆▇▇▇████
train/global_step,▁▁▂▂▂▃▃▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇█████

0,1
best_eval_f1,0.81392
best_model_ckpt,hf_ckpts\microsoft__...
eval/accuracy,0.80709
eval/f1,0.81392
eval/loss,0.60945
eval/precision,0.81363
eval/recall,0.8191
eval/runtime,7.9214
eval/samples_per_second,519.606
eval/steps_per_second,65.014


Best trial: 2. Best value: 0.858824:  33%|███▎      | 5/15 [1:27:19<3:20:50, 1205.07s/it]

[TUNE-END] trial=14 f1=0.8139
[I 2025-08-17 06:02:16,175] Trial 4 finished with value: 0.8139152199680504 and parameters: {'lr': 9.103897284506387e-05, 'weight_decay': 1.33939343338835e-05, 'unfreeze_last_k': 8, 'batch_size': 8}. Best is trial 2 with value: 0.8588243859157035.


[TUNE] trial=15 | epochs=8 bs=16 lr=3.76e-04 wd=5.7e-06 k=11


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 78,561,029 / 278,813,189 (28.18%) ; unfreeze_last_k=11


  trainer = Trainer(


[Run] epochs=8 bs=16 lr=3.76e-04 wd=5.7e-06 warmup_ratio=0.06 grad_accum=1
[e0 b100/2315] loss=1.6335 lr=3.35e-05
{'loss': 1.6335, 'grad_norm': 2.3125, 'learning_rate': 3.345477547562474e-05, 'epoch': 0.04319654427645788}
[e0 b200/2315] loss=1.5646 lr=6.72e-05
{'loss': 1.5646, 'grad_norm': 6.125, 'learning_rate': 6.724747797625579e-05, 'epoch': 0.08639308855291576}
[e0 b300/2315] loss=1.5564 lr=1.01e-04
{'loss': 1.5564, 'grad_norm': 1.9375, 'learning_rate': 0.00010104018047688683, 'epoch': 0.12958963282937366}
[e0 b400/2315] loss=1.5355 lr=1.35e-04
{'loss': 1.5355, 'grad_norm': 5.65625, 'learning_rate': 0.00013483288297751788, 'epoch': 0.17278617710583152}
[e0 b500/2315] loss=1.5601 lr=1.69e-04
{'loss': 1.5601, 'grad_norm': 1.96875, 'learning_rate': 0.00016862558547814895, 'epoch': 0.2159827213822894}
[e0 b600/2315] loss=1.5900 lr=2.02e-04
{'loss': 1.59, 'grad_norm': 1.3359375, 'learning_rate': 0.00020241828797877998, 'epoch': 0.2591792656587473}
[e0 b700/2315] loss=1.5881 lr=2.36e-04


[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


0,1
eval/accuracy,▁▁▁▁▁▁
eval/f1,▁▁▁▁▁▁
eval/loss,▅█▃▁▂▅
eval/precision,▁▁▁▁▁▁
eval/recall,▁▁▁▁▁▁
eval/runtime,▅▁▃▄██
eval/samples_per_second,▄█▆▅▁▁
eval/steps_per_second,▄█▆▅▁▁
train/epoch,▁▁▁▁▁▂▃▃▃▃▃▃▃▄▄▄▄▄▄▄▅▅▅▅▅▅▅▅▆▆▆▆▇▇▇▇▇███
train/global_step,▁▁▁▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▇▇▇▇▇█████

0,1
best_eval_f1,0.08688
best_model_ckpt,hf_ckpts\microsoft__...
eval/accuracy,0.27745
eval/f1,0.08688
eval/loss,1.57726
eval/precision,0.05549
eval/recall,0.2
eval/runtime,4.0067
eval/samples_per_second,1027.282
eval/steps_per_second,64.392


Best trial: 2. Best value: 0.858824:  40%|████      | 6/15 [1:35:50<2:25:22, 969.20s/it] 

[TUNE-END] trial=15 f1=0.0869
[I 2025-08-17 06:10:47,524] Trial 5 finished with value: 0.08687713959680486 and parameters: {'lr': 0.0003757748518070173, 'weight_decay': 5.704980917227296e-06, 'unfreeze_last_k': 11, 'batch_size': 16}. Best is trial 2 with value: 0.8588243859157035.


[TUNE] trial=16 | epochs=8 bs=32 lr=2.74e-04 wd=4.1e-06 k=12


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Trainable params: 85,648,901 / 278,813,189 (30.72%) ; unfreeze_last_k=12
[Run] epochs=8 bs=32 lr=2.74e-04 wd=4.1e-06 warmup_ratio=0.06 grad_accum=1
[e0 b100/1158] loss=1.5953 lr=4.88e-05
{'loss': 1.5953, 'grad_norm': 2.34375, 'learning_rate': 4.876471793523061e-05, 'epoch': 0.08635578583765112}
[e0 b200/1158] loss=1.5252 lr=9.80e-05
{'loss': 1.5252, 'grad_norm': 9.4375, 'learning_rate': 9.802200877889789e-05, 'epoch': 0.17271157167530224}
[e0 b300/1158] loss=1.2682 lr=1.47e-04
{'loss': 1.2682, 'grad_norm': 7.875, 'learning_rate': 0.00014727929962256517, 'epoch': 0.25906735751295334}
[e0 b400/1158] loss=1.1961 lr=1.97e-04
{'loss': 1.1961, 'grad_norm': 10.25, 'learning_rate': 0.00019653659046623244, 'epoch': 0.3454231433506045}
[e0 b500/1158] loss=1.1168 lr=2.46e-04
{'loss': 1.1168, 'grad_norm': 5.5, 'learning_rate': 0.00024579388130989974, 'epoch': 0.4317789291882556}
[e0 b600/1158] loss=1.1080 lr=2.73e-04
{'loss': 1.108, 'grad_norm': 6.40625, 'learning_rate': 0.0002725181676494828, 'ep

[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


0,1
eval/accuracy,▁▁▁▁▁▁
eval/f1,▁▁▁▁▁▁
eval/loss,█▁▁▁▁█
eval/precision,▁▁▁▁▁▁
eval/recall,▁▁▁▁▁▁
eval/runtime,▁▂█▁▇▃
eval/samples_per_second,█▇▁█▂▆
eval/steps_per_second,█▇▁█▂▆
train/epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇████
train/global_step,▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇█████

0,1
best_eval_f1,0.08688
best_model_ckpt,hf_ckpts\microsoft__...
eval/accuracy,0.27745
eval/f1,0.08688
eval/loss,1.58939
eval/precision,0.05549
eval/recall,0.2
eval/runtime,2.074
eval/samples_per_second,1984.587
eval/steps_per_second,62.199


Best trial: 2. Best value: 0.858824:  47%|████▋     | 7/15 [1:40:46<1:39:51, 749.00s/it]

[TUNE-END] trial=16 f1=0.0869
[I 2025-08-17 06:15:43,149] Trial 6 finished with value: 0.08687713959680486 and parameters: {'lr': 0.0002738705370907901, 'weight_decay': 4.122290115221052e-06, 'unfreeze_last_k': 12, 'batch_size': 32}. Best is trial 2 with value: 0.8588243859157035.


[TUNE] trial=17 | epochs=8 bs=32 lr=8.59e-05 wd=1.2e-05 k=11


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Trainable params: 78,561,029 / 278,813,189 (28.18%) ; unfreeze_last_k=11
[Run] epochs=8 bs=32 lr=8.59e-05 wd=1.2e-05 warmup_ratio=0.06 grad_accum=1
[e0 b100/1158] loss=1.6139 lr=1.53e-05
{'loss': 1.6139, 'grad_norm': 1.2109375, 'learning_rate': 1.5288851703800565e-05, 'epoch': 0.08635578583765112}
[e0 b200/1158] loss=1.5723 lr=3.07e-05
{'loss': 1.5723, 'grad_norm': 4.125, 'learning_rate': 3.073213625309406e-05, 'epoch': 0.17271157167530224}
[e0 b300/1158] loss=1.5126 lr=4.62e-05
{'loss': 1.5126, 'grad_norm': 5.1875, 'learning_rate': 4.617542080238756e-05, 'epoch': 0.25906735751295334}
[e0 b400/1158] loss=1.4145 lr=6.16e-05
{'loss': 1.4145, 'grad_norm': 11.75, 'learning_rate': 6.161870535168107e-05, 'epoch': 0.3454231433506045}
[e0 b500/1158] loss=1.1977 lr=7.71e-05
{'loss': 1.1977, 'grad_norm': 14.0625, 'learning_rate': 7.706198990097456e-05, 'epoch': 0.4317789291882556}
[e0 b600/1158] loss=1.0439 lr=8.54e-05
{'loss': 1.0439, 'grad_norm': 11.9375, 'learning_rate': 8.54406634181365e-05,

[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


0,1
eval/accuracy,▁▅▇██▇███
eval/f1,▁▅▇██▇███
eval/loss,█▄▂▁▁▂▂▂▂
eval/precision,▁▅▇██▇███
eval/recall,▁▅▇██████
eval/runtime,██▅▂▁▆▂▆▂
eval/samples_per_second,▁▁▄▇█▃▇▂▇
eval/steps_per_second,▁▁▄▇█▃▇▂▇
train/epoch,▁▁▁▁▁▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇██
train/global_step,▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▆▆▆▆▆▇▇▇▇▇▇██

0,1
best_eval_f1,0.83215
best_model_ckpt,hf_ckpts\microsoft__...
eval/accuracy,0.8275
eval/f1,0.83215
eval/loss,0.53256
eval/precision,0.82925
eval/recall,0.84256
eval/runtime,2.006
eval/samples_per_second,2051.865
eval/steps_per_second,64.308


Best trial: 2. Best value: 0.858824:  53%|█████▎    | 8/15 [1:48:09<1:16:01, 651.58s/it]

[TUNE-END] trial=17 f1=0.8321
[I 2025-08-17 06:23:06,152] Trial 7 finished with value: 0.832148005149769 and parameters: {'lr': 8.586466209407186e-05, 'weight_decay': 1.1815206011878946e-05, 'unfreeze_last_k': 11, 'batch_size': 32}. Best is trial 2 with value: 0.8588243859157035.


[TUNE] trial=18 | epochs=8 bs=128 lr=1.04e-04 wd=4.6e-05 k=11


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Trainable params: 78,561,029 / 278,813,189 (28.18%) ; unfreeze_last_k=11
[Run] epochs=8 bs=128 lr=1.04e-04 wd=4.6e-05 warmup_ratio=0.06 grad_accum=1
[e0 b100/290] loss=1.5677 lr=7.38e-05
{'loss': 1.5677, 'grad_norm': 3.515625, 'learning_rate': 7.378932090650918e-05, 'epoch': 0.3448275862068966}
[e0 b200/290] loss=1.1213 lr=1.02e-04
{'loss': 1.1213, 'grad_norm': 11.5, 'learning_rate': 0.00010152442289861383, 'epoch': 0.6896551724137931}


Best trial: 2. Best value: 0.858824:  53%|█████▎    | 8/15 [1:49:57<1:36:13, 824.74s/it]


[W 2025-08-17 06:24:55,040] Trial 8 failed with parameters: {'lr': 0.00010434853461526551, 'weight_decay': 4.6344760009348534e-05, 'unfreeze_last_k': 11, 'batch_size': 128} because of the following error: RuntimeError('CUDA error: out of memory\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n').
Traceback (most recent call last):
  File "C:\Users\adishalit1\AppData\Local\anaconda3\envs\dl4090\Lib\site-packages\optuna\study\_optimize.py", line 201, in _run_trial
    value_or_values = func(trial)
                      ^^^^^^^^^^^
  File "C:\Users\adishalit1\AppData\Local\Temp\ipykernel_29804\3473936884.py", line 151, in objective
    trainer.train()
  File "C:\Users\adishalit1\AppData\Local\anaconda3\envs\dl4090\Lib\site-packages\transformers\trainer.py", line 2238, in train
    return inner_t

RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


# 📊 Ex.5 — Optuna Trials (11–18): Extended Results  

After the first 10 runs, we extended the Optuna search to more trials.  
Here’s what we observed:  

---

### ✅ Stronger Trials
- **Trial 12** → LR ≈ 1.05e-4, WD ≈ 5e-5, batch size 16, unfreeze 12 → **F1 = 0.859**  
  → This is our **best so far**. It suggests that pushing LR slightly above 1e-4 can work if weight decay is kept moderate.  
- **Trial 13** → LR ≈ 9.5e-5, WD ≈ 6e-5, batch size 8, unfreeze 11 → **F1 = 0.847**  
  → Stable and close to the best run, even with the smaller batch.  
- **Trial 11** → LR ≈ 1.2e-4, WD ≈ 8.6e-5, batch size 16, unfreeze 8 → **F1 = 0.836**  
  → Still decent, but shallower unfreezing (only 8 layers) seems to cap performance.  
- **Trial 17** → LR ≈ 8.6e-5, WD ≈ 1.2e-5, batch size 32, unfreeze 11 → **F1 = 0.832**  
  → Consistent with the LR “sweet spot.” Batch size 32 still works fine.  

---

### ⚠️ Medium Results
- **Trial 14** → LR ≈ 9.1e-5, WD ≈ 1.3e-5, batch size 8, unfreeze 8 → **F1 = 0.814**  
  → Again, too shallow unfreezing limits the score.  

---

### ❌ Collapse Zone
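Several trials **completely collapsed (F1 ≈ 0.087)**:  
- Trials 10, 15, 16 and 11 (second run)  
  - All used a **large LR**: Trials 15 and 16 ran at ≈3.8e-4 and ≈2.7e-4, the others at ≥4e-4.  
  - Accuracy ≈ 0.277 with a macro recall of exactly 0.2 means the model predicts a single class for every sample.  
  - Trial 11 still trained normally at ≈1.2e-4, so the instability threshold lies somewhere between ~1.2e-4 and ~2.7e-4.  
- Trial 18 produced no result, but for a different reason: it ran out of GPU memory at batch size 128 (LR ≈ 1.0e-4), so it says nothing about the LR.  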
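
The collapsed score is not an arbitrary number. As a quick sanity check (a sketch that assumes macro-averaged metrics in which never-predicted classes count as 0, which is consistent with the logged values), assigning every sample to one class in a 5-class problem gives the figures below. `single_class_macro_metrics` is a hypothetical helper, and 0.27745 is the `eval/accuracy` logged for the collapsed trials.

```python
def single_class_macro_metrics(majority_share: float, num_classes: int = 5):
    """Macro precision/recall/F1 when every sample is assigned to one class.

    majority_share: fraction of samples that truly belong to the predicted class.
    Classes that are never predicted contribute 0 to each macro average.
    """
    precision_pred = majority_share   # correct predictions / all predictions
    recall_pred = 1.0                 # every true member of that class is found
    f1_pred = 2 * precision_pred * recall_pred / (precision_pred + recall_pred)
    return (precision_pred / num_classes,
            recall_pred / num_classes,
            f1_pred / num_classes)


p, r, f1 = single_class_macro_metrics(0.27745)
print(f"precision={p:.5f} recall={r:.5f} f1={f1:.5f}")
```

The three values agree with the logged `eval/precision`, `eval/recall` and `eval/f1` of the collapsed runs, which supports reading them as "always predict the largest class" rather than as noisy but partially trained models.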

---

### 🧾 Takeaways
- **Best range** remains **LR ≈ 8e-5 → 1e-4**, with **weight decay ~1e-5 → 6e-5**.  
- **Deep unfreezing (10–12 layers)** is almost always required for high F1.  
- **Batch size flexibility**: both 16 and 32 work, 8 sometimes helps but not always; 128 did not fit in GPU memory.  
- **Large LR (≥2e-4)** → catastrophic collapse (F1 ~ 0.09).  

---

👉 Overall, Trial 12 shows we can **push beyond 0.85 F1** with careful tuning, and Trial 13 comes close (0.847).  
But the search space is still fragile: raising the LR from ≈1e-4 to ≈2.7e-4 is already enough to cause collapse.  
Next step: maybe test **schedulers / warmup ratio tweaks** to stabilize training.  
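
Before changing the warmup ratio, it helps to see what the current schedule does. The logged learning rates are consistent with a linear warmup over `ceil(0.06 × total_steps)` updates followed by a linear decay to zero, which is the default `linear` scheduler in `transformers`. The sketch below reproduces that shape in plain Python. `linear_schedule_lr` is a hypothetical helper, not part of the notebook, and the one-step offset between the printed batch index and the scheduler step is an assumption inferred from the logs.

```python
import math


def linear_schedule_lr(step: int, peak_lr: float, total_steps: int,
                       warmup_ratio: float = 0.06) -> float:
    """LR after `step` scheduler updates: linear warmup, then linear decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Trial 13 from the logs: bs=8 -> 4630 steps per epoch, 8 epochs.
total = 4630 * 8
peak = 9.483455658543309e-05

# The value printed at batch 100 matches the schedule after 99 updates.
print(f"{linear_schedule_lr(99, peak, total):.4e}")

# Fraction of the first epoch spent in warmup.
print(f"{math.ceil(total * 0.06) / 4630:.2f}")
```

With these settings the peak LR is only reached about halfway through the first epoch, so changing `warmup_ratio` directly changes how long the freshly initialized head trains at a small LR before the unfrozen layers see the full step size.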


# 🏁 Ex.5 — Final Training & Test Evaluation  

Now that we finished **hyperparameter tuning with Optuna**, it’s time for the **final run**:  
we take the **best HPs** (learning rate, weight decay, batch size, unfreezing depth, epochs)  
and fine-tune a **fresh copy of the pretrained model** with them (the backbone itself is not trained from scratch).  

---

### 🔨 What happens in this stage:
1. **Load best hyperparameters**: either directly from the Optuna study in memory,  
   or from the saved JSON (`hf_ckpts/best_params_optuna1.json`).  
2. **Rebuild the model**: freeze the backbone, then unfreeze the last `k` encoder layers (the best `k` from tuning),  
   keeping the classifier head trainable.  
3. **Training setup**: longer training than during tuning (15 epochs when the params come from the in-memory study; the saved JSON used in the logged run specifies 12),  
   early stopping, bf16/fp16 acceleration, and fused AdamW optimizer for speed.  
4. **Train & save checkpoints**: log everything to W&B, keep the best model at the end.  
5. **Evaluate**: report metrics on both validation and the clean translated **test set**.  
   We include accuracy, precision, recall, macro-F1, plus full per-class report and confusion matrix.  

---

👉 This gives us the **final performance numbers** of our pipeline,  
and also saves the best model checkpoint so we can reuse it later.
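
The unfreezing depth decides how many parameters actually receive gradients. As a rough check (a sketch, not the notebook's code), the trainable-parameter counts printed in the logs for `k` = 8, 11 and 12 are consistent with a fixed cost per unfrozen encoder layer plus a small head (pooler + classifier) that is always trainable. The layer breakdown below is an assumption: hidden size 768, feed-forward size 3072 and 5 labels.

```python
HIDDEN, FFN, NUM_LABELS = 768, 3072, 5  # assumed mDeBERTa-v3-base sizes


def linear_params(n_in: int, n_out: int) -> int:
    """Parameter count of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


# One encoder block: Q, K, V and attention-output projections,
# two feed-forward projections, and two LayerNorms (scale + shift each).
PER_LAYER = (4 * linear_params(HIDDEN, HIDDEN)
             + linear_params(HIDDEN, FFN)
             + linear_params(FFN, HIDDEN)
             + 2 * 2 * HIDDEN)

# Pooler dense layer + linear classifier stay trainable for every k.
HEAD = linear_params(HIDDEN, HIDDEN) + linear_params(HIDDEN, NUM_LABELS)


def trainable_params(k: int) -> int:
    """Trainable parameters when the last k encoder layers are unfrozen."""
    return HEAD + k * PER_LAYER


for k in (8, 11, 12):
    print(k, f"{trainable_params(k):,}")
```

Under this breakdown each extra unfrozen layer adds about 7.1M trainable parameters, and the frozen remainder at `k` = 12 is dominated by the embedding matrix.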


# Load Best Model EX5

In [8]:
# ===== Final train on best HPs, save model, and evaluate on test =====
import os, json, time, importlib.util
import numpy as np
import pandas as pd
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report, confusion_matrix
import wandb
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback, AutoModelForSequenceClassification


# --- 1) Load best hyperparameters (from current study or from disk) ---
best_params_path = "hf_ckpts/best_params_optuna1.json"
if 'study' in globals() and study.best_trial is not None:
    bp = {
        "lr": float(study.best_trial.params["lr"]),
        "weight_decay": float(study.best_trial.params["weight_decay"]),
        "num_unfreeze_last_layers": int(study.best_trial.params["unfreeze_last_k"]),
        "batch_size": int(study.best_trial.params["batch_size"]),
        "epochs": int(15),  # keep your requested final epochs
    }
    print("[final] Using best params from in-memory study:", bp)
    os.makedirs(os.path.dirname(best_params_path), exist_ok=True)
    with open(best_params_path, "w") as f: json.dump(bp, f, indent=2)
else:
    with open(best_params_path, "r") as f:
        bp = json.load(f)
    print("[final] Loaded best params from file:", bp)

FINAL_EPOCHS   = int(bp.get("epochs", EPOCHS))
FINAL_LR       = float(bp["lr"])
FINAL_WD       = float(bp["weight_decay"])
FINAL_BS       = int(bp["batch_size"])
FINAL_UNFREEZE = int(bp["num_unfreeze_last_layers"])

# --- 2) Rebuild a fresh model with the chosen unfreezing plan ---
def build_model_for_final(unfreeze_last_k: int):
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=len(ORDER),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
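        # NOTE: this loads the parameters themselves in bf16 (not only autocast compute), so optimizer updates also happen in bf16
        torch_dtype=(torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else None),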
        use_safetensors=True,
    )
    base = getattr(model, "deberta", None) or getattr(model, "roberta", None) or getattr(model, "bert", None)
    if base is not None and hasattr(base, "encoder") and hasattr(base.encoder, "layer"):
        # freeze all, then unfreeze last k transformer blocks
        for p in base.parameters(): p.requires_grad = False
        for layer in base.encoder.layer[-int(unfreeze_last_k):]:
            for p in layer.parameters(): p.requires_grad = True
    for p in model.classifier.parameters(): p.requires_grad = True

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"[final] Trainable params: {trainable:,} / {total:,} ({100*trainable/total:.2f}%) ; unfreeze_last_k={unfreeze_last_k}")
    return model.to(DEVICE)

model = build_model_for_final(FINAL_UNFREEZE)

# --- 3) Final TrainingArguments (bf16 on 4090; fused AdamW; no torch.compile) ---
bf16_ok = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
timestamp = time.strftime("%Y%m%d_%H%M%S")
final_run_name = f"{BASE_RUN_NAME}__final_{timestamp}"
final_out_dir  = os.path.join("hf_ckpts", final_run_name)
os.makedirs(final_out_dir, exist_ok=True)

training_args = TrainingArguments(
    output_dir=final_out_dir,
    run_name=final_run_name,
    report_to=["wandb"],

    learning_rate=FINAL_LR,
    weight_decay=FINAL_WD,
    warmup_ratio=WARMUP_RATIO,
    num_train_epochs=FINAL_EPOCHS,

    per_device_train_batch_size=FINAL_BS,
    per_device_eval_batch_size=FINAL_BS,
    gradient_accumulation_steps=GRAD_ACCUM,

    dataloader_num_workers=max(0, NUM_WORKERS),   # keep your Windows-stable choice
    dataloader_pin_memory=True,
    dataloader_prefetch_factor=2 if NUM_WORKERS > 0 else None,

    fp16=(torch.cuda.is_available() and not bf16_ok),
    bf16=bf16_ok,
    optim=("adamw_torch_fused" if torch.cuda.is_available() else "adamw_torch"),
    save_safetensors=True,

    logging_strategy="steps",
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,

    torch_compile=False,   # keep off on Windows unless Triton is present & stable
    seed=SEED,
)

# --- 4) W&B run for the final train ---
wandb_run = wandb.init(
    project=WANDB_PROJECT,
    name=final_run_name,
    group=BASE_RUN_NAME + "_final",
    job_type="final-train",
    tags=["final","trainer","ex5"],
    config={
        "model": MODEL_NAME,
        "max_len": MAX_LEN,
        "epochs": FINAL_EPOCHS,
        "lr": FINAL_LR,
        "weight_decay": FINAL_WD,
        "warmup_ratio": WARMUP_RATIO,
        "batch_size": FINAL_BS,
        "grad_accum": GRAD_ACCUM,
        "unfreeze_last_k": FINAL_UNFREEZE,
        "num_workers": NUM_WORKERS,
        "bf16": bf16_ok,
    },
)

print(f"[final] device={DEVICE} bf16={bf16_ok} fp16={training_args.fp16} | out_dir={final_out_dir}")
print(f"[final] HP → bs={FINAL_BS} lr={FINAL_LR:.2e} wd={FINAL_WD:.1e} epochs={FINAL_EPOCHS} unfreeze_last_k={FINAL_UNFREEZE}")

# --- 5) Trainer + train ---
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=tokenizer,                 # deprecated in favor of `processing_class` (removal planned for transformers v5)
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[PrintAndWBCallback(print_every=100),
               EarlyStoppingCallback(early_stopping_patience=PATIENCE)],
)

print("[final] Starting training …")
train_out = trainer.train()
print("[final] TrainOutput:", train_out)
print("[final] Best checkpoint:", trainer.state.best_model_checkpoint)

# --- 6) Evaluate on validation & test, save detailed reports ---
print("[final] Evaluating on validation set …")
val_metrics = trainer.evaluate(eval_dataset=val_ds)
print("[final][val] →", val_metrics)
wandb.log({f"final/val_{k}": v for k, v in val_metrics.items()})

print("[final] Evaluating on TEST set …")
test_pred = trainer.predict(test_ds)
test_logits = test_pred.predictions
test_labels = test_pred.label_ids
test_preds  = np.argmax(test_logits, axis=-1)

test_acc = accuracy_score(test_labels, test_preds)
p, r, f1, _ = precision_recall_fscore_support(test_labels, test_preds, average="macro", zero_division=0)
test_metrics = {"test_accuracy": float(test_acc), "test_precision_macro": float(p), "test_recall_macro": float(r), "test_f1_macro": float(f1)}
print("[final][test] →", test_metrics)
wandb.log({f"final/{k}": v for k, v in test_metrics.items()})

# per-class report & confusion matrix
report_dict = classification_report(test_labels, test_preds, target_names=ORDER, digits=4, output_dict=True)
cm = confusion_matrix(test_labels, test_preds, labels=list(range(len(ORDER))))



[final] Loaded best params from file: {'lr': 0.00010491108358128862, 'weight_decay': 5.114700076417737e-05, 'num_unfreeze_last_layers': 12, 'batch_size': 16, 'epochs': 12}


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[final] Trainable params: 85,648,901 / 278,813,189 (30.72%) ; unfreeze_last_k=12


0,1
train/epoch,▁▁▂▂▃▃▃▄▄▅▅▆▆▆▇▇██
train/global_step,▁▁▂▂▃▃▃▄▄▅▅▆▆▆▇▇██
train/grad_norm,▁▁▁▂▃▂▃▆▄▇▄▆▆▄▄█▄▅
train/learning_rate,▁▁▂▂▃▃▄▄▅▅▅▆▆▇▇███
train/loss,████▇▇▆▅▄▄▃▂▂▂▁▂▁▁

0,1
train/epoch,0.77754
train/global_step,1800.0
train/grad_norm,13.4375
train/learning_rate,0.0001
train/loss,0.8664


[final] device=cuda bf16=True fp16=False | out_dir=hf_ckpts\microsoft__mdeberta-v3-base_ex5_trainer-try__final_20250817_104013
[final] HP → bs=16 lr=1.05e-04 wd=5.1e-05 epochs=12 unfreeze_last_k=12
[final] Starting training …


  trainer = Trainer(


[Run] epochs=12 bs=16 lr=1.05e-04 wd=5.1e-05 warmup_ratio=0.06 grad_accum=1


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.7431,0.713724,0.729106,0.761852,0.721231,0.736664
2,0.4685,0.510003,0.827017,0.843368,0.824059,0.831787
3,0.4428,0.587996,0.81171,0.811926,0.834801,0.814972
4,0.3409,0.469016,0.850826,0.852043,0.858283,0.8545
5,0.2814,0.487433,0.853984,0.856803,0.861751,0.857645
6,0.2758,0.460352,0.861516,0.860913,0.870439,0.864878
7,0.2485,0.500496,0.863946,0.86211,0.875398,0.867273
8,0.2435,0.506808,0.860787,0.857589,0.872901,0.863752
9,0.2496,0.497311,0.863946,0.862095,0.874121,0.867178
10,0.1994,0.514455,0.859572,0.855997,0.872231,0.862454


[e0 b100/2315] loss=1.6420 lr=6.23e-06
[e0 b200/2315] loss=1.6353 lr=1.25e-05
[e0 b300/2315] loss=1.6074 lr=1.88e-05
[e0 b400/2315] loss=1.5622 lr=2.51e-05
[e0 b500/2315] loss=1.5431 lr=3.14e-05
[e0 b600/2315] loss=1.4967 lr=3.77e-05
[e0 b700/2315] loss=1.4243 lr=4.40e-05
[e0 b800/2315] loss=1.3462 lr=5.03e-05
[e0 b900/2315] loss=1.2049 lr=5.66e-05
[e0 b1000/2315] loss=1.1781 lr=6.29e-05
[e0 b1100/2315] loss=1.0698 lr=6.92e-05
[e0 b1200/2315] loss=1.0031 lr=7.55e-05
[e0 b1300/2315] loss=0.9656 lr=8.18e-05
[e0 b1400/2315] loss=0.9317 lr=8.80e-05
[e0 b1500/2315] loss=0.9780 lr=9.43e-05
[e0 b1600/2315] loss=0.9174 lr=1.01e-04
[e0 b1700/2315] loss=0.8946 lr=1.05e-04
[e0 b1800/2315] loss=0.8692 lr=1.04e-04
[e0 b1900/2315] loss=0.8118 lr=1.04e-04
[e0 b2000/2315] loss=0.8220 lr=1.04e-04
[e0 b2100/2315] loss=0.7980 lr=1.03e-04
[e0 b2200/2315] loss=0.7537 lr=1.03e-04
[e0 b2300/2315] loss=0.7431 lr=1.02e-04
[val @ epoch 1] acc=0.7291 f1=0.7367 p=0.7619 r=0.7212
[val @ epoch 2] acc=0.8270 f1=0.83

[val @ epoch 11] acc=0.8639 f1=0.8673 p=0.8621 r=0.8754
[final][val] → {'eval_loss': 0.5004957914352417, 'eval_accuracy': 0.8639455782312925, 'eval_precision': 0.8621102868494844, 'eval_recall': 0.8753979815762152, 'eval_f1': 0.8672727518751845, 'eval_runtime': 4.4479, 'eval_samples_per_second': 925.389, 'eval_steps_per_second': 58.005, 'epoch': 11.0}
[final] Evaluating on TEST set …
[final][test] → {'test_accuracy': 0.8433385992627699, 'test_precision_macro': 0.8464462108232367, 'test_recall_macro': 0.8524153135872158, 'test_f1_macro': 0.8476267133422806}


# 📊 Ex.5 — Final Results (Validation & Test)

After retraining the model with the **best hyperparameters** from Optuna,  
we get the following performance:

---

## ✅ Validation set
- **Accuracy:** 0.8639  
- **Macro F1:** 0.8673  
- **Macro Precision:** 0.8621  
- **Macro Recall:** 0.8754  

> The model is very consistent on validation, reaching ~86–87% across all main metrics.  

---

## 🧪 Test set (clean translated)
- **Accuracy:** 0.8433  
- **Macro F1:** 0.8476  
- **Macro Precision:** 0.8464  
- **Macro Recall:** 0.8524  

> On the unseen test set, performance drops slightly (macro F1 is about 2 percentage points lower than on validation),  
> but results are still strong, with macro precision and recall close to each other.  

---

### 🎓 Takeaway
- Compared to our **Ex.4 training**, where the test **Macro F1 was 0.8685**,  
  this run is a bit **weaker (~0.85 F1)**.  
- Still, the model shows **good generalization and stable performance**.  
- The Validation ↔ Test gap remains small → no major overfitting.  
- Overall, with **macro F1 ~0.85**, the model captures all sentiment classes fairly well.

---
After finishing the final training and test evaluation,  
we saved **all important artifacts** for reproducibility and later analysis.  



### 📂 What we saved
- **Model + Tokenizer** (best checkpoint) → can be reloaded for inference or fine-tuning.  
- **Test metrics JSON** → contains summary scores (accuracy, precision, recall, macro F1).  
- **Classification report (CSV)** → per-class metrics (precision, recall, F1, support).  
- **Confusion matrix (CSV)** → class-wise confusion analysis.  


In [10]:
# --- 6) Save evaluation artifacts (Windows-safe: use W&B Artifacts, no symlinks) ---
import time
rep_df = pd.DataFrame(report_dict).transpose()
cm_df  = pd.DataFrame(cm, index=[f"true_{c}" for c in ORDER], columns=[f"pred_{c}" for c in ORDER])

rep_path  = os.path.join(final_out_dir, "classification_report_test.csv")
cm_path   = os.path.join(final_out_dir, "confusion_matrix_test.csv")
json_path = os.path.join(final_out_dir, "test_metrics.json")

rep_df.to_csv(rep_path, index=True)
cm_df.to_csv(cm_path, index=True)
with open(json_path, "w") as f:
    json.dump(test_metrics, f, indent=2)

# ⛔️ Remove wandb.save(...) — it uses symlinks on Windows and fails.
# Use an Artifact instead:
ts = time.strftime("%Y%m%d_%H%M%S")
eval_art = wandb.Artifact(f"{BASE_RUN_NAME}-eval-{ts}", type="evaluation")
eval_art.add_file(rep_path,  name="classification_report_test.csv")
eval_art.add_file(cm_path,   name="confusion_matrix_test.csv")
eval_art.add_file(json_path, name="test_metrics.json")
wandb_run.log_artifact(eval_art)

# --- 7) Save the final model (best weights) + tokenizer ---
save_dir = os.path.join(final_out_dir, "best_model")
os.makedirs(save_dir, exist_ok=True)
trainer.save_model(save_dir)        # saves model weights + config (not the optimizer/trainer state)
tokenizer.save_pretrained(save_dir) # saves tokenizer

# a tiny README for the checkpoint
with open(os.path.join(save_dir, "README.txt"), "w", encoding="utf-8") as f:
    f.write(
        f"Model: {MODEL_NAME}\n"
        f"Labels: {ORDER}\n"
        f"hp: lr={FINAL_LR}, weight_decay={FINAL_WD}, batch_size={FINAL_BS}, "
        f"epochs={FINAL_EPOCHS}, unfreeze_last_k={FINAL_UNFREEZE}\n"
        f"Val metrics: {val_metrics}\n"
        f"Test metrics: {test_metrics}\n"
    )

# (optional but nice) Log the whole model directory as a model artifact too
model_art = wandb.Artifact(f"{BASE_RUN_NAME}-model-{ts}", type="model")
model_art.add_dir(save_dir)
wandb_run.log_artifact(model_art)

# Summaries + finish
wandb_run.summary["best_checkpoint_dir"] = trainer.state.best_model_checkpoint
wandb_run.summary.update({f"final/{k}": v for k, v in test_metrics.items()})
wandb.finish()

print("\n=== DONE ===")
print("Saved:")
print(" • Model + tokenizer →", save_dir)
print(" • Test metrics JSON →", json_path)
print(" • Classification report CSV →", rep_path)
print(" • Confusion matrix CSV →", cm_path)


[34m[1mwandb[0m: Adding directory to artifact (.\hf_ckpts\microsoft__mdeberta-v3-base_ex5_trainer-try__final_20250817_104013\best_model)... Done. 0.9s
[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


0,1
eval/accuracy,▁▆▅▇▇███████
eval/f1,▁▆▅▇▇███████
eval/loss,█▂▅▁▂▁▂▂▂▂▃▂
eval/precision,▁▇▄▇████████
eval/recall,▁▆▆▇▇███████
eval/runtime,▂▁█▂▅█▅▇▂▃▂▄
eval/samples_per_second,▇█▁▇▄▁▄▂▇▆▇▄
eval/steps_per_second,▇█▁▇▄▁▄▂▇▆▇▄
final/test_accuracy,▁
final/test_f1_macro,▁

0,1
best_checkpoint_dir,hf_ckpts\microsoft__...
eval/accuracy,0.86395
eval/f1,0.86727
eval/loss,0.5005
eval/precision,0.86211
eval/recall,0.8754
eval/runtime,4.4479
eval/samples_per_second,925.389
eval/steps_per_second,58.005
final/test_accuracy,0.84334



=== DONE ===
Saved:
 • Model + tokenizer → hf_ckpts\microsoft__mdeberta-v3-base_ex5_trainer-try__final_20250817_104013\best_model
 • Test metrics JSON → hf_ckpts\microsoft__mdeberta-v3-base_ex5_trainer-try__final_20250817_104013\test_metrics.json
 • Classification report CSV → hf_ckpts\microsoft__mdeberta-v3-base_ex5_trainer-try__final_20250817_104013\classification_report_test.csv
 • Confusion matrix CSV → hf_ckpts\microsoft__mdeberta-v3-base_ex5_trainer-try__final_20250817_104013\confusion_matrix_test.csv


In [11]:
cm_df

Unnamed: 0,pred_extremely negative,pred_negative,pred_neutral,pred_positive,pred_extremely positive
true_extremely negative,538,50,2,2,0
true_negative,104,873,19,41,4
true_neutral,5,64,506,42,2
true_positive,4,64,32,733,114
true_extremely positive,1,1,0,44,553


In [12]:
pt_path = os.path.join(final_out_dir, "best_model_ex5.pt")
torch.save(trainer.model.state_dict(), pt_path)
print("Saved state_dict →", pt_path)


Saved state_dict → hf_ckpts\microsoft__mdeberta-v3-base_ex5_trainer-try__final_20250817_104013\best_model_ex5.pt


# ⚖️  Weighted Loss Experiment (mid-class emphasis)

In this extra experiment we re-run the **full training loop (Ex.4 style)**,  
but we make one important change: **the loss function now gives more weight to the “middle” classes**  
(`negative`, `neutral`, `positive`).  

---

### 🎯 Why we do this
- In earlier runs, the model was already strong on the **extreme classes** (`extremely negative`, `extremely positive`).  
- Performance on the **middle classes** was weaker (more confusion between neutral/positive/negative).  
- To balance this, we **boost the loss weight** for the mid-classes so that mistakes on them are penalized more.  
  → This should encourage the model to learn better boundaries between these subtle sentiments.  
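
As a rough check of this claim, the sketch below recomputes per-class precision, recall, and F1 from the Ex.5 test confusion matrix printed above (rows are true labels, columns are predictions). The helper name `per_class_scores` is ours for illustration only; it is not part of the training code.

```python
# Test confusion matrix copied from the cm_df output above (rows = true, cols = predicted)
LABELS = ["extremely negative", "negative", "neutral", "positive", "extremely positive"]
CM = [
    [538,  50,   2,   2,   0],
    [104, 873,  19,  41,   4],
    [  5,  64, 506,  42,   2],
    [  4,  64,  32, 733, 114],
    [  1,   1,   0,  44, 553],
]

def per_class_scores(cm):
    """Return {class_index: (precision, recall, f1)} for a square confusion matrix."""
    n = len(cm)
    scores = {}
    for i in range(n):
        tp = cm[i][i]
        true_total = sum(cm[i])                       # row sum: samples whose true label is i
        pred_total = sum(cm[r][i] for r in range(n))  # column sum: samples predicted as i
        recall = tp / true_total if true_total else 0.0
        precision = tp / pred_total if pred_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[i] = (precision, recall, f1)
    return scores

scores = per_class_scores(CM)
accuracy = sum(CM[i][i] for i in range(len(CM))) / sum(map(sum, CM))

for i, name in enumerate(LABELS):
    p, r, f1 = scores[i]
    print(f"{name:>20s}  precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}")
print(f"accuracy={accuracy:.5f}")
```

On this matrix, all three mid classes have lower recall than both extreme classes, and the largest share of the `negative` and `positive` errors lands in the neighbouring extreme class.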

---

### 🔧 How we apply it
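- We start from the **inverse-frequency class weights** (baseline): the weight of class *c* is N / (K · n_c), with N training samples, K classes, and n_c samples in class *c*.  
- Then, we multiply the weights of the mid-classes by a fixed **boost factor** (1.5× in the code below).  
- Finally, we normalize so the average weight stays ~1.0, which keeps the loss scale comparable to the unweighted runs.  
- Loss = **weighted cross-entropy** with label smoothing of 0.05 (plain weighted cross-entropy if the installed PyTorch does not accept `label_smoothing`).  
- The rerun cell further below (W&B project `adv-dl-p2-deberta-midf1_new_study`) drops the inverse-frequency baseline and uses fixed tiers instead: 1.5 for the mid-classes and 0.8 for the extremes, normalized the same way.  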
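
The weighting recipe can be sketched with the standard library alone. This is a minimal illustration, assuming a plain list of label names as input; `boosted_class_weights` and the toy label counts are made up for the example and are not taken from the dataset.

```python
from collections import Counter

ORDER = ["extremely negative", "negative", "neutral", "positive", "extremely positive"]
MID_CLASSES = {"negative", "neutral", "positive"}

def boosted_class_weights(labels, mid_boost=1.5):
    """Inverse-frequency weights, mid classes multiplied by mid_boost, mean rescaled to 1.0."""
    counts = Counter(labels)
    total = sum(counts.values())
    k = len(ORDER)
    # inverse-frequency baseline: N / (K * n_c); max(1, ...) guards against absent classes
    inv = [total / (k * max(1, counts.get(name, 0))) for name in ORDER]
    # multiplicative boost on the mid classes only
    boosted = [w * (mid_boost if name in MID_CLASSES else 1.0) for w, name in zip(inv, ORDER)]
    # rescale so the weights average to 1.0
    scale = k / sum(boosted)
    return {name: w * scale for name, w in zip(ORDER, boosted)}

# Toy label list with made-up counts, for illustration only
toy_labels = (["extremely negative"] * 10 + ["negative"] * 30 + ["neutral"] * 20
              + ["positive"] * 30 + ["extremely positive"] * 10)
weights = boosted_class_weights(toy_labels)
for name in ORDER:
    print(f"{name:>20s}: {weights[name]:.3f}")
```

Note that the inverse-frequency baseline already favours rare classes, so when the extremes are the rarest classes the boost narrows the gap rather than putting the mid classes on top.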

---

### 📊 Hyperparameters
- For this **first run** we kept the HP search range the same as before:  
  - `lr` in **8e-5 – 1e-3** (log scale)  
  - `weight_decay` in **1e-6 – 1e-4**  
  - Unfreeze depth: **8–12 layers**  
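  - Batch size: suggested from {4, 8, 16, 32, 64} and logged as `suggested_batch_size`; the data loaders themselves are built once with a fixed batch size of 16  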
- The only difference is the **loss weighting strategy**.  
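
To illustrate what "log scale" means for these ranges, here is a small standard-library sketch that draws random configurations from the same search space. It only approximates what a purely random sampler would do; Optuna's default sampler adapts its suggestions after the first trials. `sample_log_uniform` and `sample_config` are hypothetical helper names.

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def sample_config(rng):
    """One random configuration from the ranges listed above."""
    return {
        "lr": sample_log_uniform(rng, 8e-5, 1e-3),
        "weight_decay": sample_log_uniform(rng, 1e-6, 1e-4),
        "num_unfreeze_last_layers": rng.randint(8, 12),   # inclusive on both ends
        "batch_size": rng.choice([4, 8, 16, 32, 64]),
    }

rng = random.Random(42)
configs = [sample_config(rng) for _ in range(5)]
for cfg in configs:
    print(cfg)
```

Sampling in the log domain spends as many draws between 8e-5 and 2.8e-4 as between 2.8e-4 and 1e-3, which a linear draw over the same interval would not.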

---

### 🔮 Next step
- After this first weighted-loss run, we may also include the **mid-class boost ratio itself**  
as a tunable hyperparameter in Optuna (instead of fixing it to 1.5).  
- That way, the study can find the **optimal balance** between extremes vs. middle classes.  
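
A possible shape for that extension is sketched below. It assumes the boost is searched in 1.25–2.0 (the range mentioned in the code comment) and, for brevity, starts from uniform base weights instead of inverse frequencies. `StubTrial` is a hypothetical stand-in so the snippet runs without Optuna; in a real study the `trial` object passed to the objective would be used instead.

```python
class StubTrial:
    """Hypothetical stand-in for an Optuna trial: returns one fixed value for any suggestion."""
    def __init__(self, value):
        self.value = value

    def suggest_float(self, name, low, high):
        # same positional arguments as Optuna's Trial.suggest_float(name, low, high)
        assert low <= self.value <= high, f"{name} outside [{low}, {high}]"
        return self.value

def tiered_weights(mid_boost, num_classes=5, mid_ids=(1, 2, 3)):
    """Weights of 1.0 for extremes and mid_boost for mid classes, rescaled to mean 1.0."""
    raw = [mid_boost if i in mid_ids else 1.0 for i in range(num_classes)]
    scale = num_classes / sum(raw)
    return [w * scale for w in raw]

def suggest_class_weights(trial):
    """Let the study pick the boost, then derive the per-class loss weights from it."""
    mid_boost = trial.suggest_float("mid_boost", 1.25, 2.0)
    return mid_boost, tiered_weights(mid_boost)

boost, weights = suggest_class_weights(StubTrial(1.5))
print(boost, [round(w, 3) for w in weights])
```

Because the weights are renormalized to mean 1.0, changing the boost shifts emphasis between classes without changing the overall loss scale, so the learning-rate range stays comparable across trials.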

---

✅ So in short: same training process as Ex.4,  
but **loss re-weighting** to focus on the harder mid classes.


In [None]:
# # =========================
# # ADV DL – Part B: Multilingual model (mDeBERTa-v3-base), weighted loss – Exercise-4 style
# # Custom loop + early stopping + W&B + Optuna ONLY; freeze base, unfreeze last k layers
# # Uses df_train / df_test with columns: OriginalTweet (str), Sentiment (str)
# # =========================

# import os, math, random, time, json
# from typing import Dict, List, Tuple

# import numpy as np
# import pandas as pd
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report, confusion_matrix
# import torch
# from torch.utils.data import Dataset, DataLoader
# from torch.cuda.amp import autocast, GradScaler
# import torch.nn.functional as F
# from collections import Counter

# # ---- deps ----
# # !pip -q install transformers==4.43.3 optuna==3.6.1 wandb==0.17.5 >/dev/null

# import transformers
# from transformers import (
#     AutoTokenizer, AutoModelForSequenceClassification,
#     DataCollatorWithPadding, get_linear_schedule_with_warmup
# )
# import os
# os.environ["TRANSFORMERS_NO_TF"] = "1"
# os.environ["TRANSFORMERS_NO_FLAX"] = "1"

# import optuna
# import wandb

# # -------------------------
# # Constants (no CFG, Optuna-only workflow)
# # -------------------------
# MODEL_NAME = "microsoft/mdeberta-v3-base"
# MAX_LEN = 512
# BATCH_SIZE = 16
# WARMUP_RATIO = 0.06
# GRAD_CLIP = 1.0
# USE_AMP = True

# # ❗ New W&B project & run base (to keep things separate)
# PROJECT = "adv-dl-p2-deberta-midf1"
# BASE_RUN_NAME = "microsoft/mdeberta-v3-base_full_ex_4_midf1"

# TRIALS = 20
# SEED = 42
# DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# def set_seed(seed=42):
#     random.seed(seed); np.random.seed(seed)
#     torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
#     torch.backends.cudnn.deterministic = True
#     torch.backends.cudnn.benchmark = False

# set_seed(SEED)

# # ---- GPU perf toggles (Windows-safe) ----
# torch.backends.cuda.matmul.allow_tf32 = True
# torch.backends.cudnn.allow_tf32 = True
# try:
#     torch.set_float32_matmul_precision("high")
# except Exception:
#     pass

# # -------------------------
# # Label mapping (5-way sentiment)
# # -------------------------
# CANON = {
#     "extremely negative": "extremely negative",
#     "negative": "negative",
#     "neutral": "neutral",
#     "positive": "positive",
#     "extremely positive": "extremely positive",
# }
# ORDER = ["extremely negative","negative","neutral","positive","extremely positive"]
# LABEL2ID = {lab: i for i, lab in enumerate(ORDER)}
# ID2LABEL = {i: lab for lab, i in LABEL2ID.items()}

# def normalize_label(s: str) -> str:
#     s = str(s).strip().lower()
#     s = s.replace("very negative", "extremely negative")
#     s = s.replace("very positive", "extremely positive")
#     s = s.replace("extreme negative", "extremely negative")
#     s = s.replace("extreme positive", "extremely positive")
#     return CANON.get(s, s)

# # -------------------------
# # Expect df_train, df_test in memory
# # -------------------------
# assert "OriginalTweet" in df_train.columns and "Sentiment" in df_train.columns, "df_train missing required columns"
# assert "OriginalTweet" in df_test.columns and "Sentiment" in df_test.columns, "df_test missing required columns"

# def prep_df(df: pd.DataFrame) -> pd.DataFrame:
#     df = df.copy()
#     df = df.dropna(subset=["OriginalTweet", "Sentiment"])
#     df["text"] = df["OriginalTweet"].astype(str).str.strip()
#     df["label_name"] = df["Sentiment"].apply(normalize_label)
#     df = df[df["label_name"].isin(ORDER)].reset_index(drop=True)
#     df["label"] = df["label_name"].map(LABEL2ID)
#     return df[["text", "label", "label_name"]]

# dftrain_ = prep_df(df_train)
# dftest_  = prep_df(df_test)

# train_df, val_df = train_test_split(
#     dftrain_, test_size=0.1, stratify=dftrain_["label"], random_state=SEED
# )
# print(f"Train/Val/Test sizes: {len(train_df)}/{len(val_df)}/{len(dftest_)}")

# # # --- class weights for mid-class emphasis (computed from training split) ---
# # _counts = Counter(train_df["label"].tolist())
# # _num_classes = len(ORDER)
# # _total = sum(_counts.values())
# # # inverse-freq: N / (K * n_c)
# # CLASS_WEIGHTS = torch.tensor(
# #     [_total / (_num_classes * _counts[i]) for i in range(_num_classes)],
# #     dtype=torch.float,
# #     device=DEVICE,
# # )
# # --- class weights with explicit mid-class boost ---
# from collections import Counter
# import numpy as np
# import torch

# _counts = Counter(train_df["label"].tolist())
# _num_classes = len(ORDER)
# _total = sum(_counts.values())

# # inverse frequency baseline: N / (K * n_c)
# inv = np.array([_total / (_num_classes * max(1, _counts.get(i, 0))) for i in range(_num_classes)],
#                dtype=np.float32)

# # indices
# IDX_EXT_NEG = LABEL2ID["extremely negative"]
# IDX_NEG     = LABEL2ID["negative"]
# IDX_NEU     = LABEL2ID["neutral"]
# IDX_POS     = LABEL2ID["positive"]
# IDX_EXT_POS = LABEL2ID["extremely positive"]

# # multiplicative boost: mids > extremes (tune 1.25–2.0 as you like)
# MID_BOOST = 1.5
# mult = np.ones(_num_classes, dtype=np.float32)
# mult[[IDX_NEG, IDX_NEU, IDX_POS]] = MID_BOOST   # boost the mid classes

# weights = inv * mult

# # optional: renormalize so the mean weight is 1.0 (keeps loss scale stable)
# weights = weights * (_num_classes / weights.sum())

# CLASS_WEIGHTS = torch.tensor(weights, dtype=torch.float, device=DEVICE)

# print("[loss] Using class-weighted CrossEntropy; weights per label:")
# print({ID2LABEL[i]: float(CLASS_WEIGHTS[i].item()) for i in range(_num_classes)})


# # -------------------------
# # Dataset & Collator
# # -------------------------
# class TweetDataset(Dataset):
#     def __init__(self, df: pd.DataFrame, tokenizer: transformers.PreTrainedTokenizerBase, max_len: int):
#         self.texts = df["text"].tolist()
#         self.labels = df["label"].tolist()
#         self.tok = tokenizer
#         self.max_len = max_len
#     def __len__(self): return len(self.texts)
#     def __getitem__(self, idx):
#         enc = self.tok(self.texts[idx], truncation=True, max_length=self.max_len, padding=False)
#         enc["labels"] = self.labels[idx]
#         return {k: torch.tensor(v) for k, v in enc.items()}

# tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
# train_ds = TweetDataset(train_df, tokenizer, MAX_LEN)
# val_ds   = TweetDataset(val_df, tokenizer, MAX_LEN)
# test_ds  = TweetDataset(dftest_, tokenizer, MAX_LEN)

# BATCH_SIZE=16
# # ---- pad_to_multiple_of=8 for Tensor Cores; Windows: workers=0 is often faster ----
# collate_fn = DataCollatorWithPadding(tokenizer=tokenizer, pad_to_multiple_of=8)
# train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True,  collate_fn=collate_fn, num_workers=0, pin_memory=True, persistent_workers=False)
# val_loader   = DataLoader(val_ds,   batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn, num_workers=0, pin_memory=True, persistent_workers=False)
# test_loader  = DataLoader(test_ds,  batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn, num_workers=0, pin_memory=True, persistent_workers=False)

# # -------------------------
# # Model & Freeze/Unfreeze strategy
# # -------------------------
# def build_model(num_unfreeze_last_layers: int = 4):
#     model = AutoModelForSequenceClassification.from_pretrained(
#         MODEL_NAME, num_labels=len(ORDER), id2label=ID2LABEL, label2id=LABEL2ID
#     )
#     base = getattr(model, "roberta", None) or getattr(model, "bert", None) or getattr(model, "deberta", None)
#     if base is not None:
#         for p in base.parameters(): p.requires_grad = False
#         if hasattr(base, "encoder") and hasattr(base.encoder, "layer"):
#             k = num_unfreeze_last_layers
#             if k > 0:
#                 for layer in base.encoder.layer[-k:]:
#                     for p in layer.parameters(): p.requires_grad = True
#     for p in model.classifier.parameters(): p.requires_grad = True
#     return model.to(DEVICE)

# # -------------------------
# # Train / Eval utilities
# # -------------------------
# def get_optimizer_scheduler(model, num_training_steps: int, lr: float, weight_decay: float):
#     no_decay = ["bias", "LayerNorm.weight"]
#     optimizer_grouped_parameters = [
#         {"params": [p for n, p in model.named_parameters() if p.requires_grad and not any(nd in n for nd in no_decay)], "weight_decay": weight_decay},
#         {"params": [p for n, p in model.named_parameters() if p.requires_grad and any(nd in n for nd in no_decay)],  "weight_decay": 0.0},
#     ]
#     # try fused AdamW on CUDA (faster step) — falls back if unavailable
#     try:
#         optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=lr, weight_decay=weight_decay, fused=(DEVICE=="cuda"))
#     except TypeError:
#         optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=lr, weight_decay=weight_decay)
#     num_warmup = int(num_training_steps * WARMUP_RATIO)
#     scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup, num_training_steps=num_training_steps)
#     return optimizer, scheduler

# MIDS = [LABEL2ID["negative"], LABEL2ID["neutral"], LABEL2ID["positive"]]

# def evaluate(model, loader) -> Dict[str, float]:
#     model.eval()
#     preds, labels = [], []
#     with torch.no_grad():
#         for batch in loader:
#             batch = {k: v.to(DEVICE, non_blocking=True) for k, v in batch.items()}
#             # AMP autocast for faster eval math
#             with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),
#                           enabled=(DEVICE == "cuda" and USE_AMP)):
#                 logits = model(**batch).logits
#             preds.extend(torch.argmax(logits, dim=-1).detach().cpu().tolist())
#             labels.extend(batch["labels"].detach().cpu().tolist())
#     acc = accuracy_score(labels, preds)
#     p, r, f1, _ = precision_recall_fscore_support(labels, preds, average="macro", zero_division=0)
#     # extra: mid-class F1 (negative/neutral/positive)
#     p_mid, r_mid, f1_mid, _ = precision_recall_fscore_support(
#         labels, preds, labels=MIDS, average="macro", zero_division=0
#     )
#     return {"acc": acc, "precision": p, "recall": r, "f1": f1, "f1_mid": f1_mid}

# def train_one_run(hp: Dict) -> Tuple[str, Dict[str, float]]:
#     """
#     hp keys: run_name, num_unfreeze_last_layers, lr, weight_decay, epochs, patience, trial_number
#     """
#     run_name = hp["run_name"]
#     num_unfreeze = int(hp["num_unfreeze_last_layers"])
#     lr = float(hp["lr"])
#     wd = float(hp["weight_decay"])
#     epochs   = int(hp.get("epochs",   FIXED_EPOCHS))
#     patience = int(hp.get("patience", FIXED_PATIENCE))
#     model = build_model(num_unfreeze)
#     total_steps = int(math.ceil(len(train_loader) * epochs))
#     optimizer, scheduler = get_optimizer_scheduler(model, total_steps, lr, wd)

#     scaler = GradScaler(enabled=(DEVICE == "cuda" and USE_AMP))
#     best_metric = -1.0
#     no_improve = 0

#     # ❗ save to a different folder + name to avoid collisions
#     safe_run_name = run_name.replace("/", "__").replace("\\", "__")
#     ckpt_dir = "checkpoints_midf1"
#     os.makedirs(ckpt_dir, exist_ok=True)
#     best_path = os.path.join(ckpt_dir, f"best_midf1_{safe_run_name}.pt")

#     wandb_run = wandb.init(
#         project=PROJECT,
#         name=run_name,
#         config={
#             "model": MODEL_NAME,
#             "max_len": MAX_LEN,
#             "batch_size": BATCH_SIZE,
#             "epochs": epochs,
#             "lr": lr,
#             "weight_decay": wd,
#             "warmup_ratio": WARMUP_RATIO,
#             "grad_clip": GRAD_CLIP,
#             "num_unfreeze_last_layers": num_unfreeze,
#             "trial_number": hp.get("trial_number", -1),
#             "suggested_batch_size": hp.get("batch_size", BATCH_SIZE),
#         },
#         reinit=True,
#     )

#     # nicer W&B charts
#     wandb.define_metric("epoch")
#     wandb.define_metric("step")
#     wandb.define_metric("train/*", step_metric="step")
#     wandb.define_metric("val/*",   step_metric="epoch")

#     # print + log trainable params
#     total_params     = sum(p.numel() for p in model.parameters())
#     trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
#     print(f"Trainable params: {trainable_params:,} / {total_params:,} "
#           f"({100.0*trainable_params/total_params:.2f}%) ; unfreeze_last_k={num_unfreeze}")
#     wandb.log({"params/total": total_params,
#                "params/trainable": trainable_params,
#                "params/ratio": trainable_params/max(1,total_params)}, step=0)

#     for epoch in range(epochs):
#         model.train()
#         t0 = time.time()
#         running_loss = 0.0

#         for step, batch in enumerate(train_loader):
#             batch = {k: v.to(DEVICE, non_blocking=True) for k, v in batch.items()}
#             labels = batch.pop("labels")  # we compute weighted loss ourselves

#             optimizer.zero_grad(set_to_none=True)
#             # use BF16 if supported; else FP16
#             with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),
#                           enabled=(DEVICE == "cuda" and USE_AMP)):
#                 outputs = model(**batch)
#                 logits = outputs.logits
#                 # weighted CE (+ smoothing if available)
#                 try:
#                     loss = F.cross_entropy(logits, labels, weight=CLASS_WEIGHTS, label_smoothing=0.05)
#                 except TypeError:
#                     loss = F.cross_entropy(logits, labels, weight=CLASS_WEIGHTS)

#             scaler.scale(loss).backward()
#             if GRAD_CLIP is not None:
#                 scaler.unscale_(optimizer)
#                 torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
#             scaler.step(optimizer); scaler.update(); scheduler.step()
#             running_loss += loss.item()

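#             if step % 20 == 0:
#                 # log against the global step so the x-axis does not reset every epoch
#                 wandb.log({"train/loss": loss.item(),
#                            "step": (epoch * len(train_loader)) + (step + 1),
#                            "epoch": epoch + 1})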

#             # periodic console + throughput log (about 10x per epoch)
#             if step % max(1, len(train_loader)//10) == 0 or step == 1:
#                 avg_loss = running_loss / max(1, (step + 1))
#                 elapsed  = time.time() - t0
#                 items    = (step + 1) * BATCH_SIZE
#                 itps     = items / max(elapsed, 1e-6)
#                 print(f"[e{epoch+1} b{step+1}/{len(train_loader)}] loss={loss.item():.4f} avg={avg_loss:.4f} it/s={itps:.1f}")
#                 wandb.log({"train/avg_loss_so_far": avg_loss,
#                            "train/items_per_sec": itps,
#                            "step": (epoch * len(train_loader)) + (step + 1),
#                            "epoch": epoch + 1})

#         # epoch-end validation
#         val_metrics = evaluate(model, val_loader)
#         elapsed = time.time() - t0

#         epoch_loss = running_loss / max(1, len(train_loader))
#         current_lr = scheduler.get_last_lr()[0]
#         wandb.log({
#             "train/epoch_loss": epoch_loss,
#             "val/acc": val_metrics["acc"],
#             "val/precision": val_metrics["precision"],
#             "val/recall": val_metrics["recall"],
#             "val/f1": val_metrics["f1"],
#             "val/mid_f1": val_metrics["f1_mid"],   # extra log (format unchanged elsewhere)
#             "lr": current_lr,
#             "time/epoch_sec": elapsed,
#             "epoch": epoch + 1,
#         })

#         # Early stopping on mid-class F1 (prints stay the same)
#         target_metric = val_metrics["f1_mid"]
#         if target_metric > best_metric:
#             best_metric = target_metric
#             torch.save(model.state_dict(), best_path)
#             no_improve = 0
#             wandb_run.summary["best_val_f1"] = best_metric  # kept same key for compatibility
#             wandb_run.summary["best_val_mid_f1"] = best_metric
#             wandb_run.summary["best_checkpoint_path"] = best_path
#             wandb.log({"val/best_f1_so_far": best_metric, "val/best_epoch": epoch + 1})
#         else:
#             no_improve += 1
#             if no_improve >= patience:
#                 print(f"Early stopping at epoch {epoch+1}")
#                 break

#         # console print line unchanged
#         print(f"Epoch {epoch+1}/{epochs} | "
#               f"loss={epoch_loss:.4f} | "
#               f"val_acc={val_metrics['acc']:.4f} | val_f1={val_metrics['f1']:.4f} | time={elapsed:.1f}s")

#     # Load best and return path + metrics on val for reference
#     model.load_state_dict(torch.load(best_path, map_location=DEVICE))
#     final_val = evaluate(model, val_loader)

#     # store final val in the W&B summary for quick sorting; this has to happen
#     # before wandb.finish(), because wandb.run is None once the run is closed
#     wandb_run.summary["final_val_acc"] = final_val["acc"]
#     wandb_run.summary["final_val_precision"] = final_val["precision"]
#     wandb_run.summary["final_val_recall"] = final_val["recall"]
#     wandb_run.summary["final_val_f1"] = final_val["f1"]
#     wandb_run.summary["final_val_mid_f1"] = final_val["f1_mid"]
#     wandb.finish()

#     return best_path, final_val

# # -------------------------
# # Optuna hyperparameter tuning (ALWAYS ON)
# # -------------------------

# # Constants
# FIXED_EPOCHS = 12
# FIXED_PATIENCE = 4

# def objective(trial: optuna.trial.Trial):
#     params = {
#         "run_name": f"{BASE_RUN_NAME}_optuna_trial_{trial.number}",
#         "num_unfreeze_last_layers": trial.suggest_int("num_unfreeze_last_layers", 8, 12),
#         "lr": trial.suggest_float("lr", 8e-5, 1e-3, log=True),
#         "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-4, log=True),
#         "batch_size": trial.suggest_categorical("batch_size", [4, 8, 16, 32, 64]),
#         "epochs": FIXED_EPOCHS,
#         "patience": FIXED_PATIENCE,
#         "trial_number": trial.number,
#     }
#     path, val_metrics = train_one_run(params)
#     # console visibility per trial (unchanged)
#     print(f"[Trial {trial.number}] f1={val_metrics['f1']:.4f} | "
#           f"unfreeze_k={params['num_unfreeze_last_layers']} lr={params['lr']:.2e} "
#           f"wd={params['weight_decay']:.1e} suggested_bs={params['batch_size']}")
#     # report intermediate value for pruning if enabled (keep macro f1 for study objective)
#     trial.report(val_metrics["f1"], step=1)
#     return val_metrics["f1"]

# study = optuna.create_study(direction="maximize")
# study.optimize(objective, n_trials=TRIALS, show_progress_bar=True)
# print("Best trial:", study.best_trial.number, "F1:", study.best_value)
# best_params = {"run_name": f"{BASE_RUN_NAME}_best_optuna", **study.best_trial.params}

# # also persist best params to a different folder/name
# os.makedirs("checkpoints_midf1", exist_ok=True)
# with open(os.path.join("checkpoints_midf1", "best_hparams_optuna_midf1.json"), "w") as f:
#     json.dump({**best_params, "epochs": FIXED_EPOCHS, "patience": FIXED_PATIENCE}, f, indent=2)


In [None]:
# import json, os
# os.makedirs("checkpoints", exist_ok=True)

# # best hparams from Optuna (only suggested ones live here)
# best_hparams = study.best_trial.params

# # if your train_one_run expects epochs/patience and they were fixed (not suggested),
# # add them explicitly:
# best_hparams_complete = {
#     **best_hparams,
#     "epochs": FIXED_EPOCHS,       # or whatever you used
#     "patience": FIXED_PATIENCE,   # "
# }
# # hp_path = os.path.join("checkpoints", "best_hparams_optuna.json")
# hp_path = os.path.join("checkpoints", "best_hparams_optuna_2.json")
# with open(hp_path, "w") as f:
#     json.dump(best_hparams_complete, f, indent=2)
# print("Saved best hparams to:", hp_path)
# print("Best trial number:", study.best_trial.number, " value:", study.best_value)

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load CSVs (expected columns: ['UserName','ScreenName','Location','TweetAt','OriginalTweet','Sentiment'])
TRAIN_CSV = "Corona_NLP_train_cleaned_translated.csv"   # or "Corona_NLP_train.csv"
TEST_CSV  = "Corona_NLP_test_cleaned_translated.csv"    # or "Corona_NLP_test.csv"


df_train = pd.read_csv(TRAIN_CSV, encoding="utf-8", engine="python")
df_test  = pd.read_csv(TEST_CSV,  encoding="utf-8", engine="python")

In [3]:
# =========================
# ADV DL – Part B: Multilingual model (mDeBERTa-v3-base), weighted loss – Exercise-4 style
# Custom loop + early stopping + W&B + Optuna ONLY; freeze base, unfreeze last k layers
# Uses df_train / df_test with columns: OriginalTweet (str), Sentiment (str)
# =========================

import os, math, random, time, json
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report, confusion_matrix
import torch
from torch.utils.data import Dataset, DataLoader
from torch.cuda.amp import autocast, GradScaler
import torch.nn.functional as F
from collections import Counter

# ---- deps ----
# !pip -q install transformers==4.43.3 optuna==3.6.1 wandb==0.17.5 >/dev/null

# Backend flags are set before importing transformers; setting them afterwards has no effect
os.environ["TRANSFORMERS_NO_TF"] = "1"
os.environ["TRANSFORMERS_NO_FLAX"] = "1"

import transformers
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    DataCollatorWithPadding, get_linear_schedule_with_warmup
)

import optuna
import wandb

# -------------------------
# Constants (no CFG, Optuna-only workflow)
# -------------------------
MODEL_NAME = "microsoft/mdeberta-v3-base"
MAX_LEN = 512
BATCH_SIZE = 16
WARMUP_RATIO = 0.06
GRAD_CLIP = 1.0
USE_AMP = True

# ❗ New W&B project & run base (to keep things separate)
PROJECT = "adv-dl-p2-deberta-midf1_new_study"
BASE_RUN_NAME = "microsoft/mdeberta-v3-base_full_ex_4_midf1"

TRIALS = 20
SEED = 42
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def set_seed(seed=42):
    random.seed(seed); np.random.seed(seed)
    torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(SEED)

# ---- GPU perf toggles (Windows-safe) ----
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
try:
    torch.set_float32_matmul_precision("high")
except Exception:
    pass

# -------------------------
# Label mapping (5-way sentiment)
# -------------------------
CANON = {
    "extremely negative": "extremely negative",
    "negative": "negative",
    "neutral": "neutral",
    "positive": "positive",
    "extremely positive": "extremely positive",
}
ORDER = ["extremely negative","negative","neutral","positive","extremely positive"]
LABEL2ID = {lab: i for i, lab in enumerate(ORDER)}
ID2LABEL = {i: lab for lab, i in LABEL2ID.items()}

def normalize_label(s: str) -> str:
    s = str(s).strip().lower()
    s = s.replace("very negative", "extremely negative")
    s = s.replace("very positive", "extremely positive")
    s = s.replace("extreme negative", "extremely negative")
    s = s.replace("extreme positive", "extremely positive")
    return CANON.get(s, s)

# -------------------------
# Expect df_train, df_test in memory
# -------------------------
assert "OriginalTweet" in df_train.columns and "Sentiment" in df_train.columns, "df_train missing required columns"
assert "OriginalTweet" in df_test.columns and "Sentiment" in df_test.columns, "df_test missing required columns"

def prep_df(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df = df.dropna(subset=["OriginalTweet", "Sentiment"])
    df["text"] = df["OriginalTweet"].astype(str).str.strip()
    df["label_name"] = df["Sentiment"].apply(normalize_label)
    df = df[df["label_name"].isin(ORDER)].reset_index(drop=True)
    df["label"] = df["label_name"].map(LABEL2ID)
    return df[["text", "label", "label_name"]]

dftrain_ = prep_df(df_train)
dftest_  = prep_df(df_test)

train_df, val_df = train_test_split(
    dftrain_, test_size=0.1, stratify=dftrain_["label"], random_state=SEED
)
print(f"Train/Val/Test sizes: {len(train_df)}/{len(val_df)}/{len(dftest_)}")



IDX_EXT_NEG = LABEL2ID["extremely negative"]
IDX_NEG     = LABEL2ID["negative"]
IDX_NEU     = LABEL2ID["neutral"]
IDX_POS     = LABEL2ID["positive"]
IDX_EXT_POS = LABEL2ID["extremely positive"]

_num_classes = len(ORDER)

# knobs: make mids heavier than extremes
MID_WEIGHT = 1.5   # applied to negative/neutral/positive
EXT_WEIGHT = 0.8   # applied to extremely negative/positive (less than MID_WEIGHT)

weights = np.full(_num_classes, EXT_WEIGHT, dtype=np.float32)
weights[[IDX_NEG, IDX_NEU, IDX_POS]] = MID_WEIGHT

# normalize so average weight is 1.0 (optional but recommended)
weights = weights * (_num_classes / weights.sum())

CLASS_WEIGHTS = torch.tensor(weights, dtype=torch.float, device=DEVICE)

print("[loss] Using tiered class weights; per label:")
print({ID2LABEL[i]: float(CLASS_WEIGHTS[i].item()) for i in range(_num_classes)})

# -------------------------
# Dataset & Collator
# -------------------------
class TweetDataset(Dataset):
    def __init__(self, df: pd.DataFrame, tokenizer: transformers.PreTrainedTokenizerBase, max_len: int):
        self.texts = df["text"].tolist()
        self.labels = df["label"].tolist()
        self.tok = tokenizer
        self.max_len = max_len
    def __len__(self): return len(self.texts)
    def __getitem__(self, idx):
        enc = self.tok(self.texts[idx], truncation=True, max_length=self.max_len, padding=False)
        enc["labels"] = self.labels[idx]
        return {k: torch.tensor(v) for k, v in enc.items()}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
train_ds = TweetDataset(train_df, tokenizer, MAX_LEN)
val_ds   = TweetDataset(val_df, tokenizer, MAX_LEN)
test_ds  = TweetDataset(dftest_, tokenizer, MAX_LEN)

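# Loaders are built once with this fixed batch size; the batch size suggested by Optuna
# is logged as "suggested_batch_size" but is not applied to these loaders.
BATCH_SIZE = 16
# ---- pad_to_multiple_of=8 for Tensor Cores; Windows: workers=0 is often faster ----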
collate_fn = DataCollatorWithPadding(tokenizer=tokenizer, pad_to_multiple_of=8)
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True,  collate_fn=collate_fn, num_workers=0, pin_memory=True, persistent_workers=False)
val_loader   = DataLoader(val_ds,   batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn, num_workers=0, pin_memory=True, persistent_workers=False)
test_loader  = DataLoader(test_ds,  batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn, num_workers=0, pin_memory=True, persistent_workers=False)

# -------------------------
# Model & Freeze/Unfreeze strategy
# -------------------------
def build_model(num_unfreeze_last_layers: int = 4):
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=len(ORDER), id2label=ID2LABEL, label2id=LABEL2ID
    )
    base = getattr(model, "roberta", None) or getattr(model, "bert", None) or getattr(model, "deberta", None)
    if base is not None:
        for p in base.parameters(): p.requires_grad = False
        if hasattr(base, "encoder") and hasattr(base.encoder, "layer"):
            k = num_unfreeze_last_layers
            if k > 0:
                for layer in base.encoder.layer[-k:]:
                    for p in layer.parameters(): p.requires_grad = True
    for p in model.classifier.parameters(): p.requires_grad = True
    return model.to(DEVICE)

# -------------------------
# Train / Eval utilities
# -------------------------
def get_optimizer_scheduler(model, num_training_steps: int, lr: float, weight_decay: float):
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {"params": [p for n, p in model.named_parameters() if p.requires_grad and not any(nd in n for nd in no_decay)], "weight_decay": weight_decay},
        {"params": [p for n, p in model.named_parameters() if p.requires_grad and any(nd in n for nd in no_decay)],  "weight_decay": 0.0},
    ]
    # try fused AdamW on CUDA (faster step) — falls back if unavailable
    try:
        optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=lr, weight_decay=weight_decay, fused=(DEVICE=="cuda"))
    except TypeError:
        optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=lr, weight_decay=weight_decay)
    num_warmup = int(num_training_steps * WARMUP_RATIO)
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup, num_training_steps=num_training_steps)
    return optimizer, scheduler

MIDS = [LABEL2ID["negative"], LABEL2ID["neutral"], LABEL2ID["positive"]]

def evaluate(model, loader) -> Dict[str, float]:
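    model.eval()
    preds, labels = [], []
    # pick the autocast dtype once; bf16 support is only queried when CUDA is in use,
    # since the check may fail on CPU-only builds of some PyTorch versions
    amp_dtype = torch.bfloat16 if (DEVICE == "cuda" and torch.cuda.is_bf16_supported()) else torch.float16
    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(DEVICE, non_blocking=True) for k, v in batch.items()}
            # AMP autocast for faster eval math
            with autocast(dtype=amp_dtype, enabled=(DEVICE == "cuda" and USE_AMP)):
                logits = model(**batch).logits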
            preds.extend(torch.argmax(logits, dim=-1).detach().cpu().tolist())
            labels.extend(batch["labels"].detach().cpu().tolist())
    acc = accuracy_score(labels, preds)
    p, r, f1, _ = precision_recall_fscore_support(labels, preds, average="macro", zero_division=0)
    # extra: mid-class F1 (negative/neutral/positive)
    p_mid, r_mid, f1_mid, _ = precision_recall_fscore_support(
        labels, preds, labels=MIDS, average="macro", zero_division=0
    )
    return {"acc": acc, "precision": p, "recall": r, "f1": f1, "f1_mid": f1_mid}
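
# --- Illustrative sketch (not used by the training code) ---
# `f1_mid` above is a macro average restricted to the label ids in MIDS. The
# hypothetical pure-Python helper below spells that computation out, assuming
# per-class F1 = 2*TP / (2*TP + FP + FN) and a score of 0 for a class that has
# neither true nor predicted samples (the same convention as zero_division=0).
def _sketch_macro_f1_subset(y_true, y_pred, subset):
    """Unweighted mean of per-class F1 over the label ids listed in `subset`."""
    per_class = []
    for c in subset:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class) if per_class else 0.0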

def train_one_run(hp: Dict) -> Tuple[str, Dict[str, float]]:
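    """
    Train one configuration with early stopping on mid-class F1.

    hp keys: run_name, num_unfreeze_last_layers, lr, weight_decay, epochs, patience,
             trial_number, batch_size (logged only; the loaders use the global BATCH_SIZE)
    Returns (path of the best checkpoint, validation metrics of that checkpoint).
    """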
    run_name = hp["run_name"]
    num_unfreeze = int(hp["num_unfreeze_last_layers"])
    lr = float(hp["lr"])
    wd = float(hp["weight_decay"])
    epochs   = int(hp.get("epochs",   FIXED_EPOCHS))
    patience = int(hp.get("patience", FIXED_PATIENCE))
    model = build_model(num_unfreeze)
    total_steps = int(math.ceil(len(train_loader) * epochs))
    optimizer, scheduler = get_optimizer_scheduler(model, total_steps, lr, wd)

    scaler = GradScaler(enabled=(DEVICE == "cuda" and USE_AMP))
    best_metric = -1.0
    no_improve = 0

    # ❗ save to a different folder + name to avoid collisions
    safe_run_name = run_name.replace("/", "__").replace("\\", "__")
    ckpt_dir = "checkpoints_midf1"
    os.makedirs(ckpt_dir, exist_ok=True)
    best_path = os.path.join(ckpt_dir, f"best_midf1_{safe_run_name}.pt")

    wandb_run = wandb.init(
        project=PROJECT,
        name=run_name,
        config={
            "model": MODEL_NAME,
            "max_len": MAX_LEN,
            "batch_size": BATCH_SIZE,
            "epochs": epochs,
            "lr": lr,
            "weight_decay": wd,
            "warmup_ratio": WARMUP_RATIO,
            "grad_clip": GRAD_CLIP,
            "num_unfreeze_last_layers": num_unfreeze,
            "trial_number": hp.get("trial_number", -1),
            "suggested_batch_size": hp.get("batch_size", BATCH_SIZE),
        },
        reinit=True,
    )

    # nicer W&B charts
    wandb.define_metric("epoch")
    wandb.define_metric("step")
    wandb.define_metric("train/*", step_metric="step")
    wandb.define_metric("val/*",   step_metric="epoch")

    # print + log trainable params
    total_params     = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable params: {trainable_params:,} / {total_params:,} "
          f"({100.0*trainable_params/total_params:.2f}%) ; unfreeze_last_k={num_unfreeze}")
    wandb.log({"params/total": total_params,
               "params/trainable": trainable_params,
               "params/ratio": trainable_params/max(1,total_params)}, step=0)

    for epoch in range(epochs):
        model.train()
        t0 = time.time()
        running_loss = 0.0

        for step, batch in enumerate(train_loader):
            batch = {k: v.to(DEVICE, non_blocking=True) for k, v in batch.items()}
            labels = batch.pop("labels")  # we compute weighted loss ourselves

            optimizer.zero_grad(set_to_none=True)
            # use BF16 if supported; else FP16
            with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),
                          enabled=(DEVICE == "cuda" and USE_AMP)):
                outputs = model(**batch)
                logits = outputs.logits
                # weighted CE (+ smoothing if available)
                try:
                    loss = F.cross_entropy(logits, labels, weight=CLASS_WEIGHTS, label_smoothing=0.05)
                except TypeError:
                    loss = F.cross_entropy(logits, labels, weight=CLASS_WEIGHTS)

            scaler.scale(loss).backward()
            if GRAD_CLIP is not None:
                scaler.unscale_(optimizer)
                torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
            scaler.step(optimizer); scaler.update(); scheduler.step()
            running_loss += loss.item()

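            if step % 20 == 0:
                # log against the global step so the W&B x-axis stays monotonic across epochs
                wandb.log({"train/loss": loss.item(),
                           "step": (epoch * len(train_loader)) + (step + 1),
                           "epoch": epoch + 1})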

            # periodic console + throughput log (about 10x per epoch)
            if step % max(1, len(train_loader)//10) == 0 or step == 1:
                avg_loss = running_loss / max(1, (step + 1))
                elapsed  = time.time() - t0
                items    = (step + 1) * BATCH_SIZE
                itps     = items / max(elapsed, 1e-6)
                print(f"[e{epoch+1} b{step+1}/{len(train_loader)}] loss={loss.item():.4f} avg={avg_loss:.4f} it/s={itps:.1f}")
                wandb.log({"train/avg_loss_so_far": avg_loss,
                           "train/items_per_sec": itps,
                           "step": (epoch * len(train_loader)) + (step + 1),
                           "epoch": epoch + 1})

        # epoch-end validation
        val_metrics = evaluate(model, val_loader)
        elapsed = time.time() - t0

        epoch_loss = running_loss / max(1, len(train_loader))
        current_lr = scheduler.get_last_lr()[0]
        wandb.log({
            "train/epoch_loss": epoch_loss,
            "val/acc": val_metrics["acc"],
            "val/precision": val_metrics["precision"],
            "val/recall": val_metrics["recall"],
            "val/f1": val_metrics["f1"],
            "val/mid_f1": val_metrics["f1_mid"],   # extra log (format unchanged elsewhere)
            "lr": current_lr,
            "time/epoch_sec": elapsed,
            "epoch": epoch + 1,
        })

        # Early stopping on mid-class F1 (prints stay the same)
        target_metric = val_metrics["f1_mid"]
        if target_metric > best_metric:
            best_metric = target_metric
            torch.save(model.state_dict(), best_path)
            no_improve = 0
            wandb_run.summary["best_val_f1"] = best_metric  # legacy key kept for compatibility; the value is mid-class F1, not macro F1
            wandb_run.summary["best_val_mid_f1"] = best_metric
            wandb_run.summary["best_checkpoint_path"] = best_path
            wandb.log({"val/best_f1_so_far": best_metric, "val/best_epoch": epoch + 1})
        else:
            no_improve += 1
            if no_improve >= patience:
                print(f"Early stopping at epoch {epoch+1}")
                break

        # console print line unchanged
        print(f"Epoch {epoch+1}/{epochs} | "
              f"loss={epoch_loss:.4f} | "
              f"val_acc={val_metrics['acc']:.4f} | val_f1={val_metrics['f1']:.4f} | time={elapsed:.1f}s")

    # Load best and return path + metrics on val for reference
    model.load_state_dict(torch.load(best_path, map_location=DEVICE))
    final_val = evaluate(model, val_loader)

    # store final val in W&B summary for quick sorting
    # (done before finish(): wandb.run is None once the run has been closed)
    wandb_run.summary["final_val_acc"] = final_val["acc"]
    wandb_run.summary["final_val_precision"] = final_val["precision"]
    wandb_run.summary["final_val_recall"] = final_val["recall"]
    wandb_run.summary["final_val_f1"] = final_val["f1"]
    wandb_run.summary["final_val_mid_f1"] = final_val["f1_mid"]

    wandb.finish()

    return best_path, final_val

# -------------------------
# Optuna hyperparameter tuning (ALWAYS ON)
# -------------------------

# Constants
FIXED_EPOCHS = 7
FIXED_PATIENCE = 3

def objective(trial: optuna.trial.Trial):
    params = {
        "run_name": f"{BASE_RUN_NAME}_optuna_trial_{trial.number}",
        "num_unfreeze_last_layers": trial.suggest_int("num_unfreeze_last_layers", 4, 12),
        "lr": trial.suggest_float("lr", 1e-5, 1e-3, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 1e-7, 1e-5, log=True),
        # NOTE: sampled and logged as "suggested_batch_size" only; train_loader keeps the global BATCH_SIZE
        "batch_size": trial.suggest_categorical("batch_size", [4, 8, 16, 32, 64]),
        "epochs": FIXED_EPOCHS,
        "patience": FIXED_PATIENCE,
        "trial_number": trial.number,
    }
    path, val_metrics = train_one_run(params)
    # console visibility per trial (unchanged)
    print(f"[Trial {trial.number}] f1={val_metrics['f1']:.4f} | "
          f"unfreeze_k={params['num_unfreeze_last_layers']} lr={params['lr']:.2e} "
          f"wd={params['weight_decay']:.1e} suggested_bs={params['batch_size']}")
    # report intermediate value for pruning if enabled (keep macro f1 for study objective)
    trial.report(val_metrics["f1"], step=1)
    return val_metrics["f1"]

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=TRIALS, show_progress_bar=True)
print("Best trial:", study.best_trial.number, "F1:", study.best_value)
best_params = {"run_name": f"{BASE_RUN_NAME}_best_optuna", **study.best_trial.params}

# also persist best params to a different folder/name
os.makedirs("checkpoints_midf1", exist_ok=True)
with open(os.path.join("checkpoints_midf1", "best_hparams_optuna_midf1.json"), "w") as f:
    json.dump({**best_params, "epochs": FIXED_EPOCHS, "patience": FIXED_PATIENCE}, f, indent=2)
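
The patience-based early stopping inside `train_one_run` can also be looked at in isolation. The following is a minimal sketch, not the notebook's training loop: `best_epoch_with_patience` is a hypothetical helper, and any score list passed to it is illustrative input rather than logged metrics. It assumes the same rule as the loop above: a strictly higher score resets the counter, and training stops once `patience` consecutive epochs bring no improvement.

```python
from typing import Iterable, Tuple


def best_epoch_with_patience(scores: Iterable[float], patience: int) -> Tuple[int, float, int]:
    """Replay per-epoch validation scores and apply patience-based early stopping.

    Returns (best_epoch, best_score, epochs_run); epochs are 1-indexed and
    best_epoch is 0 when no score was seen at all.
    """
    best_score, best_epoch = -1.0, 0
    no_improve, epochs_run = 0, 0
    for epoch, score in enumerate(scores, start=1):
        epochs_run = epoch
        if score > best_score:
            # strict improvement: remember it and reset the patience counter
            best_score, best_epoch, no_improve = score, epoch, 0
        else:
            no_improve += 1
            if no_improve >= patience:
                break  # patience exhausted, later epochs are never run
    return best_epoch, best_score, epochs_run
```

With `patience=3`, a run whose score peaks at epoch 1 and never improves again stops after epoch 4, which is the pattern behind an "Early stopping at epoch 4" message.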


Train/Val/Test sizes: 37039/4116/3798
[loss] Using tiered class weights; per label:
{'extremely negative': 0.6557376980781555, 'negative': 1.2295081615447998, 'neutral': 1.2295081615447998, 'positive': 1.2295081615447998, 'extremely positive': 0.6557376980781555}


[I 2025-08-17 17:48:02,852] A new study created in memory with name: no-name-03ead3fe-dcdf-4b18-bc81-53020fce6c05
  0%|          | 0/20 [00:00<?, ?it/s]
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  scaler = GradScaler(enabled=(DEVICE == "cuda" and USE_AMP))
wandb: ERROR Failed to detect the name of this notebook. You can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: adishalit1 (adishalit1-tel-aviv-university) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


Trainable params: 28,945,925 / 278,813,189 (10.38%) ; unfreeze_last_k=4


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e1 b1/2315] loss=1.6733 avg=1.6733 it/s=41.6
[e1 b2/2315] loss=1.6255 avg=1.6494 it/s=65.9
[e1 b232/2315] loss=1.5370 avg=1.5309 it/s=522.7
[e1 b463/2315] loss=1.3853 avg=1.4602 it/s=541.8
[e1 b694/2315] loss=1.0948 avg=1.3923 it/s=543.0
[e1 b925/2315] loss=1.0755 avg=1.3458 it/s=532.2
[e1 b1156/2315] loss=0.8271 avg=1.3049 it/s=529.5
[e1 b1387/2315] loss=0.9937 avg=1.2741 it/s=529.8
[e1 b1618/2315] loss=0.7183 avg=1.2491 it/s=529.8
[e1 b1849/2315] loss=1.2618 avg=1.2267 it/s=535.6
[e1 b2080/2315] loss=1.1084 avg=1.2071 it/s=540.1
[e1 b2311/2315] loss=1.3135 avg=1.1931 it/s=543.1


Epoch 1/7 | loss=1.1930 | val_acc=0.6045 | val_f1=0.6184 | time=72.8s
[e2 b1/2315] loss=1.0361 avg=1.0361 it/s=574.6
[e2 b2/2315] loss=0.7388 avg=0.8875 it/s=607.7


[e2 b232/2315] loss=0.9810 avg=1.0016 it/s=548.4
[e2 b463/2315] loss=0.9185 avg=1.0022 it/s=538.3
[e2 b694/2315] loss=1.0811 avg=0.9982 it/s=536.7
[e2 b925/2315] loss=1.1991 avg=0.9994 it/s=539.2
[e2 b1156/2315] loss=1.0520 avg=0.9887 it/s=541.1
[e2 b1387/2315] loss=1.1179 avg=0.9828 it/s=545.9
[e2 b1618/2315] loss=0.7024 avg=0.9805 it/s=550.4
[e2 b1849/2315] loss=0.9609 avg=0.9753 it/s=553.9
[e2 b2080/2315] loss=1.1449 avg=0.9726 it/s=557.1
[e2 b2311/2315] loss=1.0524 avg=0.9694 it/s=557.8


Epoch 2/7 | loss=0.9694 | val_acc=0.6540 | val_f1=0.6598 | time=71.1s
[e3 b1/2315] loss=0.9008 avg=0.9008 it/s=475.1
[e3 b2/2315] loss=1.0006 avg=0.9507 it/s=512.6


[e3 b232/2315] loss=0.8123 avg=0.9059 it/s=560.1
[e3 b463/2315] loss=0.9463 avg=0.9069 it/s=555.8
[e3 b694/2315] loss=0.8118 avg=0.9048 it/s=558.0
[e3 b925/2315] loss=0.8971 avg=0.8977 it/s=552.8
[e3 b1156/2315] loss=0.8305 avg=0.8912 it/s=552.1
[e3 b1387/2315] loss=0.8575 avg=0.8902 it/s=552.0
[e3 b1618/2315] loss=0.9606 avg=0.8869 it/s=553.6
[e3 b1849/2315] loss=0.8780 avg=0.8857 it/s=554.5
[e3 b2080/2315] loss=0.9512 avg=0.8838 it/s=556.4
[e3 b2311/2315] loss=0.6447 avg=0.8835 it/s=558.1


Epoch 3/7 | loss=0.8835 | val_acc=0.6771 | val_f1=0.6883 | time=70.9s
[e4 b1/2315] loss=0.8827 avg=0.8827 it/s=475.6
[e4 b2/2315] loss=0.7565 avg=0.8196 it/s=584.8


[e4 b232/2315] loss=0.7485 avg=0.8111 it/s=578.4
[e4 b463/2315] loss=0.8027 avg=0.8259 it/s=581.7
[e4 b694/2315] loss=1.2966 avg=0.8357 it/s=582.5
[e4 b925/2315] loss=0.7232 avg=0.8380 it/s=583.2
[e4 b1156/2315] loss=0.7594 avg=0.8384 it/s=580.9
[e4 b1387/2315] loss=0.7920 avg=0.8375 it/s=562.6
[e4 b1618/2315] loss=0.6902 avg=0.8361 it/s=513.8
[e4 b1849/2315] loss=0.8842 avg=0.8365 it/s=492.1
[e4 b2080/2315] loss=0.7169 avg=0.8346 it/s=469.3
[e4 b2311/2315] loss=0.7919 avg=0.8334 it/s=466.4


Epoch 4/7 | loss=0.8333 | val_acc=0.6878 | val_f1=0.6987 | time=84.0s
[e5 b1/2315] loss=0.6985 avg=0.6985 it/s=523.6
[e5 b2/2315] loss=0.7175 avg=0.7080 it/s=512.1


[e5 b232/2315] loss=0.5846 avg=0.7906 it/s=573.6
[e5 b463/2315] loss=0.5905 avg=0.7949 it/s=574.1
[e5 b694/2315] loss=0.7464 avg=0.7880 it/s=574.9
[e5 b925/2315] loss=0.8627 avg=0.7908 it/s=576.9
[e5 b1156/2315] loss=0.8003 avg=0.7881 it/s=578.7
[e5 b1387/2315] loss=0.9702 avg=0.7803 it/s=579.0
[e5 b1618/2315] loss=1.3750 avg=0.7793 it/s=578.9
[e5 b1849/2315] loss=0.7967 avg=0.7804 it/s=579.0
[e5 b2080/2315] loss=1.6527 avg=0.7832 it/s=579.8
[e5 b2311/2315] loss=1.0325 avg=0.7822 it/s=579.9


Epoch 5/7 | loss=0.7822 | val_acc=0.7021 | val_f1=0.7124 | time=68.5s
[e6 b1/2315] loss=0.7811 avg=0.7811 it/s=530.9
[e6 b2/2315] loss=0.7376 avg=0.7593 it/s=590.3


[e6 b232/2315] loss=0.8131 avg=0.7399 it/s=562.7
[e6 b463/2315] loss=0.7439 avg=0.7363 it/s=532.3
[e6 b694/2315] loss=0.7031 avg=0.7400 it/s=532.4
[e6 b925/2315] loss=0.7046 avg=0.7426 it/s=528.0
[e6 b1156/2315] loss=0.5494 avg=0.7434 it/s=530.8
[e6 b1387/2315] loss=0.8384 avg=0.7432 it/s=537.2
[e6 b1618/2315] loss=0.7435 avg=0.7409 it/s=543.0
[e6 b1849/2315] loss=0.8034 avg=0.7402 it/s=547.1
[e6 b2080/2315] loss=0.6485 avg=0.7375 it/s=551.5
[e6 b2311/2315] loss=1.0068 avg=0.7374 it/s=551.6


Epoch 6/7 | loss=0.7375 | val_acc=0.7102 | val_f1=0.7190 | time=71.8s
[e7 b1/2315] loss=0.9489 avg=0.9489 it/s=739.5
[e7 b2/2315] loss=0.5953 avg=0.7721 it/s=519.1


[e7 b232/2315] loss=0.3971 avg=0.7104 it/s=536.0
[e7 b463/2315] loss=0.3893 avg=0.7008 it/s=543.2
[e7 b694/2315] loss=0.9501 avg=0.7126 it/s=542.7
[e7 b925/2315] loss=0.7576 avg=0.7151 it/s=553.1
[e7 b1156/2315] loss=0.8048 avg=0.7120 it/s=559.6
[e7 b1387/2315] loss=0.7710 avg=0.7090 it/s=563.6
[e7 b1618/2315] loss=0.7235 avg=0.7059 it/s=567.7
[e7 b1849/2315] loss=0.7555 avg=0.7040 it/s=569.4
[e7 b2080/2315] loss=0.6464 avg=0.7071 it/s=569.1
[e7 b2311/2315] loss=0.7173 avg=0.7086 it/s=568.9


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Epoch 7/7 | loss=0.7083 | val_acc=0.7199 | val_f1=0.7284 | time=69.7s


Run history:
epoch,▁▁▁▂▂▂▂▂▂▂▂▃▃▃▃▃▃▃▃▃▅▅▅▅▆▆▆▆▆▆▇▇▇███████
lr,█▇▆▅▃▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▂▂▂▃▃▁▂▃▃▃▅▃▁▁▆▂▂▇▂█▂▂▃▂▂▃▁▁▂▂▂▂▃▃▃▂▃
time/epoch_sec,▃▂▂█▁▃▂
train/avg_loss_so_far,█▇▅▅▅▅▅▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁
train/epoch_loss,█▅▄▃▂▁▁
train/items_per_sec,▁▁▆▆▆▆▇▆▆▆▆▆▆▅▆▆▆▆▅▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆▆█▆▆▆

Run summary:
best_checkpoint_path,checkpoints_midf1\be...
best_val_f1,0.71433
best_val_mid_f1,0.71433
epoch,7
lr,0
params/ratio,0.10382
params/total,278813189
params/trainable,28945925
step,16201
time/epoch_sec,69.69054


Best trial: 0. Best value: 0.728393:   5%|▌         | 1/20 [08:45<2:46:29, 525.74s/it]

[Trial 0] f1=0.7284 | unfreeze_k=4 lr=5.30e-05 wd=4.2e-07 suggested_bs=16
[I 2025-08-17 17:56:48,587] Trial 0 finished with value: 0.7283928888695556 and parameters: {'num_unfreeze_last_layers': 4, 'lr': 5.2983095099764093e-05, 'weight_decay': 4.176723737706654e-07, 'batch_size': 16}. Best is trial 0 with value: 0.7283928888695556.




Trainable params: 50,209,541 / 278,813,189 (18.01%) ; unfreeze_last_k=7
[e1 b1/2315] loss=1.5927 avg=1.5927 it/s=282.6
[e1 b2/2315] loss=1.6036 avg=1.5981 it/s=324.3


[e1 b232/2315] loss=1.6425 avg=1.4912 it/s=446.7
[e1 b463/2315] loss=1.4284 avg=1.3818 it/s=445.6
[e1 b694/2315] loss=1.3784 avg=1.3423 it/s=447.6
[e1 b925/2315] loss=1.3820 avg=1.3459 it/s=447.6
[e1 b1156/2315] loss=1.4607 avg=1.3760 it/s=452.1
[e1 b1387/2315] loss=1.5962 avg=1.3978 it/s=455.4
[e1 b1618/2315] loss=1.4588 avg=1.4137 it/s=458.5
[e1 b1849/2315] loss=1.5740 avg=1.4251 it/s=460.3
[e1 b2080/2315] loss=1.4836 avg=1.4328 it/s=462.0
[e1 b2311/2315] loss=1.3922 avg=1.4391 it/s=463.2


Epoch 1/7 | loss=1.4390 | val_acc=0.2775 | val_f1=0.0869 | time=84.5s
[e2 b1/2315] loss=1.5984 avg=1.5984 it/s=415.1
[e2 b2/2315] loss=1.4512 avg=1.5248 it/s=540.0


[e2 b232/2315] loss=1.5860 avg=1.4878 it/s=469.7
[e2 b463/2315] loss=1.5286 avg=1.4901 it/s=456.8
[e2 b694/2315] loss=1.5053 avg=1.4940 it/s=456.9
[e2 b925/2315] loss=1.5490 avg=1.4965 it/s=455.8
[e2 b1156/2315] loss=1.4608 avg=1.4969 it/s=456.7
[e2 b1387/2315] loss=1.4401 avg=1.4976 it/s=460.2
[e2 b1618/2315] loss=1.5514 avg=1.4972 it/s=462.9
[e2 b1849/2315] loss=1.6076 avg=1.4981 it/s=465.2
[e2 b2080/2315] loss=1.7110 avg=1.4969 it/s=467.1
[e2 b2311/2315] loss=1.6137 avg=1.4972 it/s=467.0


Epoch 2/7 | loss=1.4972 | val_acc=0.2775 | val_f1=0.0869 | time=83.8s
[e3 b1/2315] loss=1.5409 avg=1.5409 it/s=471.3
[e3 b2/2315] loss=1.3996 avg=1.4703 it/s=516.9


[e3 b232/2315] loss=1.4941 avg=1.4933 it/s=473.3
[e3 b463/2315] loss=1.5508 avg=1.4902 it/s=473.7
[e3 b694/2315] loss=1.5227 avg=1.4926 it/s=477.4
[e3 b925/2315] loss=1.4828 avg=1.4944 it/s=479.4
[e3 b1156/2315] loss=1.5496 avg=1.4951 it/s=480.7
[e3 b1387/2315] loss=1.6558 avg=1.4947 it/s=476.0
[e3 b1618/2315] loss=1.6504 avg=1.4964 it/s=466.0
[e3 b1849/2315] loss=1.4449 avg=1.4960 it/s=460.9
[e3 b2080/2315] loss=1.5094 avg=1.4960 it/s=458.2
[e3 b2311/2315] loss=1.5682 avg=1.4960 it/s=459.6


Epoch 3/7 | loss=1.4961 | val_acc=0.2775 | val_f1=0.0869 | time=85.3s
[e4 b1/2315] loss=1.4526 avg=1.4526 it/s=478.1
[e4 b2/2315] loss=1.5177 avg=1.4851 it/s=510.8


[e4 b232/2315] loss=1.4132 avg=1.4895 it/s=473.7
[e4 b463/2315] loss=1.4076 avg=1.4942 it/s=472.0
[e4 b694/2315] loss=1.4747 avg=1.4962 it/s=464.3
[e4 b925/2315] loss=1.4289 avg=1.4980 it/s=456.9
[e4 b1156/2315] loss=1.5809 avg=1.4975 it/s=454.2
[e4 b1387/2315] loss=1.5313 avg=1.4981 it/s=451.4
[e4 b1618/2315] loss=1.4691 avg=1.4976 it/s=454.6
[e4 b1849/2315] loss=1.4441 avg=1.4962 it/s=457.3
[e4 b2080/2315] loss=1.4313 avg=1.4963 it/s=452.5
[e4 b2311/2315] loss=1.5769 avg=1.4958 it/s=448.4


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Early stopping at epoch 4


Run history:
epoch,▁▁▁▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▃▃▆▆▆▆▆▆▆▆████████████
lr,█▆▃▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▂▄▂▂▂▃▁▁▁▂▂▂▂▅▆▂▃▃▁▁▂▂▂▂█
time/epoch_sec,▂▁▄█
train/avg_loss_so_far,██▅▂▁▂▃▃▃▃█▆▅▅▅▅▅▅▅▅▆▄▅▅▅▅▅▅▅▅▄▅▅▅▅▅▅▅▅▅
train/epoch_loss,▁███
train/items_per_sec,▁▂▅▅▅▆▆▆▆▆▅█▆▆▆▆▆▆▆▆▆▇▆▆▆▆▆▆▆▆▆▇▆▆▆▆▆▆▆▆

Run summary:
best_checkpoint_path,checkpoints_midf1\be...
best_val_f1,0.1448
best_val_mid_f1,0.1448
epoch,4
lr,9e-05
params/ratio,0.18008
params/total,278813189
params/trainable,50209541
step,9256
time/epoch_sec,87.33525


Best trial: 0. Best value: 0.728393:  10%|█         | 2/20 [14:36<2:06:47, 422.63s/it]

[Trial 1] f1=0.0869 | unfreeze_k=7 lr=1.98e-04 wd=2.5e-06 suggested_bs=64
[I 2025-08-17 18:02:39,039] Trial 1 finished with value: 0.08687713959680486 and parameters: {'num_unfreeze_last_layers': 7, 'lr': 0.00019844028445501068, 'weight_decay': 2.540025143840635e-06, 'batch_size': 64}. Best is trial 0 with value: 0.7283928888695556.




Trainable params: 85,648,901 / 278,813,189 (30.72%) ; unfreeze_last_k=12
[e1 b1/2315] loss=1.5376 avg=1.5376 it/s=120.0
[e1 b2/2315] loss=1.6045 avg=1.5710 it/s=151.2


[e1 b232/2315] loss=1.4637 avg=1.5481 it/s=336.4
[e1 b463/2315] loss=1.4402 avg=1.5091 it/s=337.7
[e1 b694/2315] loss=1.3643 avg=1.4736 it/s=339.9
[e1 b925/2315] loss=1.1458 avg=1.4203 it/s=344.2
[e1 b1156/2315] loss=0.7086 avg=1.3655 it/s=347.5
[e1 b1387/2315] loss=1.1630 avg=1.3137 it/s=349.6
[e1 b1618/2315] loss=0.7530 avg=1.2621 it/s=348.3
[e1 b1849/2315] loss=0.6535 avg=1.2161 it/s=348.7
[e1 b2080/2315] loss=0.7317 avg=1.1761 it/s=349.3
[e1 b2311/2315] loss=0.9213 avg=1.1459 it/s=349.6


Epoch 1/7 | loss=1.1452 | val_acc=0.7279 | val_f1=0.7383 | time=110.8s
[e2 b1/2315] loss=0.6773 avg=0.6773 it/s=313.0
[e2 b2/2315] loss=0.8443 avg=0.7608 it/s=323.3


[e2 b232/2315] loss=0.7139 avg=0.7861 it/s=348.1
[e2 b463/2315] loss=0.6314 avg=0.7738 it/s=346.7
[e2 b694/2315] loss=0.5623 avg=0.7664 it/s=350.1
[e2 b925/2315] loss=0.4747 avg=0.7551 it/s=352.4
[e2 b1156/2315] loss=1.0590 avg=0.7501 it/s=354.7
[e2 b1387/2315] loss=0.4271 avg=0.7390 it/s=356.9
[e2 b1618/2315] loss=0.5642 avg=0.7342 it/s=358.1
[e2 b1849/2315] loss=0.4317 avg=0.7293 it/s=359.0
[e2 b2080/2315] loss=0.5881 avg=0.7224 it/s=359.7
[e2 b2311/2315] loss=0.9817 avg=0.7175 it/s=360.9


Epoch 2/7 | loss=0.7174 | val_acc=0.8110 | val_f1=0.8168 | time=107.2s
[e3 b1/2315] loss=0.3580 avg=0.3580 it/s=287.6
[e3 b2/2315] loss=0.2944 avg=0.3262 it/s=295.2


[e3 b232/2315] loss=0.6506 avg=0.6019 it/s=329.3
[e3 b463/2315] loss=0.5186 avg=0.6213 it/s=330.0
[e3 b694/2315] loss=0.5284 avg=0.6172 it/s=329.0
[e3 b925/2315] loss=0.4438 avg=0.6087 it/s=336.2
[e3 b1156/2315] loss=1.0252 avg=0.6083 it/s=341.0
[e3 b1387/2315] loss=0.3283 avg=0.6082 it/s=342.5
[e3 b1618/2315] loss=0.9633 avg=0.6093 it/s=342.1
[e3 b1849/2315] loss=0.6438 avg=0.6075 it/s=339.6
[e3 b2080/2315] loss=0.5634 avg=0.6054 it/s=341.7
[e3 b2311/2315] loss=0.7712 avg=0.6040 it/s=343.8


Epoch 3/7 | loss=0.6041 | val_acc=0.8190 | val_f1=0.8245 | time=112.4s
[e4 b1/2315] loss=0.4660 avg=0.4660 it/s=327.7
[e4 b2/2315] loss=0.7206 avg=0.5933 it/s=327.6


[e4 b232/2315] loss=0.5206 avg=0.5501 it/s=357.1
[e4 b463/2315] loss=0.5363 avg=0.5441 it/s=352.0
[e4 b694/2315] loss=0.3362 avg=0.5441 it/s=348.9
[e4 b925/2315] loss=0.5773 avg=0.5481 it/s=345.2
[e4 b1156/2315] loss=0.4762 avg=0.5490 it/s=345.3
[e4 b1387/2315] loss=0.4456 avg=0.5460 it/s=345.1
[e4 b1618/2315] loss=0.5660 avg=0.5432 it/s=345.7
[e4 b1849/2315] loss=0.3619 avg=0.5438 it/s=341.6
[e4 b2080/2315] loss=0.6659 avg=0.5401 it/s=324.7
[e4 b2311/2315] loss=0.5478 avg=0.5388 it/s=314.3


Epoch 4/7 | loss=0.5388 | val_acc=0.8379 | val_f1=0.8433 | time=125.9s
[e5 b1/2315] loss=0.5390 avg=0.5390 it/s=277.0
[e5 b2/2315] loss=0.2956 avg=0.4173 it/s=285.9


[e5 b232/2315] loss=0.4916 avg=0.5022 it/s=236.5
[e5 b463/2315] loss=0.4255 avg=0.4956 it/s=222.1
[e5 b694/2315] loss=0.7153 avg=0.4974 it/s=251.2
[e5 b925/2315] loss=0.6022 avg=0.4979 it/s=272.3
[e5 b1156/2315] loss=0.3790 avg=0.4996 it/s=286.3
[e5 b1387/2315] loss=0.6434 avg=0.4961 it/s=296.7
[e5 b1618/2315] loss=0.7472 avg=0.4968 it/s=304.8
[e5 b1849/2315] loss=0.5014 avg=0.4974 it/s=311.1
[e5 b2080/2315] loss=0.2798 avg=0.4955 it/s=316.2
[e5 b2311/2315] loss=0.2311 avg=0.4937 it/s=320.5


Epoch 5/7 | loss=0.4939 | val_acc=0.8491 | val_f1=0.8533 | time=120.2s
[e6 b1/2315] loss=0.7703 avg=0.7703 it/s=257.9
[e6 b2/2315] loss=0.6195 avg=0.6949 it/s=281.9


[e6 b232/2315] loss=0.2677 avg=0.4626 it/s=313.4
[e6 b463/2315] loss=0.7682 avg=0.4562 it/s=324.7
[e6 b694/2315] loss=0.3227 avg=0.4628 it/s=289.9
[e6 b925/2315] loss=0.6322 avg=0.4657 it/s=304.0
[e6 b1156/2315] loss=0.2930 avg=0.4632 it/s=313.7
[e6 b1387/2315] loss=0.2503 avg=0.4632 it/s=316.3
[e6 b1618/2315] loss=0.5553 avg=0.4647 it/s=319.2
[e6 b1849/2315] loss=0.3833 avg=0.4656 it/s=322.6
[e6 b2080/2315] loss=0.4047 avg=0.4644 it/s=326.6
[e6 b2311/2315] loss=0.6370 avg=0.4648 it/s=330.6


Epoch 6/7 | loss=0.4647 | val_acc=0.8557 | val_f1=0.8595 | time=116.6s
[e7 b1/2315] loss=0.2544 avg=0.2544 it/s=409.5
[e7 b2/2315] loss=0.3090 avg=0.2817 it/s=361.4


[e7 b232/2315] loss=0.4812 avg=0.4353 it/s=349.2
[e7 b463/2315] loss=0.3045 avg=0.4461 it/s=347.7
[e7 b694/2315] loss=0.5692 avg=0.4438 it/s=345.9
[e7 b925/2315] loss=0.2287 avg=0.4469 it/s=345.3
[e7 b1156/2315] loss=0.2801 avg=0.4441 it/s=346.7
[e7 b1387/2315] loss=0.4281 avg=0.4426 it/s=343.3
[e7 b1618/2315] loss=0.2483 avg=0.4458 it/s=326.8
[e7 b1849/2315] loss=0.2959 avg=0.4472 it/s=330.3
[e7 b2080/2315] loss=0.4078 avg=0.4447 it/s=333.9
[e7 b2311/2315] loss=0.3339 avg=0.4427 it/s=336.7


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Epoch 7/7 | loss=0.4427 | val_acc=0.8516 | val_f1=0.8557 | time=114.8s


Run history:
epoch,▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▃▅▅▅▅▅▅▅▆▆▆▇▇▇▇▇▇▇▇████
lr,█▇▆▅▃▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▂▂▁▁▂▂▆▃▄▁▇▃█▃▃▁▂▂▂▂▃▃▃▁▂▂▂▂▃▁▁▂▃▃▁▂▂▃
time/epoch_sec,▂▁▃█▆▅▄
train/avg_loss_so_far,██▇▇▇▆▃▄▄▄▄▄▄▄▄▃▃▃▃▃▃▃▂▃▃▃▃▃▃▂▂▂▂▄▂▂▂▁▂▂
train/epoch_loss,█▄▃▂▂▁▁
train/items_per_sec,▁▇▇████████▆▆▇▇▇▇▇▇▇▇▇▇▇▇▅▅▆▆▆▅▆▇▇▇█▇▇█▇

Run summary:
best_checkpoint_path,checkpoints_midf1\be...
best_val_f1,0.84848
best_val_mid_f1,0.84848
epoch,7
lr,0
params/ratio,0.30719
params/total,278813189
params/trainable,85648901
step,16201
time/epoch_sec,114.78056


Best trial: 2. Best value: 0.85947:  15%|█▌        | 3/20 [28:19<2:51:31, 605.40s/it] 

[Trial 2] f1=0.8595 | unfreeze_k=12 lr=1.06e-05 wd=1.7e-06 suggested_bs=32
[I 2025-08-17 18:16:21,939] Trial 2 finished with value: 0.8594704602547099 and parameters: {'num_unfreeze_last_layers': 12, 'lr': 1.0637481988876116e-05, 'weight_decay': 1.653206260053585e-06, 'batch_size': 32}. Best is trial 2 with value: 0.8594704602547099.




Trainable params: 85,648,901 / 278,813,189 (30.72%) ; unfreeze_last_k=12
[e1 b1/2315] loss=1.6446 avg=1.6446 it/s=136.1
[e1 b2/2315] loss=1.6214 avg=1.6330 it/s=185.5


[e1 b232/2315] loss=1.3890 avg=1.5095 it/s=342.7
[e1 b463/2315] loss=1.1294 avg=1.3942 it/s=339.9
[e1 b694/2315] loss=1.1207 avg=1.3047 it/s=341.5
[e1 b925/2315] loss=1.0801 avg=1.2418 it/s=345.8
[e1 b1156/2315] loss=0.9710 avg=1.2069 it/s=346.9
[e1 b1387/2315] loss=1.0736 avg=1.1785 it/s=348.9
[e1 b1618/2315] loss=1.1132 avg=1.1593 it/s=351.2
[e1 b1849/2315] loss=1.1802 avg=1.1434 it/s=353.0
[e1 b2080/2315] loss=1.1731 avg=1.1300 it/s=354.3
[e1 b2311/2315] loss=0.9765 avg=1.1218 it/s=355.5


Epoch 1/7 | loss=1.1219 | val_acc=0.6368 | val_f1=0.6380 | time=109.0s
[e2 b1/2315] loss=0.8502 avg=0.8502 it/s=286.5
[e2 b2/2315] loss=0.9686 avg=0.9094 it/s=315.6


[e2 b232/2315] loss=1.1656 avg=1.0495 it/s=349.0
[e2 b463/2315] loss=1.1218 avg=1.0487 it/s=340.1
[e2 b694/2315] loss=0.9700 avg=1.0386 it/s=319.8
[e2 b925/2315] loss=0.8910 avg=1.0220 it/s=294.8
[e2 b1156/2315] loss=1.2190 avg=1.0135 it/s=290.0
[e2 b1387/2315] loss=0.5802 avg=0.9955 it/s=298.8
[e2 b1618/2315] loss=0.8224 avg=0.9776 it/s=304.1
[e2 b1849/2315] loss=1.2742 avg=0.9673 it/s=307.2
[e2 b2080/2315] loss=0.9030 avg=0.9574 it/s=312.0
[e2 b2311/2315] loss=1.1679 avg=0.9525 it/s=316.6


Epoch 2/7 | loss=0.9523 | val_acc=0.7021 | val_f1=0.7119 | time=121.6s
[e3 b1/2315] loss=0.9251 avg=0.9251 it/s=291.6
[e3 b2/2315] loss=0.7322 avg=0.8287 it/s=300.4


[e3 b232/2315] loss=1.3976 avg=0.9818 it/s=346.9
[e3 b463/2315] loss=0.7354 avg=0.9315 it/s=346.5
[e3 b694/2315] loss=1.0098 avg=0.9605 it/s=347.0
[e3 b925/2315] loss=0.8022 avg=0.9674 it/s=346.9
[e3 b1156/2315] loss=1.0177 avg=0.9771 it/s=347.1
[e3 b1387/2315] loss=1.3215 avg=0.9897 it/s=347.4
[e3 b1618/2315] loss=1.0416 avg=1.0002 it/s=347.6
[e3 b1849/2315] loss=1.4204 avg=1.0080 it/s=349.4
[e3 b2080/2315] loss=0.7321 avg=1.0137 it/s=350.7
[e3 b2311/2315] loss=1.1182 avg=1.0088 it/s=351.9


Epoch 3/7 | loss=1.0091 | val_acc=0.6579 | val_f1=0.6675 | time=109.9s
[e4 b1/2315] loss=0.8839 avg=0.8839 it/s=370.7
[e4 b2/2315] loss=0.7811 avg=0.8325 it/s=429.1


[e4 b232/2315] loss=0.8668 avg=0.8935 it/s=365.7
[e4 b463/2315] loss=0.5863 avg=0.8775 it/s=359.4
[e4 b694/2315] loss=0.6483 avg=0.8682 it/s=354.1
[e4 b925/2315] loss=0.7832 avg=0.8607 it/s=352.4
[e4 b1156/2315] loss=1.4853 avg=0.8520 it/s=352.2
[e4 b1387/2315] loss=0.7102 avg=0.8478 it/s=354.7
[e4 b1618/2315] loss=0.9793 avg=0.8475 it/s=355.2
[e4 b1849/2315] loss=0.9759 avg=0.8454 it/s=354.7
[e4 b2080/2315] loss=0.6764 avg=0.8398 it/s=355.3
[e4 b2311/2315] loss=1.0068 avg=0.8352 it/s=355.8


Epoch 4/7 | loss=0.8350 | val_acc=0.7379 | val_f1=0.7442 | time=108.8s
[e5 b1/2315] loss=0.4744 avg=0.4744 it/s=293.3
[e5 b2/2315] loss=0.5321 avg=0.5033 it/s=341.1


[e5 b232/2315] loss=0.7824 avg=0.7374 it/s=358.3
[e5 b463/2315] loss=0.9482 avg=0.7567 it/s=357.0
[e5 b694/2315] loss=0.8095 avg=0.7811 it/s=350.7
[e5 b925/2315] loss=0.7145 avg=0.7835 it/s=344.8
[e5 b1156/2315] loss=0.6692 avg=0.7846 it/s=342.8
[e5 b1387/2315] loss=0.9102 avg=0.7837 it/s=344.6
[e5 b1618/2315] loss=0.5888 avg=0.7837 it/s=346.7
[e5 b1849/2315] loss=0.6274 avg=0.7807 it/s=346.6
[e5 b2080/2315] loss=0.8852 avg=0.7770 it/s=344.6
[e5 b2311/2315] loss=0.6755 avg=0.7712 it/s=341.7


Epoch 5/7 | loss=0.7708 | val_acc=0.7592 | val_f1=0.7659 | time=113.2s
[e6 b1/2315] loss=0.7883 avg=0.7883 it/s=292.7
[e6 b2/2315] loss=0.3054 avg=0.5468 it/s=292.6


[e6 b232/2315] loss=0.8757 avg=0.6823 it/s=342.6
[e6 b463/2315] loss=0.6216 avg=0.6851 it/s=353.6
[e6 b694/2315] loss=1.2101 avg=0.6821 it/s=353.0
[e6 b925/2315] loss=0.4845 avg=0.6761 it/s=351.3
[e6 b1156/2315] loss=0.7987 avg=0.6750 it/s=349.4
[e6 b1387/2315] loss=0.3886 avg=0.6714 it/s=350.0
[e6 b1618/2315] loss=1.0782 avg=0.6723 it/s=347.5
[e6 b1849/2315] loss=0.2431 avg=0.6703 it/s=345.8
[e6 b2080/2315] loss=0.4843 avg=0.6681 it/s=344.8
[e6 b2311/2315] loss=0.6948 avg=0.6634 it/s=344.2


Epoch 6/7 | loss=0.6637 | val_acc=0.8015 | val_f1=0.8077 | time=112.3s
[e7 b1/2315] loss=0.6468 avg=0.6468 it/s=318.5
[e7 b2/2315] loss=0.4950 avg=0.5709 it/s=325.9


[e7 b232/2315] loss=0.8180 avg=0.6094 it/s=355.0
[e7 b463/2315] loss=0.2843 avg=0.6000 it/s=354.5
[e7 b694/2315] loss=0.5402 avg=0.6000 it/s=354.8
[e7 b925/2315] loss=0.3634 avg=0.5979 it/s=352.0
[e7 b1156/2315] loss=0.6099 avg=0.5971 it/s=344.6
[e7 b1387/2315] loss=1.0924 avg=0.5962 it/s=340.7
[e7 b1618/2315] loss=0.7196 avg=0.5950 it/s=339.2
[e7 b1849/2315] loss=0.5517 avg=0.5933 it/s=336.4
[e7 b2080/2315] loss=0.4896 avg=0.5906 it/s=338.0
[e7 b2311/2315] loss=0.2794 avg=0.5900 it/s=339.4


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Epoch 7/7 | loss=0.5898 | val_acc=0.8088 | val_f1=0.8138 | time=114.0s


metric,history
epoch,▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▃▃▅▅▅▅▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇██
lr,█▇▆▅▃▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▂▂▁▁▃▂▂▁▁▂▄▂▂▂▁▁▁▂▆▁▁▁▂▁▁▁▂▂██▂▁▁▁▂▂▂
time/epoch_sec,▁█▂▁▃▃▄
train/avg_loss_so_far,█▇▆▅▅▄▄▄▄▄▄▄▄▄▄▄▃▃▃▃▃▃▃▃▁▃▃▃▃▃▂▂▂▂▂▁▂▂▂▂
train/epoch_loss,█▆▇▄▃▂▁
train/items_per_sec,▁▇▇▇▇▇▇▆▇▆▆▅▇▇▇███▇▇▇▇▅▇▇▇▇▇▅▇▇▇▇▇▇▇▇▇▇▇

metric,value
best_checkpoint_path,checkpoints_midf1\be...
best_val_f1,0.80278
best_val_mid_f1,0.80278
epoch,7
lr,0
params/ratio,0.30719
params/total,278813189
params/trainable,85648901
step,16201
time/epoch_sec,113.96431


Best trial: 2. Best value: 0.85947:  20%|██        | 4/20 [41:42<3:02:18, 683.64s/it]

[Trial 3] f1=0.8138 | unfreeze_k=12 lr=1.16e-04 wd=1.1e-07 suggested_bs=32
[I 2025-08-17 18:29:45,520] Trial 3 finished with value: 0.8138106051208209 and parameters: {'num_unfreeze_last_layers': 12, 'lr': 0.0001163037715315009, 'weight_decay': 1.1385339326222966e-07, 'batch_size': 32}. Best is trial 2 with value: 0.8594704602547099.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 50,209,541 / 278,813,189 (18.01%) ; unfreeze_last_k=7
[e1 b1/2315] loss=1.5843 avg=1.5843 it/s=153.4
[e1 b2/2315] loss=1.5670 avg=1.5756 it/s=211.3


[e1 b232/2315] loss=1.5370 avg=1.5106 it/s=445.5
[e1 b463/2315] loss=1.2510 avg=1.4568 it/s=457.4
[e1 b694/2315] loss=1.0037 avg=1.3826 it/s=458.4
[e1 b925/2315] loss=1.2288 avg=1.3159 it/s=453.7
[e1 b1156/2315] loss=0.9966 avg=1.2571 it/s=451.6
[e1 b1387/2315] loss=1.0067 avg=1.2120 it/s=444.3
[e1 b1618/2315] loss=0.9697 avg=1.1740 it/s=441.9
[e1 b1849/2315] loss=0.9515 avg=1.1437 it/s=441.2
[e1 b2080/2315] loss=0.8614 avg=1.1153 it/s=443.4
[e1 b2311/2315] loss=1.2049 avg=1.0904 it/s=443.5


Epoch 1/7 | loss=1.0903 | val_acc=0.7184 | val_f1=0.7261 | time=88.5s
[e2 b1/2315] loss=0.9772 avg=0.9772 it/s=342.1
[e2 b2/2315] loss=0.8922 avg=0.9347 it/s=424.6


[e2 b232/2315] loss=0.7104 avg=0.7933 it/s=409.9
[e2 b463/2315] loss=1.1932 avg=0.8014 it/s=413.8
[e2 b694/2315] loss=0.8278 avg=0.7956 it/s=421.1
[e2 b925/2315] loss=0.6018 avg=0.7864 it/s=429.8
[e2 b1156/2315] loss=0.9132 avg=0.7812 it/s=437.7
[e2 b1387/2315] loss=0.5784 avg=0.7759 it/s=443.9
[e2 b1618/2315] loss=1.1907 avg=0.7705 it/s=448.4
[e2 b1849/2315] loss=0.9949 avg=0.7661 it/s=446.9
[e2 b2080/2315] loss=0.7466 avg=0.7621 it/s=447.5
[e2 b2311/2315] loss=0.5054 avg=0.7579 it/s=446.5


Epoch 2/7 | loss=0.7579 | val_acc=0.7838 | val_f1=0.7909 | time=87.9s
[e3 b1/2315] loss=0.8056 avg=0.8056 it/s=667.7
[e3 b2/2315] loss=0.5521 avg=0.6788 it/s=401.6


[e3 b232/2315] loss=0.5205 avg=0.6551 it/s=432.5
[e3 b463/2315] loss=0.4878 avg=0.6470 it/s=433.4
[e3 b694/2315] loss=0.8300 avg=0.6589 it/s=431.9
[e3 b925/2315] loss=0.7498 avg=0.6594 it/s=432.5
[e3 b1156/2315] loss=0.7103 avg=0.6561 it/s=435.8
[e3 b1387/2315] loss=0.6494 avg=0.6577 it/s=440.5
[e3 b1618/2315] loss=0.3162 avg=0.6509 it/s=439.1
[e3 b1849/2315] loss=1.0125 avg=0.6457 it/s=443.2
[e3 b2080/2315] loss=0.4619 avg=0.6407 it/s=442.8
[e3 b2311/2315] loss=0.4783 avg=0.6378 it/s=443.5


Epoch 3/7 | loss=0.6375 | val_acc=0.7884 | val_f1=0.7938 | time=88.3s
[e4 b1/2315] loss=0.6501 avg=0.6501 it/s=416.9
[e4 b2/2315] loss=0.2998 avg=0.4750 it/s=453.1


[e4 b232/2315] loss=0.2392 avg=0.5663 it/s=452.9
[e4 b463/2315] loss=0.4482 avg=0.5585 it/s=453.7
[e4 b694/2315] loss=0.6198 avg=0.5641 it/s=452.3
[e4 b925/2315] loss=0.3448 avg=0.5640 it/s=450.4
[e4 b1156/2315] loss=0.4991 avg=0.5633 it/s=447.4
[e4 b1387/2315] loss=0.5408 avg=0.5599 it/s=450.7
[e4 b1618/2315] loss=0.5161 avg=0.5595 it/s=452.5
[e4 b1849/2315] loss=0.6372 avg=0.5581 it/s=454.2
[e4 b2080/2315] loss=0.4992 avg=0.5575 it/s=454.6
[e4 b2311/2315] loss=0.3993 avg=0.5537 it/s=454.8


Epoch 4/7 | loss=0.5536 | val_acc=0.8071 | val_f1=0.8131 | time=86.2s
[e5 b1/2315] loss=0.2158 avg=0.2158 it/s=553.4
[e5 b2/2315] loss=0.5483 avg=0.3821 it/s=417.8


[e5 b232/2315] loss=0.7368 avg=0.4895 it/s=461.3
[e5 b463/2315] loss=0.7909 avg=0.5012 it/s=463.0
[e5 b694/2315] loss=0.4637 avg=0.4998 it/s=464.1
[e5 b925/2315] loss=0.6852 avg=0.4970 it/s=465.3
[e5 b1156/2315] loss=0.6541 avg=0.4991 it/s=465.8
[e5 b1387/2315] loss=0.4896 avg=0.4981 it/s=461.2
[e5 b1618/2315] loss=1.0467 avg=0.5002 it/s=455.5
[e5 b1849/2315] loss=0.5786 avg=0.5014 it/s=448.6
[e5 b2080/2315] loss=0.6722 avg=0.5009 it/s=447.3
[e5 b2311/2315] loss=0.7440 avg=0.4998 it/s=448.0


Epoch 5/7 | loss=0.4995 | val_acc=0.8256 | val_f1=0.8299 | time=87.3s
[e6 b1/2315] loss=0.5835 avg=0.5835 it/s=465.3
[e6 b2/2315] loss=0.2142 avg=0.3988 it/s=489.0


[e6 b232/2315] loss=0.2293 avg=0.4460 it/s=479.8
[e6 b463/2315] loss=0.2693 avg=0.4565 it/s=453.9
[e6 b694/2315] loss=0.3337 avg=0.4551 it/s=444.6
[e6 b925/2315] loss=0.3470 avg=0.4570 it/s=443.6
[e6 b1156/2315] loss=0.9098 avg=0.4604 it/s=445.6
[e6 b1387/2315] loss=0.7737 avg=0.4588 it/s=451.4
[e6 b1618/2315] loss=0.2355 avg=0.4577 it/s=455.3
[e6 b1849/2315] loss=1.0014 avg=0.4554 it/s=458.5
[e6 b2080/2315] loss=0.3455 avg=0.4545 it/s=455.3
[e6 b2311/2315] loss=0.3751 avg=0.4539 it/s=453.8


Epoch 6/7 | loss=0.4538 | val_acc=0.8355 | val_f1=0.8390 | time=86.4s
[e7 b1/2315] loss=0.6423 avg=0.6423 it/s=394.0
[e7 b2/2315] loss=0.4174 avg=0.5299 it/s=431.6


[e7 b232/2315] loss=0.3695 avg=0.4346 it/s=437.2
[e7 b463/2315] loss=0.3073 avg=0.4237 it/s=436.7
[e7 b694/2315] loss=0.2824 avg=0.4183 it/s=443.0
[e7 b925/2315] loss=0.3586 avg=0.4159 it/s=445.6
[e7 b1156/2315] loss=0.7816 avg=0.4162 it/s=442.7
[e7 b1387/2315] loss=0.3165 avg=0.4155 it/s=438.3
[e7 b1618/2315] loss=0.3477 avg=0.4138 it/s=440.1
[e7 b1849/2315] loss=0.6615 avg=0.4146 it/s=444.4
[e7 b2080/2315] loss=0.2646 avg=0.4131 it/s=447.4
[e7 b2311/2315] loss=0.6028 avg=0.4138 it/s=449.1


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Epoch 7/7 | loss=0.4139 | val_acc=0.8389 | val_f1=0.8421 | time=88.0s


metric,history
epoch,▁▁▁▁▁▂▂▂▂▂▂▂▂▂▃▃▃▃▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇█████
lr,█▇▆▅▃▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▁▁▂▁▂▂▂▁▁▁▂▁▁▁▁▅▂▂▂▂▁▁▁▂▂▂▂▁▁▁▁▁▂▁▁▇█
time/epoch_sec,█▆▇▁▄▁▆
train/avg_loss_so_far,██▇▆▆▆▅▅▄▄▄▄▄▃▃▃▃▂▃▃▃▃▃▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂
train/epoch_loss,█▅▃▂▂▁▁
train/items_per_sec,▁▂▅▅▅▅▅▄▄▅▅▅▅█▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▄▅▅▅▅▅

metric,value
best_checkpoint_path,checkpoints_midf1\be...
best_val_f1,0.83375
best_val_mid_f1,0.83375
epoch,7
lr,0
params/ratio,0.18008
params/total,278813189
params/trainable,50209541
step,16201
time/epoch_sec,88.01029


Best trial: 2. Best value: 0.85947:  25%|██▌       | 5/20 [52:11<2:45:57, 663.82s/it]

[Trial 4] f1=0.8421 | unfreeze_k=7 lr=5.05e-05 wd=4.4e-07 suggested_bs=64
[I 2025-08-17 18:40:14,211] Trial 4 finished with value: 0.8420752118483215 and parameters: {'num_unfreeze_last_layers': 7, 'lr': 5.0537788313468655e-05, 'weight_decay': 4.4345882224489024e-07, 'batch_size': 64}. Best is trial 2 with value: 0.8594704602547099.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 43,121,669 / 278,813,189 (15.47%) ; unfreeze_last_k=6
[e1 b1/2315] loss=1.6119 avg=1.6119 it/s=142.2
[e1 b2/2315] loss=1.5824 avg=1.5971 it/s=199.6


[e1 b232/2315] loss=1.1728 avg=1.4827 it/s=436.9
[e1 b463/2315] loss=1.4331 avg=1.4333 it/s=437.5
[e1 b694/2315] loss=1.3007 avg=1.4168 it/s=456.9
[e1 b925/2315] loss=1.5468 avg=1.4107 it/s=463.5
[e1 b1156/2315] loss=1.2210 avg=1.4115 it/s=466.1
[e1 b1387/2315] loss=1.3763 avg=1.4245 it/s=464.0
[e1 b1618/2315] loss=1.5808 avg=1.4345 it/s=459.7
[e1 b1849/2315] loss=1.6804 avg=1.4442 it/s=464.2
[e1 b2080/2315] loss=1.4458 avg=1.4509 it/s=467.1
[e1 b2311/2315] loss=1.4866 avg=1.4559 it/s=464.8


Epoch 1/7 | loss=1.4558 | val_acc=0.2775 | val_f1=0.0869 | time=84.4s
[e2 b1/2315] loss=1.6422 avg=1.6422 it/s=468.1
[e2 b2/2315] loss=1.5409 avg=1.5916 it/s=472.0


[e2 b232/2315] loss=1.5395 avg=1.5056 it/s=495.4
[e2 b463/2315] loss=1.5010 avg=1.5001 it/s=491.2
[e2 b694/2315] loss=1.5701 avg=1.5012 it/s=492.1
[e2 b925/2315] loss=1.5392 avg=1.5011 it/s=492.7
[e2 b1156/2315] loss=1.6482 avg=1.4999 it/s=492.9
[e2 b1387/2315] loss=1.5625 avg=1.4994 it/s=482.8
[e2 b1618/2315] loss=1.4250 avg=1.4981 it/s=474.1
[e2 b1849/2315] loss=1.4628 avg=1.4976 it/s=466.6
[e2 b2080/2315] loss=1.5461 avg=1.4989 it/s=465.8
[e2 b2311/2315] loss=1.7147 avg=1.4981 it/s=467.9


Epoch 2/7 | loss=1.4980 | val_acc=0.2410 | val_f1=0.0777 | time=84.0s
[e3 b1/2315] loss=1.4005 avg=1.4005 it/s=913.8
[e3 b2/2315] loss=1.4206 avg=1.4105 it/s=497.8


[e3 b232/2315] loss=1.4458 avg=1.4947 it/s=476.9
[e3 b463/2315] loss=1.5363 avg=1.4951 it/s=484.3
[e3 b694/2315] loss=1.5966 avg=1.4958 it/s=478.2
[e3 b925/2315] loss=1.4265 avg=1.4953 it/s=474.9
[e3 b1156/2315] loss=1.5487 avg=1.4955 it/s=472.6
[e3 b1387/2315] loss=1.5232 avg=1.4949 it/s=472.0
[e3 b1618/2315] loss=1.5481 avg=1.4956 it/s=475.7
[e3 b1849/2315] loss=1.4103 avg=1.4943 it/s=479.1
[e3 b2080/2315] loss=1.4563 avg=1.4953 it/s=480.9
[e3 b2311/2315] loss=1.5767 avg=1.4961 it/s=483.6


Epoch 3/7 | loss=1.4961 | val_acc=0.2775 | val_f1=0.0869 | time=81.5s
[e4 b1/2315] loss=1.4743 avg=1.4743 it/s=516.4
[e4 b2/2315] loss=1.5351 avg=1.5047 it/s=507.8


[e4 b232/2315] loss=1.5218 avg=1.4992 it/s=461.0
[e4 b463/2315] loss=1.4928 avg=1.4938 it/s=467.1
[e4 b694/2315] loss=1.3570 avg=1.4937 it/s=470.4
[e4 b925/2315] loss=1.4071 avg=1.4928 it/s=464.9
[e4 b1156/2315] loss=1.4632 avg=1.4940 it/s=461.7
[e4 b1387/2315] loss=1.4971 avg=1.4942 it/s=461.4
[e4 b1618/2315] loss=1.3537 avg=1.4928 it/s=463.2
[e4 b1849/2315] loss=1.5375 avg=1.4933 it/s=465.4
[e4 b2080/2315] loss=1.5231 avg=1.4948 it/s=464.1
[e4 b2311/2315] loss=1.3656 avg=1.4951 it/s=462.4


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Early stopping at epoch 4


metric,history
epoch,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▃▃▃▃▆▆▆▆▆▆▆▆▆▆▆███████
lr,█▆▃▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▂▂▂▃▄▅▆▇██▁▁▂▅▅▅▇▇▇▇▂▂▃▃▄▄▄▄▅▆▆▆▇▂▃▅▆█
time/epoch_sec,▇▆▁█
train/avg_loss_so_far,▇▇▃▂▁▁▂▂▂▂█▇▄▄▄▄▄▄▄▄▁▁▄▄▄▄▄▄▄▄▃▄▄▄▄▄▄▄▄▄
train/epoch_loss,▁███
train/items_per_sec,▁▂▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄█▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄

metric,value
best_checkpoint_path,checkpoints_midf1\be...
best_val_f1,0.1448
best_val_mid_f1,0.1448
epoch,4
lr,0.0002
params/ratio,0.15466
params/total,278813189
params/trainable,43121669
step,9256
time/epoch_sec,85.032


Best trial: 2. Best value: 0.85947:  30%|███       | 6/20 [57:56<2:09:35, 555.41s/it]

[Trial 5] f1=0.0869 | unfreeze_k=6 lr=4.46e-04 wd=1.5e-07 suggested_bs=64
[I 2025-08-17 18:45:59,159] Trial 5 finished with value: 0.08687713959680486 and parameters: {'num_unfreeze_last_layers': 6, 'lr': 0.000446214650582172, 'weight_decay': 1.4923083393062914e-07, 'batch_size': 64}. Best is trial 2 with value: 0.8594704602547099.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 28,945,925 / 278,813,189 (10.38%) ; unfreeze_last_k=4
[e1 b1/2315] loss=1.6452 avg=1.6452 it/s=152.8
[e1 b2/2315] loss=1.6148 avg=1.6300 it/s=231.7


[e1 b232/2315] loss=1.4870 avg=1.5740 it/s=508.0
[e1 b463/2315] loss=1.5737 avg=1.5264 it/s=521.9
[e1 b694/2315] loss=1.4736 avg=1.4940 it/s=533.6
[e1 b925/2315] loss=1.2739 avg=1.4526 it/s=531.1
[e1 b1156/2315] loss=1.3431 avg=1.4133 it/s=529.1
[e1 b1387/2315] loss=1.0684 avg=1.3790 it/s=525.6
[e1 b1618/2315] loss=1.2092 avg=1.3480 it/s=521.9
[e1 b1849/2315] loss=1.3150 avg=1.3232 it/s=526.6
[e1 b2080/2315] loss=1.2876 avg=1.3003 it/s=530.2
[e1 b2311/2315] loss=0.9985 avg=1.2816 it/s=532.2


Epoch 1/7 | loss=1.2812 | val_acc=0.5688 | val_f1=0.5838 | time=74.5s
[e2 b1/2315] loss=1.1050 avg=1.1050 it/s=439.3
[e2 b2/2315] loss=1.1648 avg=1.1349 it/s=499.1


[e2 b232/2315] loss=1.0485 avg=1.0945 it/s=540.2
[e2 b463/2315] loss=1.1071 avg=1.0819 it/s=552.7
[e2 b694/2315] loss=1.1783 avg=1.0662 it/s=556.7
[e2 b925/2315] loss=1.1136 avg=1.0603 it/s=559.5
[e2 b1156/2315] loss=1.1078 avg=1.0547 it/s=561.3
[e2 b1387/2315] loss=1.0662 avg=1.0500 it/s=561.7
[e2 b1618/2315] loss=1.4172 avg=1.0454 it/s=562.4
[e2 b1849/2315] loss=1.1393 avg=1.0437 it/s=562.7
[e2 b2080/2315] loss=1.0408 avg=1.0425 it/s=563.4
[e2 b2311/2315] loss=1.3159 avg=1.0391 it/s=564.3


Epoch 2/7 | loss=1.0392 | val_acc=0.6147 | val_f1=0.6280 | time=70.4s
[e3 b1/2315] loss=1.1742 avg=1.1742 it/s=492.4
[e3 b2/2315] loss=1.2648 avg=1.2195 it/s=494.4


[e3 b232/2315] loss=0.9644 avg=0.9748 it/s=497.6
[e3 b463/2315] loss=1.3069 avg=0.9716 it/s=514.8
[e3 b694/2315] loss=0.8331 avg=0.9754 it/s=516.1
[e3 b925/2315] loss=1.1875 avg=0.9809 it/s=522.2
[e3 b1156/2315] loss=0.9228 avg=0.9800 it/s=530.5
[e3 b1387/2315] loss=1.0092 avg=0.9794 it/s=536.9
[e3 b1618/2315] loss=1.0481 avg=0.9778 it/s=542.0
[e3 b1849/2315] loss=1.0338 avg=0.9767 it/s=544.9
[e3 b2080/2315] loss=0.8454 avg=0.9761 it/s=541.5
[e3 b2311/2315] loss=0.9836 avg=0.9742 it/s=537.6


Epoch 3/7 | loss=0.9741 | val_acc=0.6317 | val_f1=0.6430 | time=73.6s
[e4 b1/2315] loss=1.3292 avg=1.3292 it/s=448.4
[e4 b2/2315] loss=0.8172 avg=1.0732 it/s=479.8


[e4 b232/2315] loss=1.0180 avg=0.9309 it/s=535.5
[e4 b463/2315] loss=0.5214 avg=0.9399 it/s=546.2
[e4 b694/2315] loss=1.0001 avg=0.9449 it/s=554.0
[e4 b925/2315] loss=1.0522 avg=0.9341 it/s=557.8
[e4 b1156/2315] loss=1.0924 avg=0.9331 it/s=558.7
[e4 b1387/2315] loss=1.0997 avg=0.9363 it/s=558.1
[e4 b1618/2315] loss=0.9961 avg=0.9331 it/s=555.5
[e4 b1849/2315] loss=1.2708 avg=0.9330 it/s=554.0
[e4 b2080/2315] loss=1.0516 avg=0.9317 it/s=553.4
[e4 b2311/2315] loss=0.8338 avg=0.9309 it/s=551.8


Epoch 4/7 | loss=0.9309 | val_acc=0.6351 | val_f1=0.6483 | time=71.9s
[e5 b1/2315] loss=0.8333 avg=0.8333 it/s=547.9
[e5 b2/2315] loss=0.7314 avg=0.7824 it/s=529.8


[e5 b232/2315] loss=0.7912 avg=0.9201 it/s=560.5
[e5 b463/2315] loss=1.0100 avg=0.9226 it/s=554.6
[e5 b694/2315] loss=0.7032 avg=0.9102 it/s=521.8
[e5 b925/2315] loss=0.8102 avg=0.9127 it/s=459.8
[e5 b1156/2315] loss=0.7983 avg=0.9114 it/s=445.0
[e5 b1387/2315] loss=0.8821 avg=0.9095 it/s=444.1
[e5 b1618/2315] loss=0.7161 avg=0.9097 it/s=458.0
[e5 b1849/2315] loss=1.1196 avg=0.9090 it/s=468.7
[e5 b2080/2315] loss=1.1344 avg=0.9060 it/s=479.2
[e5 b2311/2315] loss=0.8747 avg=0.9050 it/s=469.1


Epoch 5/7 | loss=0.9053 | val_acc=0.6535 | val_f1=0.6644 | time=83.8s
[e6 b1/2315] loss=1.0236 avg=1.0236 it/s=458.0
[e6 b2/2315] loss=0.9302 avg=0.9769 it/s=509.4


[e6 b232/2315] loss=0.7782 avg=0.8968 it/s=397.4
[e6 b463/2315] loss=0.9731 avg=0.8796 it/s=457.2
[e6 b694/2315] loss=0.9302 avg=0.8885 it/s=435.4
[e6 b925/2315] loss=0.7699 avg=0.8892 it/s=455.9
[e6 b1156/2315] loss=0.9374 avg=0.8898 it/s=440.8
[e6 b1387/2315] loss=1.1619 avg=0.8872 it/s=439.0
[e6 b1618/2315] loss=0.7270 avg=0.8867 it/s=441.9
[e6 b1849/2315] loss=0.9083 avg=0.8864 it/s=434.6
[e6 b2080/2315] loss=0.5783 avg=0.8844 it/s=441.3
[e6 b2311/2315] loss=0.9322 avg=0.8845 it/s=438.6


Epoch 6/7 | loss=0.8846 | val_acc=0.6550 | val_f1=0.6659 | time=89.1s
[e7 b1/2315] loss=1.1452 avg=1.1452 it/s=524.6
[e7 b2/2315] loss=0.7919 avg=0.9686 it/s=537.1


[e7 b232/2315] loss=0.8839 avg=0.8753 it/s=410.3
[e7 b463/2315] loss=0.7830 avg=0.8687 it/s=464.0
[e7 b694/2315] loss=0.6566 avg=0.8689 it/s=440.9
[e7 b925/2315] loss=0.8815 avg=0.8668 it/s=459.3
[e7 b1156/2315] loss=0.7697 avg=0.8702 it/s=440.8
[e7 b1387/2315] loss=0.7639 avg=0.8717 it/s=434.9
[e7 b1618/2315] loss=0.7008 avg=0.8676 it/s=437.1
[e7 b1849/2315] loss=1.1464 avg=0.8658 it/s=432.6
[e7 b2080/2315] loss=0.7640 avg=0.8663 it/s=443.5
[e7 b2311/2315] loss=0.7326 avg=0.8681 it/s=441.0


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Epoch 7/7 | loss=0.8680 | val_acc=0.6579 | val_f1=0.6692 | time=88.8s


metric,history
epoch,▁▁▁▁▁▂▂▂▃▃▃▃▃▃▃▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆▇▇████████
lr,█▇▆▅▃▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▁▂▂▂▂▂▁▂▂▂▂▁▁▄▄▂▁▁▂▂▂▂▁▂▂▂▂▂▂▂▂▁█▂▂▂▂
time/epoch_sec,▃▁▂▂▆██
train/avg_loss_so_far,██▇▇▇▆▆▄▄▃▃▃▄▃▃▃▄▂▂▂▂▂▁▁▂▂▂▂▃▂▂▂▂▄▃▂▂▂▂▂
train/epoch_loss,█▄▃▂▂▁▁
train/items_per_sec,▁▂▇▇▇▇██████▇▇▇███▆▇█████▇▆▆▆▅▆▆▆▆▇▆▆▆▆▆

metric,value
best_checkpoint_path,checkpoints_midf1\be...
best_val_f1,0.65052
best_val_mid_f1,0.65052
epoch,7
lr,0
params/ratio,0.10382
params/total,278813189
params/trainable,28945925
step,16201
time/epoch_sec,88.80332


Best trial: 2. Best value: 0.85947:  35%|███▌      | 7/20 [1:07:25<2:01:19, 559.99s/it]

[Trial 6] f1=0.6692 | unfreeze_k=4 lr=1.53e-05 wd=3.7e-07 suggested_bs=64
[I 2025-08-17 18:55:28,587] Trial 6 finished with value: 0.6692302476144392 and parameters: {'num_unfreeze_last_layers': 4, 'lr': 1.5302923725865004e-05, 'weight_decay': 3.682752200470921e-07, 'batch_size': 64}. Best is trial 2 with value: 0.8594704602547099.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 28,945,925 / 278,813,189 (10.38%) ; unfreeze_last_k=4
[e1 b1/2315] loss=1.6640 avg=1.6640 it/s=129.2
[e1 b2/2315] loss=1.6012 avg=1.6326 it/s=189.7


[e1 b232/2315] loss=1.2755 avg=1.4771 it/s=444.7
[e1 b463/2315] loss=1.1440 avg=1.3863 it/s=423.8
[e1 b694/2315] loss=1.5321 avg=1.3374 it/s=419.8
[e1 b925/2315] loss=1.3941 avg=1.3327 it/s=430.9
[e1 b1156/2315] loss=1.5648 avg=1.3477 it/s=427.8
[e1 b1387/2315] loss=1.3952 avg=1.3633 it/s=444.5
[e1 b1618/2315] loss=1.3948 avg=1.3655 it/s=439.0
[e1 b1849/2315] loss=1.6615 avg=1.3662 it/s=445.7
[e1 b2080/2315] loss=1.2340 avg=1.3688 it/s=435.8
[e1 b2311/2315] loss=1.1336 avg=1.3625 it/s=428.1


Epoch 1/7 | loss=1.3622 | val_acc=0.4130 | val_f1=0.2925 | time=93.1s
[e2 b1/2315] loss=1.1158 avg=1.1158 it/s=660.7
[e2 b2/2315] loss=1.4474 avg=1.2816 it/s=477.6


[e2 b232/2315] loss=1.1505 avg=1.2876 it/s=340.7
[e2 b463/2315] loss=1.2546 avg=1.2991 it/s=362.4
[e2 b694/2315] loss=1.4114 avg=1.2988 it/s=375.5
[e2 b925/2315] loss=1.0107 avg=1.2978 it/s=370.6
[e2 b1156/2315] loss=1.1883 avg=1.2984 it/s=363.9
[e2 b1387/2315] loss=1.1614 avg=1.2913 it/s=386.3
[e2 b1618/2315] loss=1.7033 avg=1.2873 it/s=405.0
[e2 b1849/2315] loss=1.2886 avg=1.2856 it/s=420.7
[e2 b2080/2315] loss=1.3015 avg=1.2896 it/s=433.5
[e2 b2311/2315] loss=1.4152 avg=1.2894 it/s=441.8


Epoch 2/7 | loss=1.2893 | val_acc=0.4507 | val_f1=0.3894 | time=88.6s
[e3 b1/2315] loss=1.1576 avg=1.1576 it/s=570.7
[e3 b2/2315] loss=1.3771 avg=1.2673 it/s=766.5


[e3 b232/2315] loss=1.6178 avg=1.2419 it/s=402.1
[e3 b463/2315] loss=1.1250 avg=1.2923 it/s=466.0
[e3 b694/2315] loss=1.5177 avg=1.2949 it/s=446.8
[e3 b925/2315] loss=1.3460 avg=1.3143 it/s=474.4
[e3 b1156/2315] loss=1.3050 avg=1.3120 it/s=455.5
[e3 b1387/2315] loss=1.1758 avg=1.3087 it/s=463.4
[e3 b1618/2315] loss=1.2069 avg=1.3030 it/s=462.5
[e3 b1849/2315] loss=1.3718 avg=1.2982 it/s=466.1
[e3 b2080/2315] loss=1.2080 avg=1.2962 it/s=464.3
[e3 b2311/2315] loss=1.3950 avg=1.2941 it/s=459.7


Epoch 3/7 | loss=1.2940 | val_acc=0.4096 | val_f1=0.2783 | time=86.0s
[e4 b1/2315] loss=1.1987 avg=1.1987 it/s=581.6
[e4 b2/2315] loss=1.4514 avg=1.3250 it/s=724.8


[e4 b232/2315] loss=1.4176 avg=1.2998 it/s=458.9
[e4 b463/2315] loss=1.2092 avg=1.2742 it/s=462.1
[e4 b694/2315] loss=1.2533 avg=1.2694 it/s=444.2
[e4 b925/2315] loss=1.7008 avg=1.2683 it/s=455.8
[e4 b1156/2315] loss=1.1732 avg=1.2619 it/s=444.0
[e4 b1387/2315] loss=1.0574 avg=1.2589 it/s=453.7
[e4 b1618/2315] loss=1.4307 avg=1.2553 it/s=445.7
[e4 b1849/2315] loss=1.6608 avg=1.2528 it/s=434.5
[e4 b2080/2315] loss=1.4488 avg=1.2526 it/s=445.9
[e4 b2311/2315] loss=1.0142 avg=1.2531 it/s=455.6


Epoch 4/7 | loss=1.2532 | val_acc=0.4402 | val_f1=0.3610 | time=86.1s
[e5 b1/2315] loss=1.2360 avg=1.2360 it/s=544.3
[e5 b2/2315] loss=1.3085 avg=1.2722 it/s=445.1


[e5 b232/2315] loss=1.1573 avg=1.2935 it/s=522.1
[e5 b463/2315] loss=1.3296 avg=1.2790 it/s=533.2
[e5 b694/2315] loss=1.2959 avg=1.2734 it/s=534.7
[e5 b925/2315] loss=1.1096 avg=1.2580 it/s=540.1
[e5 b1156/2315] loss=1.1865 avg=1.2471 it/s=547.1
[e5 b1387/2315] loss=0.8888 avg=1.2419 it/s=551.9
[e5 b1618/2315] loss=1.2276 avg=1.2367 it/s=554.3
[e5 b1849/2315] loss=1.2433 avg=1.2321 it/s=556.5
[e5 b2080/2315] loss=1.2693 avg=1.2309 it/s=555.7
[e5 b2311/2315] loss=1.0523 avg=1.2254 it/s=552.6


Epoch 5/7 | loss=1.2256 | val_acc=0.4866 | val_f1=0.4621 | time=71.7s
[e6 b1/2315] loss=1.0373 avg=1.0373 it/s=514.5
[e6 b2/2315] loss=1.0960 avg=1.0667 it/s=538.0


[e6 b232/2315] loss=1.1185 avg=1.1611 it/s=546.4
[e6 b463/2315] loss=1.0538 avg=1.1635 it/s=550.7
[e6 b694/2315] loss=1.1240 avg=1.1669 it/s=545.3
[e6 b925/2315] loss=1.1766 avg=1.1670 it/s=546.1
[e6 b1156/2315] loss=1.3895 avg=1.1702 it/s=544.5
[e6 b1387/2315] loss=1.1469 avg=1.1706 it/s=528.3
[e6 b1618/2315] loss=1.1224 avg=1.1692 it/s=499.1
[e6 b1849/2315] loss=0.9653 avg=1.1674 it/s=506.5
[e6 b2080/2315] loss=1.3382 avg=1.1649 it/s=512.4
[e6 b2311/2315] loss=1.2507 avg=1.1652 it/s=517.8


Epoch 6/7 | loss=1.1651 | val_acc=0.5141 | val_f1=0.4906 | time=76.2s
[e7 b1/2315] loss=0.9426 avg=0.9426 it/s=529.8
[e7 b2/2315] loss=0.9681 avg=0.9554 it/s=576.6


[e7 b232/2315] loss=1.5436 avg=1.1435 it/s=556.5
[e7 b463/2315] loss=0.9565 avg=1.1445 it/s=559.2
[e7 b694/2315] loss=0.8839 avg=1.1370 it/s=559.5
[e7 b925/2315] loss=1.0765 avg=1.1356 it/s=560.0
[e7 b1156/2315] loss=1.0799 avg=1.1346 it/s=553.1
[e7 b1387/2315] loss=0.8935 avg=1.1352 it/s=549.0
[e7 b1618/2315] loss=1.3306 avg=1.1372 it/s=548.3
[e7 b1849/2315] loss=1.0287 avg=1.1390 it/s=548.6
[e7 b2080/2315] loss=0.8820 avg=1.1370 it/s=549.8
[e7 b2311/2315] loss=1.0866 avg=1.1337 it/s=551.4


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Epoch 7/7 | loss=1.1339 | val_acc=0.5292 | val_f1=0.5295 | time=71.8s


metric,history
epoch,▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▃▃▃▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇█████
lr,█▇▆▅▃▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▂▂▂▂▂▁▂▂▁▁▁▂▂▂▂▂▆▂▂▆▁▇▂▃█▂▂▂▁▂▂▂▃▁▂▂▂▃
time/epoch_sec,█▇▆▆▁▂▁
train/avg_loss_so_far,██▄▄▄▄▁▃▃▃▃▃▃▄▄▃▃▃▂▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▁▁▁▁▁
train/epoch_loss,█▆▆▅▄▂▁
train/items_per_sec,▂▃▃▇▃▁▁▁▁▂▃▅▃▃▃▃▃▅█▃▃▃▃▄▄▅▅▅▅▄▄▄▄▄▄▅▅▅▅▅

metric,value
best_checkpoint_path,checkpoints_midf1\be...
best_val_f1,0.55514
best_val_mid_f1,0.55514
epoch,7
lr,0
params/ratio,0.10382
params/total,278813189
params/trainable,28945925
step,16201
time/epoch_sec,71.83584


Best trial: 2. Best value: 0.85947:  40%|████      | 8/20 [1:17:12<1:53:41, 568.44s/it]

[Trial 7] f1=0.4906 | unfreeze_k=4 lr=2.68e-04 wd=4.9e-06 suggested_bs=4
[I 2025-08-17 19:05:15,120] Trial 7 finished with value: 0.49057592373421377 and parameters: {'num_unfreeze_last_layers': 4, 'lr': 0.0002678536574669438, 'weight_decay': 4.92687088956682e-06, 'batch_size': 4}. Best is trial 2 with value: 0.8594704602547099.




Trainable params: 85,648,901 / 278,813,189 (30.72%) ; unfreeze_last_k=12
[e1 b1/2315] loss=1.5869 avg=1.5869 it/s=124.2
[e1 b2/2315] loss=1.6461 avg=1.6165 it/s=172.3




[e1 b232/2315] loss=1.5838 avg=1.5524 it/s=348.0
[e1 b463/2315] loss=1.2076 avg=1.4984 it/s=354.5
[e1 b694/2315] loss=0.9737 avg=1.4070 it/s=357.2
[e1 b925/2315] loss=0.8820 avg=1.3202 it/s=359.0
[e1 b1156/2315] loss=0.9148 avg=1.2488 it/s=360.8
[e1 b1387/2315] loss=0.9874 avg=1.1901 it/s=360.9
[e1 b1618/2315] loss=0.6321 avg=1.1409 it/s=357.6
[e1 b1849/2315] loss=1.0019 avg=1.1005 it/s=353.5
[e1 b2080/2315] loss=0.5528 avg=1.0660 it/s=348.7
[e1 b2311/2315] loss=0.3606 avg=1.0331 it/s=349.7




Epoch 1/7 | loss=1.0327 | val_acc=0.7670 | val_f1=0.7736 | time=110.7s
[e2 b1/2315] loss=0.6760 avg=0.6760 it/s=319.3
[e2 b2/2315] loss=0.6643 avg=0.6701 it/s=329.3




[e2 b232/2315] loss=0.7286 avg=0.6786 it/s=352.1
[e2 b463/2315] loss=0.4689 avg=0.6860 it/s=333.5
[e2 b694/2315] loss=1.1900 avg=0.6843 it/s=302.1
[e2 b925/2315] loss=0.4295 avg=0.6780 it/s=312.6
[e2 b1156/2315] loss=0.7583 avg=0.6702 it/s=321.9
[e2 b1387/2315] loss=0.4107 avg=0.6648 it/s=328.1
[e2 b1618/2315] loss=0.7407 avg=0.6606 it/s=332.0
[e2 b1849/2315] loss=0.6449 avg=0.6571 it/s=334.8
[e2 b2080/2315] loss=0.9956 avg=0.6515 it/s=335.9
[e2 b2311/2315] loss=0.6844 avg=0.6473 it/s=337.1




Epoch 2/7 | loss=0.6480 | val_acc=0.8241 | val_f1=0.8299 | time=114.5s
[e3 b1/2315] loss=0.6339 avg=0.6339 it/s=313.0
[e3 b2/2315] loss=0.6148 avg=0.6243 it/s=329.7




[e3 b232/2315] loss=0.4370 avg=0.5583 it/s=339.5
[e3 b463/2315] loss=0.4626 avg=0.5539 it/s=342.1
[e3 b694/2315] loss=0.3570 avg=0.5536 it/s=347.3
[e3 b925/2315] loss=1.3451 avg=0.5519 it/s=349.3
[e3 b1156/2315] loss=0.4050 avg=0.5446 it/s=351.7
[e3 b1387/2315] loss=0.7867 avg=0.5419 it/s=353.2
[e3 b1618/2315] loss=0.4217 avg=0.5399 it/s=353.9
[e3 b1849/2315] loss=0.5424 avg=0.5401 it/s=352.9
[e3 b2080/2315] loss=0.6636 avg=0.5387 it/s=351.9
[e3 b2311/2315] loss=0.4487 avg=0.5358 it/s=350.9




Epoch 3/7 | loss=0.5358 | val_acc=0.8540 | val_f1=0.8584 | time=110.3s
[e4 b1/2315] loss=0.3715 avg=0.3715 it/s=350.3
[e4 b2/2315] loss=0.4382 avg=0.4048 it/s=325.4




[e4 b232/2315] loss=0.3335 avg=0.4644 it/s=361.0
[e4 b463/2315] loss=0.3084 avg=0.4648 it/s=361.6
[e4 b694/2315] loss=0.2484 avg=0.4654 it/s=359.4
[e4 b925/2315] loss=0.7294 avg=0.4665 it/s=360.4
[e4 b1156/2315] loss=0.3830 avg=0.4651 it/s=360.8
[e4 b1387/2315] loss=0.2201 avg=0.4666 it/s=360.5
[e4 b1618/2315] loss=0.2773 avg=0.4671 it/s=360.7
[e4 b1849/2315] loss=0.4593 avg=0.4651 it/s=360.7
[e4 b2080/2315] loss=0.2953 avg=0.4630 it/s=357.1
[e4 b2311/2315] loss=0.5888 avg=0.4643 it/s=354.3




Epoch 4/7 | loss=0.4644 | val_acc=0.8533 | val_f1=0.8561 | time=109.4s
[e5 b1/2315] loss=0.8881 avg=0.8881 it/s=365.3
[e5 b2/2315] loss=0.2639 avg=0.5760 it/s=370.3




[e5 b232/2315] loss=0.4027 avg=0.4297 it/s=338.2
[e5 b463/2315] loss=0.6736 avg=0.4242 it/s=336.3
[e5 b694/2315] loss=0.2632 avg=0.4204 it/s=287.7
[e5 b925/2315] loss=0.5794 avg=0.4171 it/s=274.6
[e5 b1156/2315] loss=0.5390 avg=0.4173 it/s=286.1
[e5 b1387/2315] loss=0.4480 avg=0.4148 it/s=295.6
[e5 b1618/2315] loss=0.2828 avg=0.4155 it/s=303.7
[e5 b1849/2315] loss=0.3955 avg=0.4142 it/s=309.9
[e5 b2080/2315] loss=0.5345 avg=0.4156 it/s=312.5
[e5 b2311/2315] loss=0.5164 avg=0.4155 it/s=315.5




Epoch 5/7 | loss=0.4156 | val_acc=0.8516 | val_f1=0.8557 | time=122.2s
[e6 b1/2315] loss=0.2282 avg=0.2282 it/s=363.0
[e6 b2/2315] loss=0.3870 avg=0.3076 it/s=339.8




[e6 b232/2315] loss=0.2177 avg=0.3746 it/s=345.1
[e6 b463/2315] loss=0.2085 avg=0.3813 it/s=341.2
[e6 b694/2315] loss=0.2628 avg=0.3799 it/s=342.4
[e6 b925/2315] loss=0.5898 avg=0.3787 it/s=344.4
[e6 b1156/2315] loss=0.2149 avg=0.3813 it/s=344.6
[e6 b1387/2315] loss=0.3413 avg=0.3798 it/s=337.4
[e6 b1618/2315] loss=0.2083 avg=0.3797 it/s=327.8
[e6 b1849/2315] loss=0.2337 avg=0.3795 it/s=325.1
[e6 b2080/2315] loss=0.3658 avg=0.3817 it/s=327.7
[e6 b2311/2315] loss=0.3595 avg=0.3823 it/s=329.1




Epoch 6/7 | loss=0.3826 | val_acc=0.8664 | val_f1=0.8691 | time=117.3s
[e7 b1/2315] loss=0.2195 avg=0.2195 it/s=287.6
[e7 b2/2315] loss=0.2153 avg=0.2174 it/s=268.0




[e7 b232/2315] loss=0.5247 avg=0.3689 it/s=349.0
[e7 b463/2315] loss=0.2537 avg=0.3598 it/s=353.9
[e7 b694/2315] loss=0.5717 avg=0.3620 it/s=354.4
[e7 b925/2315] loss=0.4997 avg=0.3599 it/s=324.0
[e7 b1156/2315] loss=0.3738 avg=0.3582 it/s=330.2
[e7 b1387/2315] loss=0.3398 avg=0.3553 it/s=334.6
[e7 b1618/2315] loss=0.2102 avg=0.3527 it/s=338.1
[e7 b1849/2315] loss=0.4984 avg=0.3524 it/s=341.5
[e7 b2080/2315] loss=0.5323 avg=0.3513 it/s=344.0
[e7 b2311/2315] loss=0.2315 avg=0.3503 it/s=341.5


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Epoch 7/7 | loss=0.3502 | val_acc=0.8686 | val_f1=0.8714 | time=113.2s


metric,history
epoch,▁▁▂▂▂▂▂▂▃▃▃▅▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇██████████
lr,█▇▆▅▃▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▁▁▂▂▂▁▁▁▁▂▂▂▁▁▁▂▁▁▁▁▅▂▁▂▂▇▁▁▂▂▂█▁▁▂▂▂
time/epoch_sec,▂▄▁▁█▅▃
train/avg_loss_so_far,██▇▇▆▅▅▃▃▃▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▄▃▂▂▂▂▁▂▂▁▂▂▂▂
train/epoch_loss,█▄▃▂▂▁▁
train/items_per_sec,▁▇████▇▆▇▇▇▇█████▇██████▇▅▆▇█▇▇▇▇▇▇███▇▇

metric,value
best_checkpoint_path,checkpoints_midf1\be...
best_val_f1,0.86112
best_val_mid_f1,0.86112
epoch,7
lr,0
params/ratio,0.30719
params/total,278813189
params/trainable,85648901
step,16201
time/epoch_sec,113.17197


Best trial: 8. Best value: 0.871425:  45%|████▌     | 9/20 [1:30:43<1:58:08, 644.44s/it]

[Trial 8] f1=0.8714 | unfreeze_k=12 lr=2.97e-05 wd=4.3e-06 suggested_bs=8
[I 2025-08-17 19:18:46,667] Trial 8 finished with value: 0.8714249088849261 and parameters: {'num_unfreeze_last_layers': 12, 'lr': 2.9687864289331742e-05, 'weight_decay': 4.253134472368673e-06, 'batch_size': 8}. Best is trial 8 with value: 0.8714249088849261.




Trainable params: 85,648,901 / 278,813,189 (30.72%) ; unfreeze_last_k=12
[e1 b1/2315] loss=1.6363 avg=1.6363 it/s=238.3
[e1 b2/2315] loss=1.6445 avg=1.6404 it/s=265.7




[e1 b232/2315] loss=1.4707 avg=1.4974 it/s=339.4
[e1 b463/2315] loss=1.3856 avg=1.5004 it/s=347.5
[e1 b694/2315] loss=1.5249 avg=1.5037 it/s=351.1
[e1 b925/2315] loss=1.3192 avg=1.5047 it/s=350.2
[e1 b1156/2315] loss=1.5637 avg=1.5048 it/s=348.2
[e1 b1387/2315] loss=1.3644 avg=1.5053 it/s=346.0
[e1 b1618/2315] loss=1.3868 avg=1.5037 it/s=348.6
[e1 b1849/2315] loss=1.5552 avg=1.5032 it/s=350.2
[e1 b2080/2315] loss=1.4495 avg=1.5032 it/s=350.7
[e1 b2311/2315] loss=1.3749 avg=1.5037 it/s=350.5




Epoch 1/7 | loss=1.5037 | val_acc=0.2775 | val_f1=0.0869 | time=110.4s
[e2 b1/2315] loss=1.7497 avg=1.7497 it/s=315.0
[e2 b2/2315] loss=1.4981 avg=1.6239 it/s=320.7




[e2 b232/2315] loss=1.3906 avg=1.5004 it/s=341.6
[e2 b463/2315] loss=1.5096 avg=1.4963 it/s=343.4
[e2 b694/2315] loss=1.3717 avg=1.4950 it/s=330.9
[e2 b925/2315] loss=1.4459 avg=1.4961 it/s=304.1
[e2 b1156/2315] loss=1.5988 avg=1.4980 it/s=289.1
[e2 b1387/2315] loss=1.6096 avg=1.4977 it/s=285.9
[e2 b1618/2315] loss=1.4001 avg=1.4979 it/s=294.3
[e2 b1849/2315] loss=1.6339 avg=1.4983 it/s=301.2
[e2 b2080/2315] loss=1.5350 avg=1.4978 it/s=305.4
[e2 b2311/2315] loss=1.4536 avg=1.4983 it/s=308.9




Epoch 2/7 | loss=1.4985 | val_acc=0.2775 | val_f1=0.0869 | time=124.9s
[e3 b1/2315] loss=1.6302 avg=1.6302 it/s=363.8
[e3 b2/2315] loss=1.4320 avg=1.5311 it/s=372.9




[e3 b232/2315] loss=1.4877 avg=1.5083 it/s=345.4
[e3 b463/2315] loss=1.5270 avg=1.4991 it/s=354.2
[e3 b694/2315] loss=1.6186 avg=1.5019 it/s=357.6
[e3 b925/2315] loss=1.3500 avg=1.4992 it/s=357.1
[e3 b1156/2315] loss=1.4046 avg=1.4990 it/s=357.2
[e3 b1387/2315] loss=1.5631 avg=1.4973 it/s=358.1
[e3 b1618/2315] loss=1.6301 avg=1.4987 it/s=359.3
[e3 b1849/2315] loss=1.6405 avg=1.4986 it/s=360.2
[e3 b2080/2315] loss=1.4235 avg=1.4978 it/s=360.6
[e3 b2311/2315] loss=1.5187 avg=1.4977 it/s=359.3




Epoch 3/7 | loss=1.4976 | val_acc=0.2775 | val_f1=0.0869 | time=108.0s
[e4 b1/2315] loss=1.4245 avg=1.4245 it/s=283.0
[e4 b2/2315] loss=1.5576 avg=1.4910 it/s=281.8




[e4 b232/2315] loss=1.6032 avg=1.4980 it/s=325.8
[e4 b463/2315] loss=1.4053 avg=1.5003 it/s=331.9
[e4 b694/2315] loss=1.7446 avg=1.5007 it/s=340.8
[e4 b925/2315] loss=1.4561 avg=1.4981 it/s=344.0
[e4 b1156/2315] loss=1.5108 avg=1.4988 it/s=343.7
[e4 b1387/2315] loss=1.4383 avg=1.4994 it/s=341.5
[e4 b1618/2315] loss=1.3477 avg=1.4992 it/s=337.3
[e4 b1849/2315] loss=1.5024 avg=1.4985 it/s=316.3
[e4 b2080/2315] loss=1.6564 avg=1.4974 it/s=320.7
[e4 b2311/2315] loss=1.5096 avg=1.4962 it/s=323.3


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Early stopping at epoch 4


metric,history
epoch,▁▁▁▁▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▃▃▃▃▆▆▆▆▆▆▆██████████
lr,█▆▃▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▂▂▂▂▃▃▁▂▂▂▂▂▃▂▂▂▂▂▃▃▃██▁▁▁▂▂▂▂▂▂▂▂▃▃▃▃
time/epoch_sec,▂█▁▆
train/avg_loss_so_far,▆▆▃▃▃▃▃▃▃▃█▅▃▃▃▃▃▃▃▃▅▃▃▃▃▃▃▃▃▃▁▂▃▃▃▃▃▃▃▃
train/epoch_loss,█▃▂▁
train/items_per_sec,▁▂▆▇▇▇▇▇▇▇▅▅▆▆▆▄▃▄▄▄██▇▇▇▇▇▇▇▇▃▃▆▆▆▆▆▆▅▅

metric,value
best_checkpoint_path,checkpoints_midf1\be...
best_val_f1,0.1448
best_val_mid_f1,0.1448
epoch,4
lr,0.00026
params/ratio,0.30719
params/total,278813189
params/trainable,85648901
step,9256
time/epoch_sec,119.21406


Best trial: 8. Best value: 0.871425:  50%|█████     | 10/20 [1:38:35<1:38:32, 591.21s/it]

[Trial 9] f1=0.0869 | unfreeze_k=12 lr=5.68e-04 wd=4.1e-06 suggested_bs=16
[I 2025-08-17 19:26:38,695] Trial 9 finished with value: 0.08687713959680486 and parameters: {'num_unfreeze_last_layers': 12, 'lr': 0.0005684036256649225, 'weight_decay': 4.108463704855652e-06, 'batch_size': 16}. Best is trial 8 with value: 0.8714249088849261.




Trainable params: 71,473,157 / 278,813,189 (25.63%) ; unfreeze_last_k=10
[e1 b1/2315] loss=1.6156 avg=1.6156 it/s=124.1
[e1 b2/2315] loss=1.5684 avg=1.5920 it/s=169.7




[e1 b232/2315] loss=1.6434 avg=1.5322 it/s=360.6
[e1 b463/2315] loss=1.3232 avg=1.4731 it/s=373.1
[e1 b694/2315] loss=1.2018 avg=1.3889 it/s=366.4
[e1 b925/2315] loss=1.3401 avg=1.3098 it/s=331.4
[e1 b1156/2315] loss=0.7554 avg=1.2457 it/s=306.5
[e1 b1387/2315] loss=0.7943 avg=1.1893 it/s=319.1
[e1 b1618/2315] loss=0.6966 avg=1.1444 it/s=328.4
[e1 b1849/2315] loss=0.9882 avg=1.1038 it/s=335.8
[e1 b2080/2315] loss=1.2701 avg=1.0674 it/s=341.6
[e1 b2311/2315] loss=0.6656 avg=1.0369 it/s=345.7




Epoch 1/7 | loss=1.0364 | val_acc=0.7748 | val_f1=0.7785 | time=111.7s
[e2 b1/2315] loss=0.4861 avg=0.4861 it/s=337.7
[e2 b2/2315] loss=0.7544 avg=0.6202 it/s=313.8




[e2 b232/2315] loss=0.7074 avg=0.6971 it/s=378.5
[e2 b463/2315] loss=1.1013 avg=0.6856 it/s=330.6
[e2 b694/2315] loss=0.9703 avg=0.6869 it/s=302.6
[e2 b925/2315] loss=0.3881 avg=0.6806 it/s=306.4
[e2 b1156/2315] loss=0.6920 avg=0.6762 it/s=319.3
[e2 b1387/2315] loss=0.9065 avg=0.6737 it/s=330.8
[e2 b1618/2315] loss=0.5932 avg=0.6703 it/s=339.2
[e2 b1849/2315] loss=0.6494 avg=0.6671 it/s=345.5
[e2 b2080/2315] loss=0.4909 avg=0.6624 it/s=350.9
[e2 b2311/2315] loss=0.9188 avg=0.6587 it/s=355.2




Epoch 2/7 | loss=0.6587 | val_acc=0.8207 | val_f1=0.8252 | time=108.9s
[e3 b1/2315] loss=0.7678 avg=0.7678 it/s=389.0
[e3 b2/2315] loss=0.4125 avg=0.5901 it/s=389.3




[e3 b232/2315] loss=0.5112 avg=0.5857 it/s=359.4
[e3 b463/2315] loss=0.4716 avg=0.5684 it/s=342.0
[e3 b694/2315] loss=0.4176 avg=0.5617 it/s=301.8
[e3 b925/2315] loss=0.6818 avg=0.5549 it/s=300.2
[e3 b1156/2315] loss=0.5411 avg=0.5535 it/s=299.9
[e3 b1387/2315] loss=0.5632 avg=0.5531 it/s=309.5
[e3 b1618/2315] loss=0.6070 avg=0.5510 it/s=315.8
[e3 b1849/2315] loss=0.4538 avg=0.5508 it/s=322.2
[e3 b2080/2315] loss=0.8389 avg=0.5502 it/s=329.6
[e3 b2311/2315] loss=0.5144 avg=0.5496 it/s=335.5




Epoch 3/7 | loss=0.5494 | val_acc=0.8353 | val_f1=0.8404 | time=114.9s
[e4 b1/2315] loss=0.4670 avg=0.4670 it/s=353.6
[e4 b2/2315] loss=0.2436 avg=0.3553 it/s=339.3




[e4 b232/2315] loss=0.3907 avg=0.4733 it/s=380.8
[e4 b463/2315] loss=0.3101 avg=0.4758 it/s=295.8
[e4 b694/2315] loss=0.4941 avg=0.4766 it/s=285.0
[e4 b925/2315] loss=0.5627 avg=0.4789 it/s=283.1
[e4 b1156/2315] loss=0.4965 avg=0.4795 it/s=280.1
[e4 b1387/2315] loss=0.3147 avg=0.4756 it/s=294.9
[e4 b1618/2315] loss=0.3505 avg=0.4796 it/s=306.1
[e4 b1849/2315] loss=0.2782 avg=0.4780 it/s=315.1
[e4 b2080/2315] loss=0.3828 avg=0.4794 it/s=322.5
[e4 b2311/2315] loss=0.5637 avg=0.4800 it/s=328.8




Epoch 4/7 | loss=0.4801 | val_acc=0.8547 | val_f1=0.8590 | time=117.2s
[e5 b1/2315] loss=0.3297 avg=0.3297 it/s=354.4
[e5 b2/2315] loss=0.3668 avg=0.3482 it/s=410.8




[e5 b232/2315] loss=0.4464 avg=0.4167 it/s=385.0
[e5 b463/2315] loss=0.5801 avg=0.4234 it/s=373.9
[e5 b694/2315] loss=0.4655 avg=0.4276 it/s=373.6
[e5 b925/2315] loss=0.3504 avg=0.4260 it/s=354.3
[e5 b1156/2315] loss=0.4696 avg=0.4254 it/s=344.2
[e5 b1387/2315] loss=0.3001 avg=0.4275 it/s=352.3
[e5 b1618/2315] loss=0.3307 avg=0.4307 it/s=358.4
[e5 b1849/2315] loss=0.5193 avg=0.4299 it/s=362.6
[e5 b2080/2315] loss=0.2426 avg=0.4288 it/s=366.9
[e5 b2311/2315] loss=0.4049 avg=0.4296 it/s=370.4




Epoch 5/7 | loss=0.4296 | val_acc=0.8562 | val_f1=0.8605 | time=104.5s
[e6 b1/2315] loss=0.2074 avg=0.2074 it/s=408.2
[e6 b2/2315] loss=0.4505 avg=0.3289 it/s=412.0




[e6 b232/2315] loss=0.2002 avg=0.4069 it/s=408.0
[e6 b463/2315] loss=0.5540 avg=0.3991 it/s=407.3
[e6 b694/2315] loss=0.5322 avg=0.4026 it/s=381.4
[e6 b925/2315] loss=0.2313 avg=0.3977 it/s=358.5
[e6 b1156/2315] loss=0.2900 avg=0.3985 it/s=349.7
[e6 b1387/2315] loss=0.6179 avg=0.3975 it/s=340.5
[e6 b1618/2315] loss=0.3719 avg=0.3960 it/s=334.5
[e6 b1849/2315] loss=0.3158 avg=0.3939 it/s=337.9
[e6 b2080/2315] loss=0.6158 avg=0.3936 it/s=341.3
[e6 b2311/2315] loss=0.3862 avg=0.3929 it/s=344.1




Epoch 6/7 | loss=0.3932 | val_acc=0.8618 | val_f1=0.8655 | time=112.3s
[e7 b1/2315] loss=0.2115 avg=0.2115 it/s=355.2
[e7 b2/2315] loss=0.3424 avg=0.2769 it/s=367.4




[e7 b232/2315] loss=0.5132 avg=0.3565 it/s=403.1
[e7 b463/2315] loss=0.4525 avg=0.3623 it/s=400.0
[e7 b694/2315] loss=0.6282 avg=0.3666 it/s=400.7
[e7 b925/2315] loss=0.2071 avg=0.3701 it/s=393.5
[e7 b1156/2315] loss=0.4035 avg=0.3654 it/s=390.8
[e7 b1387/2315] loss=0.4652 avg=0.3666 it/s=387.7
[e7 b1618/2315] loss=0.2111 avg=0.3651 it/s=385.5
[e7 b1849/2315] loss=0.3821 avg=0.3662 it/s=384.9
[e7 b2080/2315] loss=0.2196 avg=0.3670 it/s=384.9
[e7 b2311/2315] loss=0.3492 avg=0.3673 it/s=385.8


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Epoch 7/7 | loss=0.3671 | val_acc=0.8591 | val_f1=0.8622 | time=100.6s


metric,history
epoch,▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇███████
lr,█▇▆▅▃▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▁▁▂▂▂▂▁▁▁▁▂▂▃▁▂▂▂▁▂▂▂▂▂▆▂▂▂▁▂▂▂▂▁█▁▂▂
time/epoch_sec,▆▄▇█▃▆▁
train/avg_loss_so_far,█▇▇▆▆▄▄▄▄▄▃▃▃▃▃▂▂▂▃▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂
train/epoch_loss,█▄▃▂▂▁▁
train/items_per_sec,▁▂▇▇▅▆▆▆▅▅▆▇▇▅▅▆▆▆▆▅▅▇█▇▇▇▇▇▇█▆▆▆▇▇██▇▇▇

metric,value
best_checkpoint_path,checkpoints_midf1\be...
best_val_f1,0.8546
best_val_mid_f1,0.8546
epoch,7
lr,0
params/ratio,0.25635
params/total,278813189
params/trainable,71473157
step,16201
time/epoch_sec,100.57654


Best trial: 8. Best value: 0.871425:  55%|█████▌    | 11/20 [1:51:39<1:37:30, 650.03s/it]

[Trial 10] f1=0.8655 | unfreeze_k=10 lr=3.00e-05 wd=9.1e-06 suggested_bs=8
[I 2025-08-17 19:39:42,074] Trial 10 finished with value: 0.8654852884932753 and parameters: {'num_unfreeze_last_layers': 10, 'lr': 3.002191102461359e-05, 'weight_decay': 9.093319053371251e-06, 'batch_size': 8}. Best is trial 8 with value: 0.8714249088849261.




Trainable params: 71,473,157 / 278,813,189 (25.63%) ; unfreeze_last_k=10
[e1 b1/2315] loss=1.6056 avg=1.6056 it/s=278.7
[e1 b2/2315] loss=1.6399 avg=1.6228 it/s=286.0




[e1 b232/2315] loss=1.6154 avg=1.5470 it/s=394.5
[e1 b463/2315] loss=1.1292 avg=1.4666 it/s=400.3
[e1 b694/2315] loss=1.2161 avg=1.3814 it/s=401.0
[e1 b925/2315] loss=0.7353 avg=1.2981 it/s=400.4
[e1 b1156/2315] loss=0.7831 avg=1.2329 it/s=395.6
[e1 b1387/2315] loss=0.7322 avg=1.1810 it/s=391.6
[e1 b1618/2315] loss=0.9182 avg=1.1303 it/s=388.5
[e1 b1849/2315] loss=0.6913 avg=1.0909 it/s=389.8
[e1 b2080/2315] loss=1.0932 avg=1.0564 it/s=390.9
[e1 b2311/2315] loss=0.6780 avg=1.0283 it/s=391.8




Epoch 1/7 | loss=1.0282 | val_acc=0.7459 | val_f1=0.7571 | time=99.2s
[e2 b1/2315] loss=0.5829 avg=0.5829 it/s=386.4
[e2 b2/2315] loss=0.6910 avg=0.6370 it/s=416.1




[e2 b232/2315] loss=0.6215 avg=0.6911 it/s=399.7
[e2 b463/2315] loss=0.6489 avg=0.6914 it/s=397.7
[e2 b694/2315] loss=0.5971 avg=0.6936 it/s=398.2
[e2 b925/2315] loss=1.1517 avg=0.6874 it/s=395.4
[e2 b1156/2315] loss=0.4284 avg=0.6815 it/s=395.0
[e2 b1387/2315] loss=0.5785 avg=0.6747 it/s=393.0
[e2 b1618/2315] loss=0.6605 avg=0.6664 it/s=387.3
[e2 b1849/2315] loss=0.7512 avg=0.6617 it/s=383.0
[e2 b2080/2315] loss=0.3534 avg=0.6553 it/s=379.0
[e2 b2311/2315] loss=0.4304 avg=0.6541 it/s=378.2




Epoch 2/7 | loss=0.6541 | val_acc=0.8277 | val_f1=0.8329 | time=102.8s
[e3 b1/2315] loss=0.4406 avg=0.4406 it/s=328.7
[e3 b2/2315] loss=0.3705 avg=0.4055 it/s=318.7




[e3 b232/2315] loss=0.4205 avg=0.5639 it/s=395.0
[e3 b463/2315] loss=0.4288 avg=0.5625 it/s=395.4
[e3 b694/2315] loss=0.5728 avg=0.5613 it/s=377.5
[e3 b925/2315] loss=0.6118 avg=0.5572 it/s=369.9
[e3 b1156/2315] loss=0.8364 avg=0.5540 it/s=372.1
[e3 b1387/2315] loss=0.7825 avg=0.5526 it/s=376.3
[e3 b1618/2315] loss=0.2645 avg=0.5487 it/s=378.5
[e3 b1849/2315] loss=0.4898 avg=0.5476 it/s=379.8
[e3 b2080/2315] loss=0.5825 avg=0.5468 it/s=377.7
[e3 b2311/2315] loss=0.5097 avg=0.5467 it/s=377.1




Epoch 3/7 | loss=0.5467 | val_acc=0.8479 | val_f1=0.8535 | time=103.0s
[e4 b1/2315] loss=0.2435 avg=0.2435 it/s=350.6
[e4 b2/2315] loss=0.7098 avg=0.4767 it/s=366.1




[e4 b232/2315] loss=0.2520 avg=0.4891 it/s=393.5
[e4 b463/2315] loss=0.7187 avg=0.4853 it/s=390.3
[e4 b694/2315] loss=0.5101 avg=0.4759 it/s=383.8
[e4 b925/2315] loss=0.4358 avg=0.4741 it/s=385.7
[e4 b1156/2315] loss=0.5665 avg=0.4735 it/s=383.4
[e4 b1387/2315] loss=0.2442 avg=0.4751 it/s=386.1
[e4 b1618/2315] loss=0.4391 avg=0.4751 it/s=388.2
[e4 b1849/2315] loss=0.2641 avg=0.4755 it/s=388.4
[e4 b2080/2315] loss=0.5964 avg=0.4733 it/s=388.9
[e4 b2311/2315] loss=0.5245 avg=0.4728 it/s=387.1




Epoch 4/7 | loss=0.4729 | val_acc=0.8450 | val_f1=0.8500 | time=100.5s
[e5 b1/2315] loss=0.3674 avg=0.3674 it/s=262.4
[e5 b2/2315] loss=0.2416 avg=0.3045 it/s=352.4




[e5 b232/2315] loss=0.3529 avg=0.4219 it/s=365.3
[e5 b463/2315] loss=0.2224 avg=0.4354 it/s=374.7
[e5 b694/2315] loss=0.2712 avg=0.4341 it/s=382.3
[e5 b925/2315] loss=0.4261 avg=0.4301 it/s=387.1
[e5 b1156/2315] loss=0.3929 avg=0.4290 it/s=388.4
[e5 b1387/2315] loss=0.4022 avg=0.4281 it/s=390.8
[e5 b1618/2315] loss=0.9989 avg=0.4261 it/s=390.9
[e5 b1849/2315] loss=1.0038 avg=0.4238 it/s=390.8
[e5 b2080/2315] loss=0.2381 avg=0.4239 it/s=391.5
[e5 b2311/2315] loss=0.6660 avg=0.4234 it/s=391.8




Epoch 5/7 | loss=0.4234 | val_acc=0.8525 | val_f1=0.8564 | time=99.3s
[e6 b1/2315] loss=0.3388 avg=0.3388 it/s=381.6
[e6 b2/2315] loss=0.4331 avg=0.3859 it/s=294.8




[e6 b232/2315] loss=0.4870 avg=0.3909 it/s=387.8
[e6 b463/2315] loss=0.2140 avg=0.3866 it/s=373.9
[e6 b694/2315] loss=0.4226 avg=0.3896 it/s=360.6
[e6 b925/2315] loss=0.2380 avg=0.3919 it/s=359.7
[e6 b1156/2315] loss=0.2009 avg=0.3912 it/s=362.6
[e6 b1387/2315] loss=0.2794 avg=0.3901 it/s=365.1
[e6 b1618/2315] loss=0.2292 avg=0.3892 it/s=365.1
[e6 b1849/2315] loss=0.3050 avg=0.3872 it/s=364.0
[e6 b2080/2315] loss=0.4279 avg=0.3872 it/s=364.4
[e6 b2311/2315] loss=0.2362 avg=0.3870 it/s=365.8




Epoch 6/7 | loss=0.3868 | val_acc=0.8610 | val_f1=0.8644 | time=105.7s
[e7 b1/2315] loss=0.5216 avg=0.5216 it/s=333.7
[e7 b2/2315] loss=0.2426 avg=0.3821 it/s=360.9




[e7 b232/2315] loss=0.1916 avg=0.3635 it/s=398.3
[e7 b463/2315] loss=0.7737 avg=0.3572 it/s=396.2
[e7 b694/2315] loss=0.2829 avg=0.3478 it/s=393.8
[e7 b925/2315] loss=0.5042 avg=0.3507 it/s=385.7
[e7 b1156/2315] loss=0.6038 avg=0.3527 it/s=385.8
[e7 b1387/2315] loss=0.3061 avg=0.3501 it/s=384.8
[e7 b1618/2315] loss=0.3153 avg=0.3510 it/s=381.1
[e7 b1849/2315] loss=0.3202 avg=0.3495 it/s=379.8
[e7 b2080/2315] loss=0.2458 avg=0.3477 it/s=378.2
[e7 b2311/2315] loss=0.4781 avg=0.3480 it/s=379.0


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Epoch 7/7 | loss=0.3480 | val_acc=0.8698 | val_f1=0.8728 | time=102.6s


metric,history
epoch,▁▁▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▃▃▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇████
lr,█▇▆▅▃▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▁▁▂▂▂▁▁▂▂▃▂▂▂▁▁▁▂▂▂▁▁▂▂▆▁▁▁▂▂▂▂▁▂█▁▁▂
time/epoch_sec,▁▅▅▂▁█▅
train/avg_loss_so_far,█▇▇▆▅▂▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂▂▂▁▁▁▁▁▁▂▁▁▁
train/epoch_loss,█▄▃▂▂▁▁
train/items_per_sec,███▇▇▇████▇▆▃█▆▇▇▅▆█▇▇▇▇▇▆▇▇▇▇▁▅▆▆▆▄▅█▇▇

metric,value
best_checkpoint_path,checkpoints_midf1\be...
best_val_f1,0.86312
best_val_mid_f1,0.86312
epoch,7
lr,0
params/ratio,0.25635
params/total,278813189
params/trainable,71473157
step,16201
time/epoch_sec,102.55042


Best trial: 11. Best value: 0.872794:  60%|██████    | 12/20 [2:03:47<1:29:50, 673.78s/it]

[Trial 11] f1=0.8728 | unfreeze_k=10 lr=3.39e-05 wd=9.6e-06 suggested_bs=8
[I 2025-08-17 19:51:50,183] Trial 11 finished with value: 0.8727940336019774 and parameters: {'num_unfreeze_last_layers': 10, 'lr': 3.392673551748963e-05, 'weight_decay': 9.567965176138777e-06, 'batch_size': 8}. Best is trial 11 with value: 0.8727940336019774.




Trainable params: 71,473,157 / 278,813,189 (25.63%) ; unfreeze_last_k=10
[e1 b1/2315] loss=1.6529 avg=1.6529 it/s=257.4
[e1 b2/2315] loss=1.6528 avg=1.6528 it/s=287.6




[e1 b232/2315] loss=1.4592 avg=1.5437 it/s=387.9
[e1 b463/2315] loss=1.5142 avg=1.4827 it/s=393.1
[e1 b694/2315] loss=1.0575 avg=1.4187 it/s=393.6
[e1 b925/2315] loss=1.0251 avg=1.3451 it/s=387.8
[e1 b1156/2315] loss=0.9380 avg=1.2777 it/s=383.5
[e1 b1387/2315] loss=0.8234 avg=1.2225 it/s=382.2
[e1 b1618/2315] loss=0.9640 avg=1.1721 it/s=383.5
[e1 b1849/2315] loss=0.7916 avg=1.1324 it/s=385.0
[e1 b2080/2315] loss=0.9532 avg=1.0976 it/s=387.1
[e1 b2311/2315] loss=0.6214 avg=1.0651 it/s=388.1




Epoch 1/7 | loss=1.0645 | val_acc=0.7369 | val_f1=0.7450 | time=100.2s
[e2 b1/2315] loss=0.4914 avg=0.4914 it/s=270.9
[e2 b2/2315] loss=0.9000 avg=0.6957 it/s=320.6




[e2 b232/2315] loss=0.7574 avg=0.7276 it/s=399.4
[e2 b463/2315] loss=0.7938 avg=0.7142 it/s=396.6
[e2 b694/2315] loss=0.3721 avg=0.7104 it/s=397.9
[e2 b925/2315] loss=0.8552 avg=0.7051 it/s=399.5
[e2 b1156/2315] loss=0.9236 avg=0.6980 it/s=399.2
[e2 b1387/2315] loss=0.9837 avg=0.6924 it/s=393.7
[e2 b1618/2315] loss=0.5770 avg=0.6873 it/s=386.5
[e2 b1849/2315] loss=0.7878 avg=0.6798 it/s=382.0
[e2 b2080/2315] loss=0.4129 avg=0.6720 it/s=381.7
[e2 b2311/2315] loss=0.4433 avg=0.6676 it/s=382.3




Epoch 2/7 | loss=0.6676 | val_acc=0.8195 | val_f1=0.8259 | time=101.4s
[e3 b1/2315] loss=0.3504 avg=0.3504 it/s=308.8
[e3 b2/2315] loss=0.4712 avg=0.4108 it/s=313.2




[e3 b232/2315] loss=0.2953 avg=0.5444 it/s=384.0
[e3 b463/2315] loss=0.2644 avg=0.5609 it/s=374.2
[e3 b694/2315] loss=0.5595 avg=0.5595 it/s=373.5
[e3 b925/2315] loss=0.4219 avg=0.5602 it/s=374.1
[e3 b1156/2315] loss=0.7099 avg=0.5569 it/s=380.6
[e3 b1387/2315] loss=0.4370 avg=0.5568 it/s=384.5
[e3 b1618/2315] loss=0.5057 avg=0.5580 it/s=386.6
[e3 b1849/2315] loss=0.6872 avg=0.5562 it/s=384.0
[e3 b2080/2315] loss=0.3111 avg=0.5535 it/s=383.8
[e3 b2311/2315] loss=0.5179 avg=0.5530 it/s=382.7




Epoch 3/7 | loss=0.5533 | val_acc=0.8404 | val_f1=0.8459 | time=101.5s
[e4 b1/2315] loss=0.5237 avg=0.5237 it/s=269.2
[e4 b2/2315] loss=0.3083 avg=0.4160 it/s=315.7




[e4 b232/2315] loss=0.4022 avg=0.4929 it/s=374.3
[e4 b463/2315] loss=0.9099 avg=0.4856 it/s=380.9
[e4 b694/2315] loss=0.5428 avg=0.4829 it/s=376.6
[e4 b925/2315] loss=0.3970 avg=0.4842 it/s=381.6
[e4 b1156/2315] loss=0.3414 avg=0.4851 it/s=380.1
[e4 b1387/2315] loss=0.3661 avg=0.4897 it/s=381.2
[e4 b1618/2315] loss=0.6153 avg=0.4899 it/s=381.8
[e4 b1849/2315] loss=0.2909 avg=0.4890 it/s=382.9
[e4 b2080/2315] loss=0.6957 avg=0.4874 it/s=383.6
[e4 b2311/2315] loss=0.5269 avg=0.4850 it/s=381.4




Epoch 4/7 | loss=0.4850 | val_acc=0.8528 | val_f1=0.8562 | time=101.9s
[e5 b1/2315] loss=0.2826 avg=0.2826 it/s=281.6
[e5 b2/2315] loss=0.4084 avg=0.3455 it/s=345.5




[e5 b232/2315] loss=0.2959 avg=0.4337 it/s=377.1
[e5 b463/2315] loss=0.2076 avg=0.4284 it/s=387.4
[e5 b694/2315] loss=0.2630 avg=0.4267 it/s=391.6
[e5 b925/2315] loss=0.2785 avg=0.4268 it/s=394.8
[e5 b1156/2315] loss=0.2251 avg=0.4310 it/s=394.7
[e5 b1387/2315] loss=0.4678 avg=0.4326 it/s=394.7
[e5 b1618/2315] loss=0.4111 avg=0.4348 it/s=395.0
[e5 b1849/2315] loss=0.4143 avg=0.4332 it/s=395.4
[e5 b2080/2315] loss=0.6073 avg=0.4332 it/s=396.6
[e5 b2311/2315] loss=0.3493 avg=0.4332 it/s=396.5




Epoch 5/7 | loss=0.4332 | val_acc=0.8588 | val_f1=0.8622 | time=97.9s
[e6 b1/2315] loss=0.2642 avg=0.2642 it/s=348.8
[e6 b2/2315] loss=0.2015 avg=0.2329 it/s=344.5




[e6 b232/2315] loss=0.2759 avg=0.3931 it/s=377.4
[e6 b463/2315] loss=0.3371 avg=0.3963 it/s=376.1
[e6 b694/2315] loss=0.3198 avg=0.3997 it/s=370.1
[e6 b925/2315] loss=0.2373 avg=0.3990 it/s=373.3
[e6 b1156/2315] loss=0.9036 avg=0.3956 it/s=376.7
[e6 b1387/2315] loss=0.4619 avg=0.3962 it/s=379.6
[e6 b1618/2315] loss=0.4645 avg=0.3974 it/s=374.1
[e6 b1849/2315] loss=0.3082 avg=0.3975 it/s=352.9
[e6 b2080/2315] loss=0.3108 avg=0.3940 it/s=339.7
[e6 b2311/2315] loss=0.6521 avg=0.3930 it/s=332.6




Epoch 6/7 | loss=0.3931 | val_acc=0.8571 | val_f1=0.8613 | time=115.8s
[e7 b1/2315] loss=0.3486 avg=0.3486 it/s=395.2
[e7 b2/2315] loss=0.2439 avg=0.2963 it/s=360.3




[e7 b232/2315] loss=0.5284 avg=0.3691 it/s=390.8
[e7 b463/2315] loss=0.5373 avg=0.3746 it/s=373.5
[e7 b694/2315] loss=0.7256 avg=0.3667 it/s=299.8
[e7 b925/2315] loss=0.3398 avg=0.3646 it/s=315.3
[e7 b1156/2315] loss=0.3876 avg=0.3640 it/s=326.4
[e7 b1387/2315] loss=0.3804 avg=0.3645 it/s=334.9
[e7 b1618/2315] loss=0.1964 avg=0.3638 it/s=343.2
[e7 b1849/2315] loss=0.4921 avg=0.3647 it/s=349.3
[e7 b2080/2315] loss=0.2040 avg=0.3657 it/s=354.7
[e7 b2311/2315] loss=0.1993 avg=0.3646 it/s=359.1


[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


Epoch 7/7 | loss=0.3644 | val_acc=0.8630 | val_f1=0.8662 | time=107.6s


metric,history
epoch,▁▁▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▃▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇█████
lr,█▇▆▅▃▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▁▁▂▂▂▂▂▁▁▁▂▂▂▂▂▃▁▂▁▁▂▂▁▁▆▁▁▂▂▂▂▂▁▁▁█▂
time/epoch_sec,▂▂▂▃▁█▅
train/avg_loss_so_far,██▇▇▆▆▆▅▂▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▁▂▂▂▂▂▁▂▂▂▁▂▂▂▂
train/epoch_loss,█▄▃▂▂▁▁
train/items_per_sec,▁▃███▇█▄█▇▄▇▇▇▇▇▇▇▇▇▇▇▅▇▇█████▇▇▅▅█▇▃▄▅▆

metric,value
best_checkpoint_path,checkpoints_midf1\be...
best_val_f1,0.8561
best_val_mid_f1,0.8561
epoch,7
lr,0
params/ratio,0.25635
params/total,278813189
params/trainable,71473157
step,16201
time/epoch_sec,107.62054


Best trial: 11. Best value: 0.872794:  65%|██████▌   | 13/20 [2:16:07<1:20:57, 693.99s/it]

[Trial 12] f1=0.8662 | unfreeze_k=10 lr=2.57e-05 wd=9.0e-06 suggested_bs=8
[I 2025-08-17 20:04:10,668] Trial 12 finished with value: 0.8661696640081556 and parameters: {'num_unfreeze_last_layers': 10, 'lr': 2.572261191520793e-05, 'weight_decay': 9.03168405297932e-06, 'batch_size': 8}. Best is trial 11 with value: 0.8727940336019774.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 71,473,157 / 278,813,189 (25.63%) ; unfreeze_last_k=10
[e1 b1/2315] loss=1.5939 avg=1.5939 it/s=253.6
[e1 b2/2315] loss=1.5691 avg=1.5815 it/s=276.5




[e1 b232/2315] loss=1.2325 avg=1.5084 it/s=395.7
[e1 b463/2315] loss=1.2026 avg=1.4235 it/s=388.2
[e1 b694/2315] loss=1.4262 avg=1.3136 it/s=382.8
[e1 b925/2315] loss=0.7808 avg=1.2406 it/s=378.2
[e1 b1156/2315] loss=0.8395 avg=1.1894 it/s=382.0
[e1 b1387/2315] loss=1.0330 avg=1.1502 it/s=385.1
[e1 b1618/2315] loss=0.5178 avg=1.1095 it/s=386.6
[e1 b1849/2315] loss=0.8246 avg=1.0753 it/s=388.4
[e1 b2080/2315] loss=1.1568 avg=1.0498 it/s=385.8
[e1 b2311/2315] loss=0.7095 avg=1.0265 it/s=361.7




Epoch 1/7 | loss=1.0259 | val_acc=0.7515 | val_f1=0.7598 | time=106.9s
[e2 b1/2315] loss=0.7537 avg=0.7537 it/s=300.6
[e2 b2/2315] loss=0.7814 avg=0.7675 it/s=345.6




[e2 b232/2315] loss=0.4803 avg=0.7329 it/s=398.7
[e2 b463/2315] loss=0.7441 avg=0.7479 it/s=401.1
[e2 b694/2315] loss=0.5703 avg=0.7457 it/s=389.3
[e2 b925/2315] loss=0.6256 avg=0.7462 it/s=380.0
[e2 b1156/2315] loss=0.6972 avg=0.7390 it/s=375.1
[e2 b1387/2315] loss=0.5841 avg=0.7373 it/s=377.1
[e2 b1618/2315] loss=0.2756 avg=0.7341 it/s=380.1
[e2 b1849/2315] loss=0.9255 avg=0.7291 it/s=382.2
[e2 b2080/2315] loss=0.6094 avg=0.7231 it/s=382.2
[e2 b2311/2315] loss=0.7284 avg=0.7168 it/s=379.6




Epoch 2/7 | loss=0.7166 | val_acc=0.7925 | val_f1=0.7982 | time=102.3s
[e3 b1/2315] loss=0.7268 avg=0.7268 it/s=329.9
[e3 b2/2315] loss=0.6951 avg=0.7109 it/s=336.1




[e3 b232/2315] loss=0.7802 avg=0.6253 it/s=365.5
[e3 b463/2315] loss=0.4502 avg=0.6118 it/s=384.3
[e3 b694/2315] loss=0.6341 avg=0.6078 it/s=389.0
[e3 b925/2315] loss=0.5422 avg=0.6079 it/s=391.2
[e3 b1156/2315] loss=0.3794 avg=0.6071 it/s=390.2
[e3 b1387/2315] loss=0.9885 avg=0.6075 it/s=389.9
[e3 b1618/2315] loss=0.5476 avg=0.6045 it/s=388.6
[e3 b1849/2315] loss=0.4910 avg=0.6043 it/s=386.5
[e3 b2080/2315] loss=0.7537 avg=0.6030 it/s=386.0
[e3 b2311/2315] loss=1.3027 avg=0.6017 it/s=383.6




Epoch 3/7 | loss=0.6015 | val_acc=0.8399 | val_f1=0.8448 | time=101.3s
[e4 b1/2315] loss=0.3310 avg=0.3310 it/s=385.1
[e4 b2/2315] loss=0.7205 avg=0.5257 it/s=310.2




[e4 b232/2315] loss=0.6330 avg=0.5367 it/s=393.3
[e4 b463/2315] loss=0.3032 avg=0.5303 it/s=397.9
[e4 b694/2315] loss=0.5564 avg=0.5244 it/s=395.7
[e4 b925/2315] loss=0.9037 avg=0.5284 it/s=392.3
[e4 b1156/2315] loss=0.5718 avg=0.5287 it/s=390.5
[e4 b1387/2315] loss=0.5726 avg=0.5337 it/s=391.6
[e4 b1618/2315] loss=0.4789 avg=0.5287 it/s=387.5
[e4 b1849/2315] loss=0.3720 avg=0.5254 it/s=385.4
[e4 b2080/2315] loss=0.6222 avg=0.5241 it/s=384.6
[e4 b2311/2315] loss=0.3074 avg=0.5235 it/s=384.9




Epoch 4/7 | loss=0.5239 | val_acc=0.8508 | val_f1=0.8553 | time=100.9s
[e5 b1/2315] loss=0.6916 avg=0.6916 it/s=316.7
[e5 b2/2315] loss=0.4122 avg=0.5519 it/s=339.3




[e5 b232/2315] loss=0.3100 avg=0.4696 it/s=388.6
[e5 b463/2315] loss=0.5657 avg=0.4717 it/s=386.1
[e5 b694/2315] loss=0.2255 avg=0.4725 it/s=389.4
[e5 b925/2315] loss=0.3970 avg=0.4709 it/s=390.4
[e5 b1156/2315] loss=0.2670 avg=0.4718 it/s=392.5
[e5 b1387/2315] loss=0.6212 avg=0.4703 it/s=394.0
[e5 b1618/2315] loss=0.3172 avg=0.4709 it/s=394.2
[e5 b1849/2315] loss=0.6562 avg=0.4707 it/s=394.5
[e5 b2080/2315] loss=1.1021 avg=0.4687 it/s=390.2
[e5 b2311/2315] loss=0.8951 avg=0.4677 it/s=385.5




Epoch 5/7 | loss=0.4680 | val_acc=0.8554 | val_f1=0.8593 | time=100.8s
[e6 b1/2315] loss=0.5162 avg=0.5162 it/s=309.8
[e6 b2/2315] loss=0.2833 avg=0.3998 it/s=320.3




[e6 b232/2315] loss=0.4217 avg=0.4210 it/s=377.0
[e6 b463/2315] loss=0.7848 avg=0.4265 it/s=386.2
[e6 b694/2315] loss=0.4603 avg=0.4248 it/s=389.7
[e6 b925/2315] loss=0.4472 avg=0.4236 it/s=384.5
[e6 b1156/2315] loss=0.2200 avg=0.4176 it/s=380.5
[e6 b1387/2315] loss=1.1225 avg=0.4170 it/s=378.0
[e6 b1618/2315] loss=0.5099 avg=0.4169 it/s=378.8
[e6 b1849/2315] loss=0.5078 avg=0.4165 it/s=382.0
[e6 b2080/2315] loss=0.2219 avg=0.4143 it/s=383.5
[e6 b2311/2315] loss=0.3634 avg=0.4140 it/s=383.5




Epoch 6/7 | loss=0.4139 | val_acc=0.8664 | val_f1=0.8700 | time=101.3s
[e7 b1/2315] loss=0.4834 avg=0.4834 it/s=632.4
[e7 b2/2315] loss=0.4861 avg=0.4848 it/s=429.3




[e7 b232/2315] loss=0.7490 avg=0.3861 it/s=383.9
[e7 b463/2315] loss=0.2349 avg=0.3802 it/s=382.1
[e7 b694/2315] loss=0.3068 avg=0.3742 it/s=379.1
[e7 b925/2315] loss=0.2317 avg=0.3723 it/s=377.0
[e7 b1156/2315] loss=0.4130 avg=0.3675 it/s=376.9
[e7 b1387/2315] loss=0.3987 avg=0.3703 it/s=376.6
[e7 b1618/2315] loss=0.2143 avg=0.3664 it/s=372.0
[e7 b1849/2315] loss=0.2549 avg=0.3682 it/s=371.4
[e7 b2080/2315] loss=0.2682 avg=0.3702 it/s=373.3
[e7 b2311/2315] loss=0.2384 avg=0.3694 it/s=374.1


[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


Epoch 7/7 | loss=0.3694 | val_acc=0.8661 | val_f1=0.8695 | time=103.7s


metric,history
epoch,▁▁▁▁▁▁▁▁▂▂▂▂▃▃▃▃▃▅▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇█████
lr,█▇▆▅▃▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▂▂▁▃▂▃▁▁▁▂▂▄▂▄▁▁▂▂▁▁▁▁▁▂▂▆▂▇▇▁▂▂█▁▁▁▂▂
time/epoch_sec,█▃▂▁▁▂▄
train/avg_loss_so_far,█▇▆▆▆▅▅▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▃▂▂▂▂▂▁▁▁▁▁▁▁▂▁▁
train/epoch_loss,█▅▃▃▂▁▁
train/items_per_sec,▆▅▅▆▅▃▆▆▅▅▅▅▆▆▅▅▁▆▆▆▆▆▅▆▆▆▅▁▂▅▅▅▅▅▅█▅▅▅▅

metric,value
best_checkpoint_path,checkpoints_midf1\be...
best_val_f1,0.85934
best_val_mid_f1,0.85934
epoch,7
lr,0
params/ratio,0.25635
params/total,278813189
params/trainable,71473157
step,16201
time/epoch_sec,103.70838


Best trial: 11. Best value: 0.872794:  70%|███████   | 14/20 [2:28:20<1:10:34, 705.78s/it]

[Trial 13] f1=0.8695 | unfreeze_k=10 lr=7.53e-05 wd=1.1e-06 suggested_bs=8
[I 2025-08-17 20:16:23,695] Trial 13 finished with value: 0.869482841547233 and parameters: {'num_unfreeze_last_layers': 10, 'lr': 7.527944983935378e-05, 'weight_decay': 1.0829233114724898e-06, 'batch_size': 8}. Best is trial 11 with value: 0.8727940336019774.




Trainable params: 64,385,285 / 278,813,189 (23.09%) ; unfreeze_last_k=9
[e1 b1/2315] loss=1.5763 avg=1.5763 it/s=127.6
[e1 b2/2315] loss=1.6025 avg=1.5894 it/s=185.2




[e1 b232/2315] loss=1.5313 avg=1.5433 it/s=375.3
[e1 b463/2315] loss=1.4876 avg=1.4846 it/s=382.1
[e1 b694/2315] loss=1.0125 avg=1.4124 it/s=386.3
[e1 b925/2315] loss=1.0557 avg=1.3384 it/s=384.3
[e1 b1156/2315] loss=0.8091 avg=1.2780 it/s=384.8
[e1 b1387/2315] loss=1.1321 avg=1.2274 it/s=387.4
[e1 b1618/2315] loss=1.0060 avg=1.1782 it/s=390.3
[e1 b1849/2315] loss=0.7094 avg=1.1378 it/s=392.8
[e1 b2080/2315] loss=0.8868 avg=1.1052 it/s=395.3
[e1 b2311/2315] loss=0.5668 avg=1.0772 it/s=397.6




Epoch 1/7 | loss=1.0765 | val_acc=0.7600 | val_f1=0.7672 | time=97.8s
[e2 b1/2315] loss=0.6886 avg=0.6886 it/s=341.5
[e2 b2/2315] loss=0.8275 avg=0.7581 it/s=342.7




[e2 b232/2315] loss=0.9294 avg=0.7543 it/s=410.6
[e2 b463/2315] loss=0.5514 avg=0.7392 it/s=372.2
[e2 b694/2315] loss=0.7800 avg=0.7344 it/s=334.3
[e2 b925/2315] loss=0.8013 avg=0.7271 it/s=315.4
[e2 b1156/2315] loss=0.8873 avg=0.7188 it/s=301.5
[e2 b1387/2315] loss=0.8049 avg=0.7186 it/s=294.1
[e2 b1618/2315] loss=0.6188 avg=0.7113 it/s=293.3
[e2 b1849/2315] loss=0.4104 avg=0.7087 it/s=290.6
[e2 b2080/2315] loss=0.4033 avg=0.7036 it/s=290.5
[e2 b2311/2315] loss=0.9614 avg=0.6994 it/s=286.2




Epoch 2/7 | loss=0.6995 | val_acc=0.8032 | val_f1=0.8086 | time=133.9s
[e3 b1/2315] loss=0.6615 avg=0.6615 it/s=349.6
[e3 b2/2315] loss=0.5340 avg=0.5977 it/s=351.0




[e3 b232/2315] loss=0.5574 avg=0.5566 it/s=405.5
[e3 b463/2315] loss=0.4115 avg=0.5833 it/s=381.8
[e3 b694/2315] loss=0.8856 avg=0.5905 it/s=345.5
[e3 b925/2315] loss=0.5188 avg=0.5935 it/s=328.9
[e3 b1156/2315] loss=0.6576 avg=0.5932 it/s=322.2
[e3 b1387/2315] loss=0.7486 avg=0.5901 it/s=319.3
[e3 b1618/2315] loss=0.5963 avg=0.5891 it/s=310.2
[e3 b1849/2315] loss=0.6771 avg=0.5866 it/s=320.3
[e3 b2080/2315] loss=0.4683 avg=0.5847 it/s=329.1
[e3 b2311/2315] loss=0.3503 avg=0.5834 it/s=336.6




Epoch 3/7 | loss=0.5834 | val_acc=0.8158 | val_f1=0.8215 | time=114.6s
[e4 b1/2315] loss=0.3492 avg=0.3492 it/s=392.7
[e4 b2/2315] loss=0.8180 avg=0.5836 it/s=396.3




[e4 b232/2315] loss=0.9351 avg=0.5265 it/s=408.4
[e4 b463/2315] loss=0.6130 avg=0.5173 it/s=400.8
[e4 b694/2315] loss=0.4662 avg=0.5128 it/s=399.6
[e4 b925/2315] loss=0.2753 avg=0.5110 it/s=400.2
[e4 b1156/2315] loss=0.7577 avg=0.5117 it/s=404.1
[e4 b1387/2315] loss=0.5856 avg=0.5128 it/s=407.3
[e4 b1618/2315] loss=0.4218 avg=0.5141 it/s=409.6
[e4 b1849/2315] loss=0.4169 avg=0.5136 it/s=410.4
[e4 b2080/2315] loss=0.3733 avg=0.5126 it/s=412.4
[e4 b2311/2315] loss=0.4518 avg=0.5133 it/s=413.3




Epoch 4/7 | loss=0.5132 | val_acc=0.8455 | val_f1=0.8500 | time=94.1s
[e5 b1/2315] loss=0.3517 avg=0.3517 it/s=380.4
[e5 b2/2315] loss=0.2755 avg=0.3136 it/s=390.7




[e5 b232/2315] loss=0.8273 avg=0.4722 it/s=423.8
[e5 b463/2315] loss=0.7173 avg=0.4714 it/s=426.0
[e5 b694/2315] loss=0.6931 avg=0.4681 it/s=426.8
[e5 b925/2315] loss=0.3393 avg=0.4667 it/s=422.0
[e5 b1156/2315] loss=0.7813 avg=0.4629 it/s=411.2
[e5 b1387/2315] loss=0.7521 avg=0.4639 it/s=402.6
[e5 b1618/2315] loss=0.5399 avg=0.4636 it/s=400.6
[e5 b1849/2315] loss=0.5447 avg=0.4623 it/s=403.3
[e5 b2080/2315] loss=0.6809 avg=0.4615 it/s=404.9
[e5 b2311/2315] loss=0.5221 avg=0.4612 it/s=406.2




Epoch 5/7 | loss=0.4611 | val_acc=0.8477 | val_f1=0.8531 | time=95.8s
[e6 b1/2315] loss=0.8156 avg=0.8156 it/s=396.3
[e6 b2/2315] loss=0.2391 avg=0.5273 it/s=370.8




[e6 b232/2315] loss=0.2343 avg=0.4205 it/s=392.9
[e6 b463/2315] loss=0.6515 avg=0.4256 it/s=378.2
[e6 b694/2315] loss=0.2570 avg=0.4239 it/s=381.8
[e6 b925/2315] loss=0.5569 avg=0.4238 it/s=390.8
[e6 b1156/2315] loss=0.6020 avg=0.4256 it/s=383.9
[e6 b1387/2315] loss=0.2543 avg=0.4246 it/s=365.1
[e6 b1618/2315] loss=0.2957 avg=0.4246 it/s=341.8
[e6 b1849/2315] loss=0.2735 avg=0.4239 it/s=348.3
[e6 b2080/2315] loss=0.3734 avg=0.4220 it/s=353.5
[e6 b2311/2315] loss=0.2381 avg=0.4230 it/s=357.9




Epoch 6/7 | loss=0.4229 | val_acc=0.8511 | val_f1=0.8560 | time=108.2s
[e7 b1/2315] loss=0.4932 avg=0.4932 it/s=566.4
[e7 b2/2315] loss=0.5145 avg=0.5039 it/s=406.2




[e7 b232/2315] loss=0.8898 avg=0.3956 it/s=398.3
[e7 b463/2315] loss=0.3943 avg=0.3991 it/s=386.4
[e7 b694/2315] loss=0.2238 avg=0.3974 it/s=363.5
[e7 b925/2315] loss=0.6054 avg=0.3971 it/s=356.1
[e7 b1156/2315] loss=0.2925 avg=0.3938 it/s=350.8
[e7 b1387/2315] loss=0.4738 avg=0.3920 it/s=346.2
[e7 b1618/2315] loss=0.4318 avg=0.3908 it/s=344.6
[e7 b1849/2315] loss=0.3924 avg=0.3921 it/s=342.9
[e7 b2080/2315] loss=0.3038 avg=0.3941 it/s=341.3
[e7 b2311/2315] loss=0.2524 avg=0.3941 it/s=342.6


[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


Epoch 7/7 | loss=0.3941 | val_acc=0.8567 | val_f1=0.8608 | time=113.7s


metric,history
epoch,▁▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▃▃▅▅▅▅▆▆▆▆▇▇▇▇▇▇▇███████
lr,█▇▆▅▃▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▂▂▂▂▁▁▁▂▂▂▂▃▃▁▂▂▂▂▂▁▁▁▅▂▂▂▅▂▂▂▇▂▁█▁▁▂
time/epoch_sec,▂█▅▁▁▃▄
train/avg_loss_so_far,██▇▇▇▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▁▁▂▂▂▂▄▂▂▂▂▂▂▁▁▁▁▁▁
train/epoch_loss,█▄▃▂▂▁▁
train/items_per_sec,▁▇▇▇▇▇▇▇▆▆▅▄▄▄▆▆▅▅▅▇▇▇██▇███▇▇▇▇▇▇▇▆▇▆▆▆

metric,value
best_checkpoint_path,checkpoints_midf1\be...
best_val_f1,0.84748
best_val_mid_f1,0.84748
epoch,7
lr,0
params/ratio,0.23093
params/total,278813189
params/trainable,64385285
step,16201
time/epoch_sec,113.6974


Best trial: 11. Best value: 0.872794:  75%|███████▌  | 15/20 [2:41:13<1:00:29, 725.96s/it]

[Trial 14] f1=0.8608 | unfreeze_k=9 lr=2.46e-05 wd=4.6e-06 suggested_bs=8
[I 2025-08-17 20:29:16,420] Trial 14 finished with value: 0.8608142355635048 and parameters: {'num_unfreeze_last_layers': 9, 'lr': 2.4550953525635824e-05, 'weight_decay': 4.557439822949102e-06, 'batch_size': 8}. Best is trial 11 with value: 0.8727940336019774.




Trainable params: 78,561,029 / 278,813,189 (28.18%) ; unfreeze_last_k=11
[e1 b1/2315] loss=1.5774 avg=1.5774 it/s=129.0
[e1 b2/2315] loss=1.6572 avg=1.6173 it/s=170.7




[e1 b232/2315] loss=1.4284 avg=1.5252 it/s=260.4
[e1 b463/2315] loss=1.1602 avg=1.4583 it/s=260.7
[e1 b694/2315] loss=1.1941 avg=1.3616 it/s=278.2
[e1 b925/2315] loss=0.7733 avg=1.2790 it/s=285.8
[e1 b1156/2315] loss=0.6321 avg=1.2116 it/s=289.9
[e1 b1387/2315] loss=1.0436 avg=1.1597 it/s=293.2
[e1 b1618/2315] loss=0.6534 avg=1.1141 it/s=296.6
[e1 b1849/2315] loss=1.0319 avg=1.0739 it/s=298.4
[e1 b2080/2315] loss=0.7777 avg=1.0446 it/s=306.0
[e1 b2311/2315] loss=0.5667 avg=1.0156 it/s=312.2




Epoch 1/7 | loss=1.0149 | val_acc=0.7799 | val_f1=0.7841 | time=123.2s
[e2 b1/2315] loss=0.9026 avg=0.9026 it/s=320.5
[e2 b2/2315] loss=0.6185 avg=0.7605 it/s=328.8




[e2 b232/2315] loss=0.4643 avg=0.7057 it/s=346.7
[e2 b463/2315] loss=0.5139 avg=0.6803 it/s=349.0
[e2 b694/2315] loss=0.5763 avg=0.6754 it/s=357.2
[e2 b925/2315] loss=0.6586 avg=0.6687 it/s=363.9
[e2 b1156/2315] loss=0.5156 avg=0.6663 it/s=368.4
[e2 b1387/2315] loss=0.4350 avg=0.6588 it/s=371.6
[e2 b1618/2315] loss=0.8694 avg=0.6554 it/s=370.2
[e2 b1849/2315] loss=0.3272 avg=0.6542 it/s=368.7
[e2 b2080/2315] loss=0.7713 avg=0.6509 it/s=368.5
[e2 b2311/2315] loss=0.4203 avg=0.6470 it/s=368.2




Epoch 2/7 | loss=0.6469 | val_acc=0.8175 | val_f1=0.8205 | time=105.4s
[e3 b1/2315] loss=0.6320 avg=0.6320 it/s=300.0
[e3 b2/2315] loss=0.6271 avg=0.6295 it/s=324.0




[e3 b232/2315] loss=0.8805 avg=0.5368 it/s=351.6
[e3 b463/2315] loss=0.7453 avg=0.5428 it/s=359.3
[e3 b694/2315] loss=0.4865 avg=0.5439 it/s=363.4
[e3 b925/2315] loss=0.6788 avg=0.5458 it/s=365.6
[e3 b1156/2315] loss=0.3652 avg=0.5462 it/s=368.1
[e3 b1387/2315] loss=0.2399 avg=0.5428 it/s=369.7
[e3 b1618/2315] loss=0.7000 avg=0.5424 it/s=371.8
[e3 b1849/2315] loss=0.4892 avg=0.5409 it/s=371.1
[e3 b2080/2315] loss=0.5939 avg=0.5376 it/s=369.5
[e3 b2311/2315] loss=0.2455 avg=0.5372 it/s=369.4




Epoch 3/7 | loss=0.5368 | val_acc=0.8445 | val_f1=0.8491 | time=105.0s
[e4 b1/2315] loss=0.3757 avg=0.3757 it/s=376.0
[e4 b2/2315] loss=0.2757 avg=0.3257 it/s=328.3




[e4 b232/2315] loss=0.7580 avg=0.4639 it/s=380.7
[e4 b463/2315] loss=0.4461 avg=0.4631 it/s=380.9
[e4 b694/2315] loss=0.3735 avg=0.4600 it/s=378.1
[e4 b925/2315] loss=0.3211 avg=0.4606 it/s=377.9
[e4 b1156/2315] loss=0.3641 avg=0.4605 it/s=376.9
[e4 b1387/2315] loss=0.9307 avg=0.4652 it/s=367.1
[e4 b1618/2315] loss=0.6017 avg=0.4691 it/s=344.5
[e4 b1849/2315] loss=0.4384 avg=0.4690 it/s=349.7
[e4 b2080/2315] loss=0.8200 avg=0.4700 it/s=349.9
[e4 b2311/2315] loss=0.2765 avg=0.4682 it/s=349.3




Epoch 4/7 | loss=0.4681 | val_acc=0.8584 | val_f1=0.8620 | time=110.8s
[e5 b1/2315] loss=0.9398 avg=0.9398 it/s=327.5
[e5 b2/2315] loss=0.8474 avg=0.8936 it/s=317.6




[e5 b232/2315] loss=0.2155 avg=0.4226 it/s=353.3
[e5 b463/2315] loss=0.5603 avg=0.4205 it/s=366.6
[e5 b694/2315] loss=0.3718 avg=0.4222 it/s=347.3
[e5 b925/2315] loss=0.2215 avg=0.4229 it/s=320.5
[e5 b1156/2315] loss=0.4927 avg=0.4231 it/s=328.0
[e5 b1387/2315] loss=0.4619 avg=0.4201 it/s=333.4
[e5 b1618/2315] loss=0.4988 avg=0.4198 it/s=340.1
[e5 b1849/2315] loss=0.2878 avg=0.4172 it/s=345.4
[e5 b2080/2315] loss=0.2410 avg=0.4186 it/s=349.7
[e5 b2311/2315] loss=0.6017 avg=0.4181 it/s=350.7




Epoch 5/7 | loss=0.4180 | val_acc=0.8598 | val_f1=0.8626 | time=110.2s
[e6 b1/2315] loss=0.2644 avg=0.2644 it/s=344.1
[e6 b2/2315] loss=0.4250 avg=0.3447 it/s=343.1




[e6 b232/2315] loss=0.4880 avg=0.3569 it/s=362.9
[e6 b463/2315] loss=0.4451 avg=0.3655 it/s=364.9
[e6 b694/2315] loss=0.6100 avg=0.3642 it/s=361.1
[e6 b925/2315] loss=0.2057 avg=0.3683 it/s=326.7
[e6 b1156/2315] loss=0.5095 avg=0.3674 it/s=334.4
[e6 b1387/2315] loss=0.3204 avg=0.3680 it/s=340.9
[e6 b1618/2315] loss=0.7116 avg=0.3695 it/s=346.2
[e6 b1849/2315] loss=0.2537 avg=0.3685 it/s=349.7
[e6 b2080/2315] loss=0.3742 avg=0.3665 it/s=353.0
[e6 b2311/2315] loss=0.4421 avg=0.3680 it/s=355.4




Epoch 6/7 | loss=0.3681 | val_acc=0.8727 | val_f1=0.8763 | time=109.2s
[e7 b1/2315] loss=0.2701 avg=0.2701 it/s=391.8
[e7 b2/2315] loss=0.3360 avg=0.3030 it/s=365.5




[e7 b232/2315] loss=0.3398 avg=0.3419 it/s=362.6
[e7 b463/2315] loss=0.2341 avg=0.3364 it/s=361.9
[e7 b694/2315] loss=0.2005 avg=0.3411 it/s=366.4
[e7 b925/2315] loss=0.3279 avg=0.3414 it/s=370.1
[e7 b1156/2315] loss=0.5047 avg=0.3391 it/s=372.8
[e7 b1387/2315] loss=0.2734 avg=0.3411 it/s=373.6
[e7 b1618/2315] loss=0.2130 avg=0.3396 it/s=374.9
[e7 b1849/2315] loss=0.4805 avg=0.3376 it/s=376.0
[e7 b2080/2315] loss=0.2138 avg=0.3372 it/s=377.2
[e7 b2311/2315] loss=0.4046 avg=0.3370 it/s=378.4


[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


Epoch 7/7 | loss=0.3369 | val_acc=0.8710 | val_f1=0.8744 | time=102.4s


metric,history
epoch,▁▁▁▁▁▂▂▂▂▂▂▂▂▃▃▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇███████
lr,█▇▆▅▃▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▂▂▃▃▃▁▁▃▃▂▂▃▃█▁▂▂▃▃▁▂▂▃▃▃▃▂▃▃▃▃▁▂▂▂▃▃
time/epoch_sec,█▂▂▄▄▃▁
train/avg_loss_so_far,█▇▇▆▆▅▄▃▃▃▃▃▃▂▂▂▂▂▁▁▂▂▂▂▂▄▂▂▂▂▁▁▁▁▁▁▁▁▁▁
train/epoch_loss,█▄▃▂▂▁▁
train/items_per_sec,▁▅▅▅▅▆▆▇▇███▆▇██████▇▇▇▇▆▆▇▇▇▇██▇▇▇█████

metric,value
best_checkpoint_path,checkpoints_midf1\be...
best_val_f1,0.86604
best_val_mid_f1,0.86604
epoch,7
lr,0
params/ratio,0.28177
params/total,278813189
params/trainable,78561029
step,16201
time/epoch_sec,102.38467


Best trial: 15. Best value: 0.876329:  80%|████████  | 16/20 [2:54:14<49:29, 742.49s/it]  

[Trial 15] f1=0.8763 | unfreeze_k=11 lr=4.17e-05 wd=9.8e-06 suggested_bs=8
[I 2025-08-17 20:42:17,307] Trial 15 finished with value: 0.8763286523329038 and parameters: {'num_unfreeze_last_layers': 11, 'lr': 4.171825599095043e-05, 'weight_decay': 9.830249183132208e-06, 'batch_size': 8}. Best is trial 15 with value: 0.8763286523329038.




Trainable params: 64,385,285 / 278,813,189 (23.09%) ; unfreeze_last_k=9
[e1 b1/2315] loss=1.6327 avg=1.6327 it/s=144.6
[e1 b2/2315] loss=1.5708 avg=1.6017 it/s=205.0




[e1 b232/2315] loss=1.3444 avg=1.4967 it/s=367.6
[e1 b463/2315] loss=1.3013 avg=1.3829 it/s=366.3
[e1 b694/2315] loss=0.8946 avg=1.3013 it/s=373.0
[e1 b925/2315] loss=1.8271 avg=1.2604 it/s=382.0
[e1 b1156/2315] loss=1.6504 avg=1.2445 it/s=383.8
[e1 b1387/2315] loss=1.6782 avg=1.2297 it/s=355.0
[e1 b1618/2315] loss=0.9148 avg=1.2201 it/s=357.1
[e1 b1849/2315] loss=1.4433 avg=1.2275 it/s=359.6
[e1 b2080/2315] loss=1.6244 avg=1.2297 it/s=362.0
[e1 b2311/2315] loss=1.2484 avg=1.2369 it/s=366.9




Epoch 1/7 | loss=1.2368 | val_acc=0.4373 | val_f1=0.3556 | time=105.5s
[e2 b1/2315] loss=1.3269 avg=1.3269 it/s=394.5
[e2 b2/2315] loss=1.0305 avg=1.1787 it/s=389.0




[e2 b232/2315] loss=1.1088 avg=1.2297 it/s=419.0
[e2 b463/2315] loss=1.3101 avg=1.2198 it/s=419.5
[e2 b694/2315] loss=1.4162 avg=1.2393 it/s=411.6
[e2 b925/2315] loss=1.3581 avg=1.2787 it/s=411.5
[e2 b1156/2315] loss=1.3582 avg=1.2990 it/s=408.2
[e2 b1387/2315] loss=1.7065 avg=1.3158 it/s=404.8
[e2 b1618/2315] loss=1.5540 avg=1.3176 it/s=405.4
[e2 b1849/2315] loss=1.6738 avg=1.3181 it/s=404.3
[e2 b2080/2315] loss=1.6574 avg=1.3365 it/s=404.3
[e2 b2311/2315] loss=1.4494 avg=1.3525 it/s=405.1




Epoch 2/7 | loss=1.3527 | val_acc=0.2775 | val_f1=0.0869 | time=96.0s
[e3 b1/2315] loss=1.6925 avg=1.6925 it/s=629.7
[e3 b2/2315] loss=1.4202 avg=1.5564 it/s=445.9




[e3 b232/2315] loss=1.4016 avg=1.4989 it/s=420.5
[e3 b463/2315] loss=1.6028 avg=1.4968 it/s=419.1
[e3 b694/2315] loss=1.4826 avg=1.4954 it/s=420.2
[e3 b925/2315] loss=1.4355 avg=1.4966 it/s=418.9
[e3 b1156/2315] loss=1.3509 avg=1.4968 it/s=417.7
[e3 b1387/2315] loss=1.4008 avg=1.4971 it/s=414.7
[e3 b1618/2315] loss=1.5549 avg=1.4958 it/s=412.0
[e3 b1849/2315] loss=1.6599 avg=1.4973 it/s=410.8
[e3 b2080/2315] loss=1.5463 avg=1.4967 it/s=412.0
[e3 b2311/2315] loss=1.5955 avg=1.4970 it/s=413.1




Epoch 3/7 | loss=1.4970 | val_acc=0.2775 | val_f1=0.0869 | time=94.3s
[e4 b1/2315] loss=1.4970 avg=1.4970 it/s=412.8
[e4 b2/2315] loss=1.5150 avg=1.5060 it/s=392.7




[e4 b232/2315] loss=1.4514 avg=1.4964 it/s=413.5
[e4 b463/2315] loss=1.6099 avg=1.4953 it/s=414.2
[e4 b694/2315] loss=1.6642 avg=1.4944 it/s=414.0
[e4 b925/2315] loss=1.3729 avg=1.4942 it/s=406.1
[e4 b1156/2315] loss=1.5168 avg=1.4959 it/s=408.4
[e4 b1387/2315] loss=1.6167 avg=1.4965 it/s=410.6
[e4 b1618/2315] loss=1.4290 avg=1.4952 it/s=411.9
[e4 b1849/2315] loss=1.4295 avg=1.4949 it/s=409.1
[e4 b2080/2315] loss=1.4462 avg=1.4952 it/s=404.5
[e4 b2311/2315] loss=1.4626 avg=1.4960 it/s=401.1


[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


Early stopping at epoch 4


metric,history
epoch,▁▁▁▁▁▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▆▆▆▆▆▆▆▆▆▆██████████
lr,█▆▃▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▂▂▂▂▂▂▃▃▃▁▁▁▁▃▂▂▂▂▂▃▁▁▂▂▂▂▂▂▃▃▁▁▁█▂█▂▃
time/epoch_sec,█▂▁▃
train/avg_loss_so_far,▇▇▅▄▃▂▂▂▂▂▃▁▂▂▂▃▃▃▃▃█▆▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅
train/epoch_loss,▁▄██
train/items_per_sec,▁▂▄▄▄▄▄▄▄▄▅▅▅▅▅▅▅▅▅▅█▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅

metric,value
best_checkpoint_path,checkpoints_midf1\be...
best_val_f1,0.42473
best_val_mid_f1,0.42473
epoch,4
lr,7e-05
params/ratio,0.23093
params/total,278813189
params/trainable,64385285
step,9256
time/epoch_sec,97.13188


Best trial: 15. Best value: 0.876329:  85%|████████▌ | 17/20 [3:00:56<32:00, 640.19s/it]

[Trial 16] f1=0.3556 | unfreeze_k=9 lr=1.44e-04 wd=8.9e-06 suggested_bs=4
[I 2025-08-17 20:48:59,579] Trial 16 finished with value: 0.3555999440373106 and parameters: {'num_unfreeze_last_layers': 9, 'lr': 0.00014358541335158907, 'weight_decay': 8.926002789778153e-06, 'batch_size': 4}. Best is trial 15 with value: 0.8763286523329038.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 78,561,029 / 278,813,189 (28.18%) ; unfreeze_last_k=11
[e1 b1/2315] loss=1.5932 avg=1.5932 it/s=246.0
[e1 b2/2315] loss=1.5888 avg=1.5910 it/s=287.4


[e1 b232/2315] loss=1.6923 avg=1.5050 it/s=374.3
[e1 b463/2315] loss=1.4619 avg=1.5132 it/s=375.2
[e1 b694/2315] loss=1.6405 avg=1.5128 it/s=367.6
[e1 b925/2315] loss=1.5371 avg=1.5135 it/s=364.4
[e1 b1156/2315] loss=1.5179 avg=1.5119 it/s=359.6
[e1 b1387/2315] loss=1.5355 avg=1.5109 it/s=363.1
[e1 b1618/2315] loss=1.5542 avg=1.5099 it/s=366.2
[e1 b1849/2315] loss=1.5140 avg=1.5095 it/s=368.3
[e1 b2080/2315] loss=1.4892 avg=1.5082 it/s=367.8
[e1 b2311/2315] loss=1.5955 avg=1.5080 it/s=367.3


Epoch 1/7 | loss=1.5080 | val_acc=0.2775 | val_f1=0.0869 | time=105.4s
[e2 b1/2315] loss=1.6071 avg=1.6071 it/s=301.8
[e2 b2/2315] loss=1.5117 avg=1.5594 it/s=364.4


[e2 b232/2315] loss=1.4869 avg=1.4944 it/s=357.0
[e2 b463/2315] loss=1.6823 avg=1.4965 it/s=362.7
[e2 b694/2315] loss=1.5379 avg=1.4987 it/s=361.1
[e2 b925/2315] loss=1.5904 avg=1.4991 it/s=364.4
[e2 b1156/2315] loss=1.4528 avg=1.5000 it/s=367.9
[e2 b1387/2315] loss=1.5286 avg=1.4993 it/s=369.7
[e2 b1618/2315] loss=1.4088 avg=1.5009 it/s=371.0
[e2 b1849/2315] loss=1.3930 avg=1.5008 it/s=372.9
[e2 b2080/2315] loss=1.5183 avg=1.4988 it/s=373.3
[e2 b2311/2315] loss=1.5356 avg=1.4985 it/s=372.4


Epoch 2/7 | loss=1.4984 | val_acc=0.2775 | val_f1=0.0869 | time=104.2s
[e3 b1/2315] loss=1.6221 avg=1.6221 it/s=387.7
[e3 b2/2315] loss=1.4445 avg=1.5333 it/s=440.1


[e3 b232/2315] loss=1.5155 avg=1.4947 it/s=355.5
[e3 b463/2315] loss=1.5087 avg=1.4967 it/s=351.4
[e3 b694/2315] loss=1.3977 avg=1.4963 it/s=359.9
[e3 b925/2315] loss=1.6559 avg=1.4972 it/s=364.4
[e3 b1156/2315] loss=1.7723 avg=1.4964 it/s=365.9
[e3 b1387/2315] loss=1.3913 avg=1.4962 it/s=366.2
[e3 b1618/2315] loss=1.4368 avg=1.4970 it/s=367.3
[e3 b1849/2315] loss=1.5313 avg=1.4964 it/s=367.9
[e3 b2080/2315] loss=1.4332 avg=1.4964 it/s=369.1
[e3 b2311/2315] loss=1.6576 avg=1.4965 it/s=371.1


Epoch 3/7 | loss=1.4964 | val_acc=0.2775 | val_f1=0.0869 | time=104.4s
[e4 b1/2315] loss=1.4178 avg=1.4178 it/s=391.4
[e4 b2/2315] loss=1.4189 avg=1.4183 it/s=390.7


[e4 b232/2315] loss=1.4883 avg=1.4874 it/s=366.6
[e4 b463/2315] loss=1.6446 avg=1.5010 it/s=351.9
[e4 b694/2315] loss=1.3531 avg=1.4991 it/s=348.7
[e4 b925/2315] loss=1.5830 avg=1.4972 it/s=348.8
[e4 b1156/2315] loss=1.5329 avg=1.4970 it/s=352.6
[e4 b1387/2315] loss=1.6090 avg=1.4954 it/s=356.2
[e4 b1618/2315] loss=1.3518 avg=1.4955 it/s=356.6
[e4 b1849/2315] loss=1.6348 avg=1.4955 it/s=356.4
[e4 b2080/2315] loss=1.4195 avg=1.4964 it/s=357.7
[e4 b2311/2315] loss=1.4583 avg=1.4959 it/s=356.1


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Early stopping at epoch 4


metric,history
epoch,▁▁▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▃▃▃▃▆▆▆▆▆▆▆▆▆▆▆▆███████
lr,█▆▃▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▂▂▃▃▂▂▂▂▂▂▂▂▃▃▃▃▃▁▁▁▂▂▇▃▃█▁█▂▂▂▂▃▃▃▃▃
time/epoch_sec,▃▁▁█
train/avg_loss_so_far,▇▇▄▄▄▄▄▄▄▄▇▆▄▄▄▄▄▄▄▄█▅▄▄▄▄▄▄▄▄▁▁▃▄▄▄▄▄▄▄
train/epoch_loss,█▃▁▁
train/items_per_sec,▁▂▆▆▅▅▅▅▅▅▃▅▅▅▅▅▅▆▆▆▆█▅▅▅▅▅▅▅▅▆▆▅▅▅▅▅▅▅▅

metric,value
best_checkpoint_path,checkpoints_midf1\be...
best_val_f1,0.1448
best_val_mid_f1,0.1448
epoch,4
lr,0.00044
params/ratio,0.28177
params/total,278813189
params/trainable,78561029
step,9256
time/epoch_sec,108.59882


Best trial: 15. Best value: 0.876329:  90%|█████████ | 18/20 [3:08:08<19:15, 577.52s/it]

[Trial 17] f1=0.0869 | unfreeze_k=11 lr=9.73e-04 wd=2.3e-06 suggested_bs=8
[I 2025-08-17 20:56:11,225] Trial 17 finished with value: 0.08687713959680486 and parameters: {'num_unfreeze_last_layers': 11, 'lr': 0.0009725811140454004, 'weight_decay': 2.29271284636435e-06, 'batch_size': 8}. Best is trial 15 with value: 0.8763286523329038.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 64,385,285 / 278,813,189 (23.09%) ; unfreeze_last_k=9
[e1 b1/2315] loss=1.6700 avg=1.6700 it/s=262.7
[e1 b2/2315] loss=1.5500 avg=1.6100 it/s=300.7


[e1 b232/2315] loss=1.4500 avg=1.5251 it/s=415.1
[e1 b463/2315] loss=1.0492 avg=1.4244 it/s=407.0
[e1 b694/2315] loss=1.0734 avg=1.3174 it/s=409.0
[e1 b925/2315] loss=0.9229 avg=1.2491 it/s=404.4
[e1 b1156/2315] loss=1.3077 avg=1.1963 it/s=403.0
[e1 b1387/2315] loss=0.7822 avg=1.1527 it/s=401.7
[e1 b1618/2315] loss=1.2129 avg=1.1125 it/s=402.1
[e1 b1849/2315] loss=0.8336 avg=1.0786 it/s=402.4
[e1 b2080/2315] loss=0.7952 avg=1.0496 it/s=403.4
[e1 b2311/2315] loss=0.6197 avg=1.0242 it/s=404.9


Epoch 1/7 | loss=1.0235 | val_acc=0.7432 | val_f1=0.7494 | time=96.1s
[e2 b1/2315] loss=0.4566 avg=0.4566 it/s=361.1
[e2 b2/2315] loss=0.3759 avg=0.4162 it/s=340.3


[e2 b232/2315] loss=0.4065 avg=0.7214 it/s=419.9
[e2 b463/2315] loss=1.1747 avg=0.7230 it/s=421.8
[e2 b694/2315] loss=0.7187 avg=0.7215 it/s=421.9
[e2 b925/2315] loss=0.7976 avg=0.7172 it/s=419.3
[e2 b1156/2315] loss=0.8276 avg=0.7194 it/s=414.2
[e2 b1387/2315] loss=0.8591 avg=0.7138 it/s=410.0
[e2 b1618/2315] loss=0.7675 avg=0.7126 it/s=409.1
[e2 b1849/2315] loss=1.1426 avg=0.7089 it/s=410.9
[e2 b2080/2315] loss=0.8516 avg=0.7048 it/s=411.7
[e2 b2311/2315] loss=0.4289 avg=0.7020 it/s=412.3


Epoch 2/7 | loss=0.7022 | val_acc=0.7930 | val_f1=0.7825 | time=94.5s
[e3 b1/2315] loss=0.5390 avg=0.5390 it/s=384.3
[e3 b2/2315] loss=0.6119 avg=0.5755 it/s=379.1


[e3 b232/2315] loss=0.3831 avg=0.5935 it/s=411.6
[e3 b463/2315] loss=0.4575 avg=0.6005 it/s=382.9
[e3 b694/2315] loss=0.3675 avg=0.6080 it/s=335.2
[e3 b925/2315] loss=0.3493 avg=0.6045 it/s=308.1
[e3 b1156/2315] loss=0.4551 avg=0.6043 it/s=325.1
[e3 b1387/2315] loss=0.5211 avg=0.6080 it/s=321.8
[e3 b1618/2315] loss=0.4486 avg=0.6050 it/s=327.8
[e3 b1849/2315] loss=0.5613 avg=0.6008 it/s=333.5
[e3 b2080/2315] loss=0.6189 avg=0.5963 it/s=340.5
[e3 b2311/2315] loss=0.3625 avg=0.5938 it/s=346.9


Epoch 3/7 | loss=0.5937 | val_acc=0.8256 | val_f1=0.8310 | time=111.4s
[e4 b1/2315] loss=0.6803 avg=0.6803 it/s=366.8
[e4 b2/2315] loss=0.3776 avg=0.5289 it/s=383.6


[e4 b232/2315] loss=0.3853 avg=0.5163 it/s=406.4
[e4 b463/2315] loss=0.5433 avg=0.5205 it/s=358.7
[e4 b694/2315] loss=0.3997 avg=0.5212 it/s=319.0
[e4 b925/2315] loss=0.4810 avg=0.5249 it/s=303.2
[e4 b1156/2315] loss=0.4156 avg=0.5269 it/s=295.8
[e4 b1387/2315] loss=0.8996 avg=0.5273 it/s=287.1
[e4 b1618/2315] loss=0.3624 avg=0.5247 it/s=286.3
[e4 b1849/2315] loss=0.2584 avg=0.5233 it/s=283.4
[e4 b2080/2315] loss=0.4268 avg=0.5216 it/s=292.5
[e4 b2311/2315] loss=0.3350 avg=0.5200 it/s=300.0


Epoch 4/7 | loss=0.5200 | val_acc=0.8484 | val_f1=0.8538 | time=128.0s
[e5 b1/2315] loss=0.3926 avg=0.3926 it/s=368.0
[e5 b2/2315] loss=0.3205 avg=0.3566 it/s=378.8


[e5 b232/2315] loss=0.5078 avg=0.4684 it/s=407.8
[e5 b463/2315] loss=0.6210 avg=0.4703 it/s=383.0
[e5 b694/2315] loss=0.2611 avg=0.4708 it/s=331.4
[e5 b925/2315] loss=0.4972 avg=0.4662 it/s=318.8
[e5 b1156/2315] loss=0.3055 avg=0.4657 it/s=334.5
[e5 b1387/2315] loss=0.6059 avg=0.4635 it/s=342.8
[e5 b1618/2315] loss=0.4560 avg=0.4599 it/s=348.2
[e5 b1849/2315] loss=0.2458 avg=0.4593 it/s=354.3
[e5 b2080/2315] loss=0.2819 avg=0.4589 it/s=360.6
[e5 b2311/2315] loss=0.5226 avg=0.4583 it/s=366.1


Epoch 5/7 | loss=0.4582 | val_acc=0.8673 | val_f1=0.8708 | time=105.7s
[e6 b1/2315] loss=0.3400 avg=0.3400 it/s=414.0
[e6 b2/2315] loss=0.5729 avg=0.4564 it/s=417.5


[e6 b232/2315] loss=0.2943 avg=0.4160 it/s=419.4
[e6 b463/2315] loss=0.2911 avg=0.4132 it/s=422.1
[e6 b694/2315] loss=0.4893 avg=0.4157 it/s=402.5
[e6 b925/2315] loss=0.5473 avg=0.4125 it/s=355.2
[e6 b1156/2315] loss=0.3163 avg=0.4119 it/s=366.9
[e6 b1387/2315] loss=0.2283 avg=0.4091 it/s=375.8
[e6 b1618/2315] loss=0.4045 avg=0.4105 it/s=381.2
[e6 b1849/2315] loss=0.6442 avg=0.4075 it/s=380.9
[e6 b2080/2315] loss=0.4381 avg=0.4066 it/s=381.0
[e6 b2311/2315] loss=0.2352 avg=0.4049 it/s=380.2


Epoch 6/7 | loss=0.4049 | val_acc=0.8608 | val_f1=0.8646 | time=102.2s
[e7 b1/2315] loss=0.3567 avg=0.3567 it/s=424.0
[e7 b2/2315] loss=0.4434 avg=0.4000 it/s=500.8


[e7 b232/2315] loss=0.1934 avg=0.3647 it/s=414.8
[e7 b463/2315] loss=0.7131 avg=0.3663 it/s=376.5
[e7 b694/2315] loss=0.4099 avg=0.3653 it/s=353.4
[e7 b925/2315] loss=0.3282 avg=0.3637 it/s=352.5
[e7 b1156/2315] loss=0.2050 avg=0.3637 it/s=345.2
[e7 b1387/2315] loss=0.4780 avg=0.3665 it/s=339.9
[e7 b1618/2315] loss=0.2018 avg=0.3658 it/s=346.8
[e7 b1849/2315] loss=0.3226 avg=0.3656 it/s=355.1
[e7 b2080/2315] loss=0.3388 avg=0.3640 it/s=361.0
[e7 b2311/2315] loss=0.1863 avg=0.3639 it/s=363.7


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Epoch 7/7 | loss=0.3641 | val_acc=0.8632 | val_f1=0.8666 | time=106.4s


metric,history
epoch,▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇████████
lr,█▇▆▅▃▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▂▂▂▂▂▁▁▂▂▁▄▁▂▂▂▅▁▁▅▂▂▂▁▁▁▁▂▂▂▂▇▁▁▁▁▁█
time/epoch_sec,▁▁▅█▃▃▃
train/avg_loss_so_far,█▇▆▆▆▅▂▁▃▃▃▃▃▂▂▂▂▂▂▂▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁
train/epoch_loss,█▅▃▃▂▁▁
train/items_per_sec,▇▇▇▇▇██▇▇▇▃▃▃▅▅▁▁▅▆▇▃▄▅▅▅███▅▅▆▆▆▆██▆▄▄▅

metric,value
best_checkpoint_path,checkpoints_midf1\be...
best_val_f1,0.86123
best_val_mid_f1,0.86123
epoch,7
lr,0
params/ratio,0.23093
params/total,278813189
params/trainable,64385285
step,16201
time/epoch_sec,106.38663


Best trial: 15. Best value: 0.876329:  95%|█████████▌| 19/20 [3:20:46<10:31, 631.72s/it]

[Trial 18] f1=0.8708 | unfreeze_k=9 lr=6.36e-05 wd=7.9e-07 suggested_bs=8
[I 2025-08-17 21:08:49,200] Trial 18 finished with value: 0.8707660920187703 and parameters: {'num_unfreeze_last_layers': 9, 'lr': 6.361064552508168e-05, 'weight_decay': 7.892964547831696e-07, 'batch_size': 8}. Best is trial 15 with value: 0.8763286523329038.


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 78,561,029 / 278,813,189 (28.18%) ; unfreeze_last_k=11
[e1 b1/2315] loss=1.6019 avg=1.6019 it/s=226.6
[e1 b2/2315] loss=1.5452 avg=1.5735 it/s=255.7


[e1 b232/2315] loss=1.6070 avg=1.5548 it/s=357.7
[e1 b463/2315] loss=1.3850 avg=1.5079 it/s=358.5
[e1 b694/2315] loss=1.3630 avg=1.4823 it/s=358.8
[e1 b925/2315] loss=0.9542 avg=1.4355 it/s=362.3
[e1 b1156/2315] loss=1.3222 avg=1.3682 it/s=366.8
[e1 b1387/2315] loss=0.8654 avg=1.3106 it/s=368.9
[e1 b1618/2315] loss=1.0405 avg=1.2579 it/s=371.2
[e1 b1849/2315] loss=0.7652 avg=1.2099 it/s=371.8
[e1 b2080/2315] loss=1.1210 avg=1.1717 it/s=373.1
[e1 b2311/2315] loss=0.7185 avg=1.1354 it/s=372.4


Epoch 1/7 | loss=1.1347 | val_acc=0.7519 | val_f1=0.7575 | time=104.2s
[e2 b1/2315] loss=0.7779 avg=0.7779 it/s=378.6
[e2 b2/2315] loss=0.6491 avg=0.7135 it/s=380.7


[e2 b232/2315] loss=0.5072 avg=0.7492 it/s=360.0
[e2 b463/2315] loss=0.7730 avg=0.7593 it/s=363.5
[e2 b694/2315] loss=0.6168 avg=0.7454 it/s=369.3
[e2 b925/2315] loss=0.5957 avg=0.7355 it/s=371.9
[e2 b1156/2315] loss=0.4291 avg=0.7249 it/s=373.2
[e2 b1387/2315] loss=0.5943 avg=0.7206 it/s=375.2
[e2 b1618/2315] loss=0.6351 avg=0.7169 it/s=376.4
[e2 b1849/2315] loss=1.0557 avg=0.7107 it/s=377.7
[e2 b2080/2315] loss=0.5256 avg=0.7034 it/s=378.9
[e2 b2311/2315] loss=0.9365 avg=0.6969 it/s=379.9


Epoch 2/7 | loss=0.6968 | val_acc=0.8202 | val_f1=0.8252 | time=102.1s
[e3 b1/2315] loss=0.6268 avg=0.6268 it/s=411.6
[e3 b2/2315] loss=0.3765 avg=0.5017 it/s=365.7


[e3 b232/2315] loss=0.6070 avg=0.5967 it/s=354.0
[e3 b463/2315] loss=0.7024 avg=0.5995 it/s=345.5
[e3 b694/2315] loss=0.6969 avg=0.5981 it/s=346.7
[e3 b925/2315] loss=0.4670 avg=0.5935 it/s=353.6
[e3 b1156/2315] loss=0.3206 avg=0.5909 it/s=358.7
[e3 b1387/2315] loss=0.6549 avg=0.5885 it/s=362.3
[e3 b1618/2315] loss=0.5566 avg=0.5853 it/s=361.0
[e3 b1849/2315] loss=0.4816 avg=0.5809 it/s=359.1
[e3 b2080/2315] loss=0.7244 avg=0.5791 it/s=358.3
[e3 b2311/2315] loss=0.5066 avg=0.5755 it/s=361.1


Epoch 3/7 | loss=0.5754 | val_acc=0.8297 | val_f1=0.8346 | time=107.1s
[e4 b1/2315] loss=0.4757 avg=0.4757 it/s=387.7
[e4 b2/2315] loss=0.5356 avg=0.5057 it/s=354.2


[e4 b232/2315] loss=0.4226 avg=0.5144 it/s=386.7
[e4 b463/2315] loss=0.5942 avg=0.5237 it/s=380.8
[e4 b694/2315] loss=0.2551 avg=0.5170 it/s=377.4
[e4 b925/2315] loss=0.3168 avg=0.5132 it/s=374.3
[e4 b1156/2315] loss=0.6630 avg=0.5112 it/s=373.8
[e4 b1387/2315] loss=1.0948 avg=0.5060 it/s=373.0
[e4 b1618/2315] loss=0.5901 avg=0.5054 it/s=371.4
[e4 b1849/2315] loss=0.5337 avg=0.5038 it/s=370.4
[e4 b2080/2315] loss=0.6858 avg=0.5045 it/s=371.5
[e4 b2311/2315] loss=0.2837 avg=0.5023 it/s=371.9


Epoch 4/7 | loss=0.5022 | val_acc=0.8472 | val_f1=0.8517 | time=104.2s
[e5 b1/2315] loss=0.5320 avg=0.5320 it/s=285.1
[e5 b2/2315] loss=0.5818 avg=0.5569 it/s=328.4


[e5 b232/2315] loss=0.6266 avg=0.4812 it/s=381.4
[e5 b463/2315] loss=0.2766 avg=0.4793 it/s=378.3
[e5 b694/2315] loss=0.6094 avg=0.4720 it/s=380.5
[e5 b925/2315] loss=0.4356 avg=0.4685 it/s=375.6
[e5 b1156/2315] loss=0.5469 avg=0.4690 it/s=371.1
[e5 b1387/2315] loss=0.4242 avg=0.4656 it/s=369.0
[e5 b1618/2315] loss=0.3562 avg=0.4643 it/s=368.7
[e5 b1849/2315] loss=0.6112 avg=0.4649 it/s=368.2
[e5 b2080/2315] loss=0.2231 avg=0.4643 it/s=364.4
[e5 b2311/2315] loss=0.5629 avg=0.4633 it/s=365.6


Epoch 5/7 | loss=0.4634 | val_acc=0.8336 | val_f1=0.8372 | time=106.0s
[e6 b1/2315] loss=1.2789 avg=1.2789 it/s=385.4
[e6 b2/2315] loss=0.2670 avg=0.7729 it/s=437.5


[e6 b232/2315] loss=0.4764 avg=0.4460 it/s=379.2
[e6 b463/2315] loss=0.4214 avg=0.4345 it/s=377.9
[e6 b694/2315] loss=0.5944 avg=0.4412 it/s=378.6
[e6 b925/2315] loss=0.5741 avg=0.4381 it/s=380.2
[e6 b1156/2315] loss=0.3822 avg=0.4409 it/s=374.3
[e6 b1387/2315] loss=0.7623 avg=0.4413 it/s=369.2
[e6 b1618/2315] loss=0.7404 avg=0.4390 it/s=364.4
[e6 b1849/2315] loss=0.3665 avg=0.4373 it/s=362.7
[e6 b2080/2315] loss=0.3933 avg=0.4354 it/s=363.6
[e6 b2311/2315] loss=0.3808 avg=0.4342 it/s=365.4


Epoch 6/7 | loss=0.4344 | val_acc=0.8579 | val_f1=0.8615 | time=106.2s
[e7 b1/2315] loss=0.2107 avg=0.2107 it/s=338.4
[e7 b2/2315] loss=0.4453 avg=0.3280 it/s=281.6


[e7 b232/2315] loss=0.3662 avg=0.4126 it/s=345.2
[e7 b463/2315] loss=0.3040 avg=0.4206 it/s=343.0
[e7 b694/2315] loss=0.2772 avg=0.4154 it/s=355.1
[e7 b925/2315] loss=0.3097 avg=0.4150 it/s=362.8
[e7 b1156/2315] loss=0.4270 avg=0.4165 it/s=366.6
[e7 b1387/2315] loss=0.5602 avg=0.4154 it/s=364.8
[e7 b1618/2315] loss=0.2470 avg=0.4124 it/s=365.3
[e7 b1849/2315] loss=0.2069 avg=0.4101 it/s=364.0
[e7 b2080/2315] loss=0.4347 avg=0.4097 it/s=365.7
[e7 b2311/2315] loss=0.2933 avg=0.4103 it/s=364.8


wandb: ERROR The nbformat package was not found. It is required to save notebook history.


Epoch 7/7 | loss=0.4107 | val_acc=0.8547 | val_f1=0.8581 | time=106.2s


metric,history
epoch,▁▁▁▁▁▁▂▂▂▂▂▂▂▂▃▃▃▃▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇▇█████
lr,█▇▆▅▃▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▁▁▁▂▁▁▁▂▁▂▂▂▂▁▂▄▂▂▂▂▂▂▁▅▂▆▆▂▂▂▇▁▁▂▂▂█
time/epoch_sec,▄▁█▄▆▇▇
train/avg_loss_so_far,██▇▇▇▆▄▄▄▄▄▄▄▃▃▃▃▃▃▃▃▂▃▂▂▆▄▂▂▂▂▂▂▁▂▂▂▂▂▂
train/epoch_loss,█▄▃▂▂▁▁
train/items_per_sec,▁▇▇▇▇██▇▇███▇▆▆▇▇▇▇▆████▇▃▅██▇██▇▆▂▆▇▇▇▇

metric,value
best_checkpoint_path,checkpoints_midf1\be...
best_val_f1,0.84849
best_val_mid_f1,0.84849
epoch,7
lr,0
params/ratio,0.28177
params/total,278813189
params/trainable,78561029
step,16201
time/epoch_sec,106.20208


Best trial: 15. Best value: 0.876329: 100%|██████████| 20/20 [3:33:15<00:00, 639.78s/it]

[Trial 19] f1=0.8615 | unfreeze_k=11 lr=1.54e-05 wd=6.3e-06 suggested_bs=8
[I 2025-08-17 21:21:18,523] Trial 19 finished with value: 0.8615373448847018 and parameters: {'num_unfreeze_last_layers': 11, 'lr': 1.5441131471553194e-05, 'weight_decay': 6.254940759615576e-06, 'batch_size': 8}. Best is trial 15 with value: 0.8763286523329038.
Best trial: 15 F1: 0.8763286523329038





## 📊 Results – Weighted Loss Experiment (mid-class emphasis)

This second run (full training in the Exercise 4 style, plus a class-weighted loss) gave **more consistent results** than the unweighted loss:

- The **best trial** (trial 15) reached **val/F1 = 0.876** and **mid-class F1 = 0.867**,  
  which is stronger than our earlier unweighted runs.  
- The **top trials (11, 15, 8, 13)** all sit around **0.86–0.87 F1**,  
  showing that the weighting helped stabilize the optimization and reduce variance.  
- As expected, the **best LRs are still in the range 2.5e-5 – 7.5e-5** with small weight decay (≈1e-6–1e-5).  
- Best unfreeze depth remains **10–12 layers**, similar to before.  
- Almost all successful runs used **batch size = 8**, which suggests that a small batch gives stable convergence here (the batch-size-4 run in trial 16 reached only F1 = 0.356).  
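
The mid-class F1 quoted above can be read as a macro-average of the per-class F1 scores over the three middle sentiment labels. The sketch below is a minimal, dependency-free illustration under that assumption. The label indices `(1, 2, 3)` and the helper names `per_class_f1` and `mid_class_f1` are hypothetical and are not taken from the notebook's training code.

```python
from typing import Sequence


def per_class_f1(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """F1 for one class, treating that class as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never appears.
    return 0.0 if denom == 0 else 2 * tp / denom


def mid_class_f1(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    mid_classes: Sequence[int] = (1, 2, 3),  # assumed indices of the mid classes
) -> float:
    """Unweighted mean of the per-class F1 over the assumed mid classes."""
    scores = [per_class_f1(y_true, y_pred, c) for c in mid_classes]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy labels, not data from the project.
    y_true = [0, 1, 2, 3, 4, 1, 2, 3]
    y_pred = [0, 1, 2, 2, 4, 1, 3, 3]
    print(round(mid_class_f1(y_true, y_pred), 4))  # -> 0.6667
```

Tracking this number next to the overall F1 makes it easier to see whether a change helps the middle labels specifically or only the extremes.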

---

### 🚀 Key Takeaways
- Weighted loss **did boost performance on the mid classes**, which were previously weaker.  
- Even though the absolute F1 gains are modest, this approach reduced the gap between the extreme and the mid sentiment classes.  
- Bad trials (very high learning rate close to 1e-3, or shallow unfreezing <8) still collapse (best mid-class F1 ~0.14; for example, trial 17 with lr = 9.7e-4 finished at val F1 = 0.087),  
  but **the good region is now clearer and more reproducible**.  
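
One way to build a loss that emphasizes the mid classes is to start from inverse-frequency class weights and multiply the mid-class entries by an extra factor. The sketch below illustrates that idea only. The function name `mid_weighted_class_weights`, the boost factor `1.5`, and the mid-class indices `(1, 2, 3)` are assumptions for illustration and may differ from the weighting actually used in this notebook.

```python
from collections import Counter
from typing import List, Sequence


def mid_weighted_class_weights(
    labels: Sequence[int],
    num_classes: int = 5,
    mid_classes: Sequence[int] = (1, 2, 3),  # assumed indices of the mid classes
    mid_boost: float = 1.5,                  # hypothetical emphasis factor
) -> List[float]:
    """Inverse-frequency class weights with an extra boost on the mid classes."""
    if len(labels) == 0:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    n = len(labels)
    weights = []
    for c in range(num_classes):
        # Inverse frequency; max(..., 1) avoids division by zero for unseen classes.
        w = n / (num_classes * max(counts.get(c, 0), 1))
        if c in mid_classes:
            w *= mid_boost
        weights.append(w)
    # Rescale so the weights average to 1, keeping the loss scale comparable.
    mean_w = sum(weights) / num_classes
    return [w / mean_w for w in weights]


if __name__ == "__main__":
    # Toy label distribution, not the project's dataset.
    toy_labels = [0] * 4 + [1] * 2 + [2] * 2 + [3] * 2 + [4] * 4
    print([round(w, 4) for w in mid_weighted_class_weights(toy_labels)])
```

In a PyTorch training loop, a list like this could be converted to a float tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`, which rescales the loss contribution of each class.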

---

Next, we retrain the best configuration and check its test results.
  


In [4]:


with open(os.path.join("checkpoints_midf1", "best_hparams_optuna_midf1.json")) as f:
    best_hparams = json.load(f)

# give a distinct name for the final run
best_hparams["run_name"] = f"{BASE_RUN_NAME}_best_optuna_weighed_classes"

# (optional) bump epochs here; see guidance below
best_hparams["epochs"] = 15      # higher cap
best_hparams["patience"] = 4     # unchanged
# The final training call, train_one_run(best_hparams), runs in the next cell.
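
The interaction between the higher epoch cap and `patience` can be sketched with a small simulation. This is a minimal illustration that assumes the monitored metric is a validation F1 where higher is better, and that training stops after `patience` consecutive epochs without improvement. The helper name `simulate_early_stopping` is hypothetical, and the exact stopping rule inside `train_one_run` may differ.

```python
from typing import List, Optional, Tuple


def simulate_early_stopping(
    val_scores: List[float],
    patience: int,
    min_delta: float = 0.0,  # assumed: any strict improvement resets the counter
) -> Tuple[int, Optional[int]]:
    """Return (best_epoch, stop_epoch), both 1-indexed.

    stop_epoch is None when the stopping rule never triggers.
    """
    best_score = float("-inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score + min_delta:
            best_score, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, None


if __name__ == "__main__":
    # val_f1 for epochs 1-11 of the retraining run, copied from the log below.
    val_f1 = [0.6834, 0.8212, 0.8195, 0.8412, 0.8474, 0.8607,
              0.8646, 0.8679, 0.8755, 0.8735, 0.8719]
    print(simulate_early_stopping(val_f1, patience=4))  # -> (9, None)
```

Under these assumptions, the best epoch so far is epoch 9 and only two epochs have passed without improvement, so a patience of 4 would not yet have stopped the run at epoch 11.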


In [5]:
# Retrain best config to get a clean checkpoint
best_ckpt, _ = train_one_run(best_hparams)
best_path = best_ckpt


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable params: 78,561,029 / 278,813,189 (28.18%) ; unfreeze_last_k=11
[e1 b1/2315] loss=1.6515 avg=1.6515 it/s=239.0
[e1 b2/2315] loss=1.6300 avg=1.6407 it/s=316.0


[e1 b232/2315] loss=1.4546 avg=1.5346 it/s=328.2
[e1 b463/2315] loss=1.2486 avg=1.4866 it/s=346.9
[e1 b694/2315] loss=1.2882 avg=1.4268 it/s=350.5
[e1 b925/2315] loss=0.9568 avg=1.3558 it/s=352.5
[e1 b1156/2315] loss=0.5954 avg=1.2882 it/s=355.5
[e1 b1387/2315] loss=1.0275 avg=1.2351 it/s=358.5
[e1 b1618/2315] loss=0.8343 avg=1.1911 it/s=362.1
[e1 b1849/2315] loss=0.6917 avg=1.1536 it/s=358.7
[e1 b2080/2315] loss=0.7476 avg=1.1215 it/s=336.6
[e1 b2311/2315] loss=0.9310 avg=1.0926 it/s=340.8


Epoch 1/15 | loss=1.0923 | val_acc=0.6919 | val_f1=0.6834 | time=113.2s
[e2 b1/2315] loss=0.7860 avg=0.7860 it/s=281.8
[e2 b2/2315] loss=0.6493 avg=0.7176 it/s=319.0


[e2 b232/2315] loss=0.9844 avg=0.7798 it/s=383.3
[e2 b463/2315] loss=0.7437 avg=0.7647 it/s=383.1
[e2 b694/2315] loss=0.9229 avg=0.7570 it/s=376.2
[e2 b925/2315] loss=0.6513 avg=0.7501 it/s=366.9
[e2 b1156/2315] loss=0.6151 avg=0.7455 it/s=360.0
[e2 b1387/2315] loss=0.9574 avg=0.7377 it/s=362.3
[e2 b1618/2315] loss=0.7503 avg=0.7301 it/s=364.9
[e2 b1849/2315] loss=0.4329 avg=0.7224 it/s=366.7
[e2 b2080/2315] loss=0.4973 avg=0.7182 it/s=366.8
[e2 b2311/2315] loss=0.9460 avg=0.7146 it/s=365.2


Epoch 2/15 | loss=0.7147 | val_acc=0.8141 | val_f1=0.8212 | time=106.2s
[e3 b1/2315] loss=0.5274 avg=0.5274 it/s=349.9
[e3 b2/2315] loss=0.4017 avg=0.4646 it/s=335.4


[e3 b232/2315] loss=0.5802 avg=0.5970 it/s=367.2
[e3 b463/2315] loss=0.4904 avg=0.6080 it/s=378.9
[e3 b694/2315] loss=0.3945 avg=0.6007 it/s=382.9
[e3 b925/2315] loss=0.6966 avg=0.6011 it/s=382.2
[e3 b1156/2315] loss=0.4373 avg=0.6028 it/s=378.6
[e3 b1387/2315] loss=0.7323 avg=0.6024 it/s=377.1
[e3 b1618/2315] loss=0.5385 avg=0.6016 it/s=374.5
[e3 b1849/2315] loss=0.3216 avg=0.6003 it/s=373.6
[e3 b2080/2315] loss=0.3795 avg=0.5995 it/s=373.3
[e3 b2311/2315] loss=0.3147 avg=0.5987 it/s=373.0


Epoch 3/15 | loss=0.5985 | val_acc=0.8178 | val_f1=0.8195 | time=103.9s
[e4 b1/2315] loss=0.4736 avg=0.4736 it/s=419.0
[e4 b2/2315] loss=0.3397 avg=0.4067 it/s=350.1


[e4 b232/2315] loss=0.7414 avg=0.5443 it/s=376.7
[e4 b463/2315] loss=0.5379 avg=0.5450 it/s=379.3
[e4 b694/2315] loss=0.3739 avg=0.5389 it/s=380.1
[e4 b925/2315] loss=0.3814 avg=0.5348 it/s=380.6
[e4 b1156/2315] loss=0.7599 avg=0.5332 it/s=382.0
[e4 b1387/2315] loss=0.3558 avg=0.5338 it/s=380.5
[e4 b1618/2315] loss=0.4625 avg=0.5325 it/s=377.4
[e4 b1849/2315] loss=0.5923 avg=0.5342 it/s=376.8
[e4 b2080/2315] loss=0.3626 avg=0.5327 it/s=377.0
[e4 b2311/2315] loss=0.5034 avg=0.5338 it/s=378.0


Epoch 4/15 | loss=0.5338 | val_acc=0.8389 | val_f1=0.8412 | time=102.5s
[e5 b1/2315] loss=0.6025 avg=0.6025 it/s=352.5
[e5 b2/2315] loss=0.4433 avg=0.5229 it/s=346.2


[e5 b232/2315] loss=0.3617 avg=0.4701 it/s=382.2
[e5 b463/2315] loss=0.2705 avg=0.4806 it/s=380.4
[e5 b694/2315] loss=0.7763 avg=0.4830 it/s=380.9
[e5 b925/2315] loss=0.5015 avg=0.4839 it/s=382.5
[e5 b1156/2315] loss=0.3797 avg=0.4873 it/s=383.9
[e5 b1387/2315] loss=0.2947 avg=0.4869 it/s=384.6
[e5 b1618/2315] loss=0.3788 avg=0.4857 it/s=384.3
[e5 b1849/2315] loss=0.7629 avg=0.4868 it/s=379.3
[e5 b2080/2315] loss=0.6802 avg=0.4872 it/s=374.5
[e5 b2311/2315] loss=0.7892 avg=0.4851 it/s=371.8


Epoch 5/15 | loss=0.4851 | val_acc=0.8435 | val_f1=0.8474 | time=104.2s
[e6 b1/2315] loss=0.6863 avg=0.6863 it/s=321.7
[e6 b2/2315] loss=0.4430 avg=0.5646 it/s=337.1


[e6 b232/2315] loss=0.2135 avg=0.4511 it/s=379.8
[e6 b463/2315] loss=0.1977 avg=0.4441 it/s=382.8
[e6 b694/2315] loss=0.2142 avg=0.4458 it/s=374.4
[e6 b925/2315] loss=0.3217 avg=0.4400 it/s=369.0
[e6 b1156/2315] loss=0.3781 avg=0.4386 it/s=365.7
[e6 b1387/2315] loss=0.7144 avg=0.4440 it/s=365.3
[e6 b1618/2315] loss=0.2255 avg=0.4429 it/s=363.8
[e6 b1849/2315] loss=0.3836 avg=0.4446 it/s=363.7
[e6 b2080/2315] loss=0.2567 avg=0.4447 it/s=363.8
[e6 b2311/2315] loss=0.7182 avg=0.4474 it/s=363.2


Epoch 6/15 | loss=0.4475 | val_acc=0.8581 | val_f1=0.8607 | time=106.7s
[e7 b1/2315] loss=0.2814 avg=0.2814 it/s=338.3
[e7 b2/2315] loss=0.5742 avg=0.4278 it/s=375.0


[e7 b232/2315] loss=0.2220 avg=0.4152 it/s=375.1
[e7 b463/2315] loss=0.3796 avg=0.4062 it/s=367.8
[e7 b694/2315] loss=0.3240 avg=0.4065 it/s=369.6
[e7 b925/2315] loss=0.5802 avg=0.4121 it/s=369.4
[e7 b1156/2315] loss=0.4060 avg=0.4127 it/s=372.6
[e7 b1387/2315] loss=0.3994 avg=0.4150 it/s=374.0
[e7 b1618/2315] loss=0.3130 avg=0.4143 it/s=375.0
[e7 b1849/2315] loss=0.9441 avg=0.4171 it/s=375.6
[e7 b2080/2315] loss=0.5803 avg=0.4162 it/s=376.2
[e7 b2311/2315] loss=0.3659 avg=0.4152 it/s=374.0


Epoch 7/15 | loss=0.4151 | val_acc=0.8622 | val_f1=0.8646 | time=103.7s
[e8 b1/2315] loss=0.3302 avg=0.3302 it/s=409.5
[e8 b2/2315] loss=0.3720 avg=0.3511 it/s=393.1


[e8 b232/2315] loss=0.8476 avg=0.3722 it/s=358.2
[e8 b463/2315] loss=0.2176 avg=0.3800 it/s=365.8
[e8 b694/2315] loss=0.4325 avg=0.3893 it/s=372.4
[e8 b925/2315] loss=0.2408 avg=0.3912 it/s=373.2
[e8 b1156/2315] loss=0.3664 avg=0.3933 it/s=374.3
[e8 b1387/2315] loss=0.2294 avg=0.3931 it/s=373.3
[e8 b1618/2315] loss=0.2881 avg=0.3939 it/s=374.2
[e8 b1849/2315] loss=0.3368 avg=0.3920 it/s=375.2
[e8 b2080/2315] loss=0.3969 avg=0.3914 it/s=375.3
[e8 b2311/2315] loss=0.4150 avg=0.3910 it/s=375.7


Epoch 8/15 | loss=0.3911 | val_acc=0.8644 | val_f1=0.8679 | time=103.1s
[e9 b1/2315] loss=0.2956 avg=0.2956 it/s=324.9
[e9 b2/2315] loss=0.4088 avg=0.3522 it/s=353.8


[e9 b232/2315] loss=0.2344 avg=0.3639 it/s=364.0
[e9 b463/2315] loss=0.4339 avg=0.3609 it/s=350.6
[e9 b694/2315] loss=0.1968 avg=0.3607 it/s=342.4
[e9 b925/2315] loss=0.3294 avg=0.3614 it/s=350.4
[e9 b1156/2315] loss=0.2863 avg=0.3590 it/s=355.8
[e9 b1387/2315] loss=0.2383 avg=0.3606 it/s=358.3
[e9 b1618/2315] loss=0.6083 avg=0.3611 it/s=357.6
[e9 b1849/2315] loss=0.5238 avg=0.3615 it/s=355.9
[e9 b2080/2315] loss=0.2077 avg=0.3638 it/s=353.9
[e9 b2311/2315] loss=0.2038 avg=0.3623 it/s=353.7


Epoch 9/15 | loss=0.3622 | val_acc=0.8727 | val_f1=0.8755 | time=109.4s
[e10 b1/2315] loss=0.3582 avg=0.3582 it/s=277.5
[e10 b2/2315] loss=0.2633 avg=0.3107 it/s=306.2


[e10 b232/2315] loss=0.2044 avg=0.3298 it/s=339.7
[e10 b463/2315] loss=0.2034 avg=0.3402 it/s=343.7
[e10 b694/2315] loss=0.1866 avg=0.3380 it/s=348.4
[e10 b925/2315] loss=0.2044 avg=0.3402 it/s=349.2
[e10 b1156/2315] loss=0.4210 avg=0.3376 it/s=352.8
[e10 b1387/2315] loss=0.2004 avg=0.3370 it/s=354.6
[e10 b1618/2315] loss=0.4996 avg=0.3386 it/s=355.6
[e10 b1849/2315] loss=0.2118 avg=0.3396 it/s=355.4
[e10 b2080/2315] loss=0.3073 avg=0.3383 it/s=357.6
[e10 b2311/2315] loss=0.4265 avg=0.3379 it/s=359.8




Epoch 10/15 | loss=0.3379 | val_acc=0.8700 | val_f1=0.8735 | time=107.5s
[e11 b1/2315] loss=0.3809 avg=0.3809 it/s=378.6
[e11 b2/2315] loss=0.4327 avg=0.4068 it/s=429.1




[e11 b232/2315] loss=0.5474 avg=0.3168 it/s=382.7
[e11 b463/2315] loss=0.6194 avg=0.3186 it/s=385.2
[e11 b694/2315] loss=0.3039 avg=0.3256 it/s=383.4
[e11 b925/2315] loss=0.6494 avg=0.3226 it/s=378.0
[e11 b1156/2315] loss=0.2228 avg=0.3210 it/s=375.6
[e11 b1387/2315] loss=0.2011 avg=0.3222 it/s=374.9
[e11 b1618/2315] loss=0.2174 avg=0.3206 it/s=377.0
[e11 b1849/2315] loss=0.5074 avg=0.3200 it/s=377.9
[e11 b2080/2315] loss=0.3983 avg=0.3197 it/s=359.5
[e11 b2311/2315] loss=0.1989 avg=0.3180 it/s=348.7




Epoch 11/15 | loss=0.3182 | val_acc=0.8690 | val_f1=0.8719 | time=113.9s
[e12 b1/2315] loss=0.4724 avg=0.4724 it/s=272.7
[e12 b2/2315] loss=0.4184 avg=0.4454 it/s=290.8




[e12 b232/2315] loss=0.2179 avg=0.3048 it/s=278.6
[e12 b463/2315] loss=0.6319 avg=0.3046 it/s=297.5
[e12 b694/2315] loss=0.2103 avg=0.2988 it/s=270.3
[e12 b925/2315] loss=0.2134 avg=0.2996 it/s=282.0
[e12 b1156/2315] loss=0.2221 avg=0.3009 it/s=293.8
[e12 b1387/2315] loss=0.2212 avg=0.2997 it/s=304.9
[e12 b1618/2315] loss=0.6515 avg=0.3015 it/s=314.1
[e12 b1849/2315] loss=0.2259 avg=0.3009 it/s=321.4
[e12 b2080/2315] loss=0.2932 avg=0.3004 it/s=325.1
[e12 b2311/2315] loss=0.2172 avg=0.2993 it/s=328.7




Epoch 12/15 | loss=0.2992 | val_acc=0.8810 | val_f1=0.8837 | time=117.3s
[e13 b1/2315] loss=0.2021 avg=0.2021 it/s=284.8
[e13 b2/2315] loss=0.4142 avg=0.3081 it/s=267.7




[e13 b232/2315] loss=0.5314 avg=0.2878 it/s=363.2
[e13 b463/2315] loss=0.2171 avg=0.2871 it/s=363.5
[e13 b694/2315] loss=0.2032 avg=0.2857 it/s=361.6
[e13 b925/2315] loss=0.1906 avg=0.2851 it/s=348.4
[e13 b1156/2315] loss=0.4455 avg=0.2850 it/s=331.6
[e13 b1387/2315] loss=0.6391 avg=0.2858 it/s=336.3
[e13 b1618/2315] loss=0.2255 avg=0.2845 it/s=340.2
[e13 b1849/2315] loss=0.2034 avg=0.2829 it/s=344.5
[e13 b2080/2315] loss=0.1958 avg=0.2823 it/s=345.4
[e13 b2311/2315] loss=0.2252 avg=0.2828 it/s=348.6




Epoch 13/15 | loss=0.2828 | val_acc=0.8729 | val_f1=0.8762 | time=110.9s
[e14 b1/2315] loss=0.3902 avg=0.3902 it/s=370.2
[e14 b2/2315] loss=0.2036 avg=0.2969 it/s=396.3




[e14 b232/2315] loss=0.5252 avg=0.2781 it/s=381.3
[e14 b463/2315] loss=0.2478 avg=0.2762 it/s=355.0
[e14 b694/2315] loss=0.2752 avg=0.2776 it/s=308.5
[e14 b925/2315] loss=0.2102 avg=0.2738 it/s=282.7
[e14 b1156/2315] loss=0.4516 avg=0.2728 it/s=295.1
[e14 b1387/2315] loss=0.5546 avg=0.2731 it/s=305.4
[e14 b1618/2315] loss=0.2114 avg=0.2719 it/s=313.7
[e14 b1849/2315] loss=0.3097 avg=0.2705 it/s=321.1
[e14 b2080/2315] loss=0.4334 avg=0.2708 it/s=327.2
[e14 b2311/2315] loss=0.2371 avg=0.2706 it/s=332.3




Epoch 14/15 | loss=0.2705 | val_acc=0.8751 | val_f1=0.8780 | time=116.0s
[e15 b1/2315] loss=0.2192 avg=0.2192 it/s=386.4
[e15 b2/2315] loss=0.2025 avg=0.2109 it/s=394.3




[e15 b232/2315] loss=0.2093 avg=0.2586 it/s=389.4
[e15 b463/2315] loss=0.2093 avg=0.2582 it/s=388.4
[e15 b694/2315] loss=0.2091 avg=0.2580 it/s=384.4
[e15 b925/2315] loss=0.2029 avg=0.2587 it/s=340.0
[e15 b1156/2315] loss=0.2098 avg=0.2601 it/s=321.4
[e15 b1387/2315] loss=0.1895 avg=0.2619 it/s=323.9
[e15 b1618/2315] loss=0.2427 avg=0.2598 it/s=327.4
[e15 b1849/2315] loss=0.5179 avg=0.2611 it/s=333.2
[e15 b2080/2315] loss=0.4498 avg=0.2607 it/s=337.4
[e15 b2311/2315] loss=0.4474 avg=0.2611 it/s=341.7


[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


Epoch 15/15 | loss=0.2612 | val_acc=0.8763 | val_f1=0.8795 | time=113.2s


Run history:
epoch,▁▁▁▁▁▁▁▁▂▃▃▃▃▃▃▃▃▄▄▄▄▅▅▅▅▅▇▇▇▇▇▇▇▇▇▇▇███
lr,██▇▇▆▆▅▅▄▄▃▃▂▂▁
params/ratio,▁
params/total,▁
params/trainable,▁
step,▁▁▁▁▂▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁█▁▁▁
time/epoch_sec,▆▃▂▁▂▃▂▁▄▃▆█▅▇▆
train/avg_loss_so_far,█▆▃▄▄▃▃▃▂▃▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▂▁▁▁▁▁▁▁▁▁
train/epoch_loss,█▅▄▃▃▃▂▂▂▂▁▁▁▁▁
train/items_per_sec,▄▆▆▅▅▆▆▅▇▅█▇▇█▇▇▇▇▇▇▇▇▇▆▆▆▁▅▆▆▇▇▇▁▁▄▄▇▆█

Run summary:
best_checkpoint_path,checkpoints_midf1\be...
best_val_f1,0.87257
best_val_mid_f1,0.87257
epoch,15
lr,0
params/ratio,0.28177
params/total,278813189
params/trainable,78561029
step,34721
time/epoch_sec,113.21217




## 🧾 Final Test Results – Weighted Loss Run  

**Overall metrics**  
`acc = 0.8626 | f1_macro = 0.8658 | precision_macro = 0.8636 | recall_macro = 0.8686`  

---

## 🔄 Comparison with Ex.4 Baseline (no weighted loss)

| Label                | Ex.4 F1 | Weighted F1 | Change |
|-----------------------|---------|-------------|--------|
| extremely negative    | 0.89    | 0.88        | -0.01  |
| negative              | 0.87    | 0.85        | -0.02  |
| neutral               | 0.86    | 0.86        | ~0.00  |
| positive              | 0.84    | 0.85        | +0.01  |
| extremely positive    | 0.88    | 0.89        | +0.01  |
| **Macro avg**         | **0.87**| **0.87**    | ~0.00  |
| **Accuracy**          | **0.87**| **0.86**    | -0.01  |

---

### 📌 Takeaways
- **Weighted loss left overall macro F1 essentially unchanged (~0.87 in both runs)**, while accuracy dipped from 0.87 to 0.86.  
- **Positive** and **extremely positive** each gained about 0.01 F1.  
- **Negative** dropped by 0.02 and **extremely negative** by 0.01, suggesting the weighting shifted some attention away from the negative side.  
- Per-class F1 is slightly more even: the range across classes narrowed from 0.84–0.89 to 0.85–0.89, although among the mid classes (negative, neutral, positive) only positive improved.  

✅ In short: weighted loss **kept overall performance stable** and slightly **evened out per-class F1**, without a clear net gain.  
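
The per-class changes in the comparison table can be re-derived with a few lines of Python. This is a minimal sketch that uses the rounded two-decimal F1 values from the table, so the differences are only accurate to about ±0.01; the names `baseline_f1`, `weighted_f1`, and `f1_range` are illustrative and not part of the notebook.

```python
# Rounded per-class F1 scores copied from the comparison table (illustrative names).
LABELS = ["extremely negative", "negative", "neutral", "positive", "extremely positive"]
baseline_f1 = dict(zip(LABELS, [0.89, 0.87, 0.86, 0.84, 0.88]))  # Ex.4, no weighted loss
weighted_f1 = dict(zip(LABELS, [0.88, 0.85, 0.86, 0.85, 0.89]))  # weighted-loss run

# Per-class change (weighted minus baseline), rounded to the table's precision.
deltas = {lab: round(weighted_f1[lab] - baseline_f1[lab], 2) for lab in LABELS}


def f1_range(scores):
    """Gap between the best and worst class; smaller means more even performance."""
    return round(max(scores.values()) - min(scores.values()), 2)


for lab in LABELS:
    print(f"{lab:<20} {deltas[lab]:+.2f}")
print("range baseline:", f1_range(baseline_f1), "| range weighted:", f1_range(weighted_f1))
```

With these rounded inputs the best-to-worst gap narrows from 0.05 to 0.04, which is the slight balancing effect described in the takeaways.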



In [6]:
# # Retrain best config to get a clean checkpoint
# best_ckpt, _ = train_one_run(best_params)
# best_path = best_ckpt
# best_hparams and best_path are expected to be defined by an earlier cell
best_params = best_hparams
# -------------------------
# Final evaluation on TEST (+ W&B logging)
# -------------------------
model = build_model(best_params["num_unfreeze_last_layers"])
model.load_state_dict(torch.load(best_path, map_location=DEVICE))
model.eval()

all_preds, all_labels = [], []
with torch.no_grad():
    for batch in test_loader:
        batch = {k: v.to(DEVICE, non_blocking=True) for k, v in batch.items()}
        with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),
                      enabled=(DEVICE == "cuda" and USE_AMP)):
            logits = model(**batch).logits
        all_preds.extend(torch.argmax(logits, dim=-1).detach().cpu().tolist())
        all_labels.extend(batch["labels"].detach().cpu().tolist())

acc = accuracy_score(all_labels, all_preds)
p, r, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="macro", zero_division=0)
print(f"\nTEST | acc={acc:.4f} | f1_macro={f1:.4f} | precision_macro={p:.4f} | recall_macro={r:.4f}\n")

print("Per-class report (ids map to labels):")
print(ID2LABEL)
target_names = [ID2LABEL[i] for i in range(len(ORDER))]
# Dict form is used for per-class W&B logging; the text form is printed for the console.
report = classification_report(
    all_labels, all_preds,
    target_names=target_names,
    zero_division=0, output_dict=True
)
print(classification_report(
    all_labels, all_preds,
    target_names=target_names,
    zero_division=0
))



Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



TEST | acc=0.8626 | f1_macro=0.8658 | precision_macro=0.8636 | recall_macro=0.8686

Per-class report (ids map to labels):
{0: 'extremely negative', 1: 'negative', 2: 'neutral', 3: 'positive', 4: 'extremely positive'}
                    precision    recall  f1-score   support

extremely negative       0.86      0.90      0.88       592
          negative       0.86      0.84      0.85      1041
           neutral       0.84      0.89      0.86       619
          positive       0.86      0.83      0.85       947
extremely positive       0.90      0.88      0.89       599

          accuracy                           0.86      3798
         macro avg       0.86      0.87      0.87      3798
      weighted avg       0.86      0.86      0.86      3798



In [7]:
# ---- W&B: log test metrics, per-class scores, and confusion matrix ----
test_run = wandb.init(project=PROJECT, name=f"{BASE_RUN_NAME}_test_weighted_classes", resume="allow", reinit=True)
log_payload = {
    "test/acc": acc,
    "test/precision_macro": p,
    "test/recall_macro": r,
    "test/f1_macro": f1,
}
for cls_name in ORDER:
    if cls_name in report:
        log_payload[f"test/{cls_name}/precision"] = report[cls_name]["precision"]
        log_payload[f"test/{cls_name}/recall"]    = report[cls_name]["recall"]
        log_payload[f"test/{cls_name}/f1"]        = report[cls_name]["f1-score"]

wandb.log(log_payload)

cm = confusion_matrix(all_labels, all_preds, labels=list(range(len(ORDER))))
wandb.log({
    "test/confusion_matrix": wandb.plot.confusion_matrix(
        y_true=all_labels,
        preds=all_preds,
        class_names=[ID2LABEL[i] for i in range(len(ID2LABEL))]
    )
})
test_run.summary["best_checkpoint_path"] = best_path
test_run.summary["test_f1_macro"] = f1
wandb.finish()



[34m[1mwandb[0m: [32m[41mERROR[0m The nbformat package was not found. It is required to save notebook history.


Run history:
test/acc,▁
test/extremely negative/f1,▁
test/extremely negative/precision,▁
test/extremely negative/recall,▁
test/extremely positive/f1,▁
test/extremely positive/precision,▁
test/extremely positive/recall,▁
test/f1_macro,▁
test/negative/f1,▁
test/negative/precision,▁

Run summary:
best_checkpoint_path,checkpoints_midf1\be...
test/acc,0.86256
test/extremely negative/f1,0.87531
test/extremely negative/precision,0.85622
test/extremely negative/recall,0.89527
test/extremely positive/f1,0.89226
test/extremely positive/precision,0.89983
test/extremely positive/recall,0.88481
test/f1_macro,0.86578
test/negative/f1,0.85133


## ⚖️ Extra Trials – Ratio as HP

We also tried an extended search where the **mid/extreme weight ratio** (`ratio_mid_ext`) itself was treated as a **hyperparameter**, alongside the usual learning rate, weight decay, batch size, and unfreezing depth.  
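
To make the role of `ratio_mid_ext` concrete, here is a minimal pure-Python sketch of the tiered weighting: the two extreme classes get weight 1.0, the three mid classes get `ratio_mid_ext`, and the vector is rescaled so its mean is 1.0. It follows the same idea as the notebook's `make_tier_weights`, but uses plain lists instead of NumPy/PyTorch tensors; the function name `tier_weights` and its default arguments are illustrative assumptions.

```python
def tier_weights(ratio_mid_ext, num_classes=5, mid_indices=(1, 2, 3)):
    """Return class weights with mid classes boosted by ratio_mid_ext, mean weight 1.0.

    Class order is assumed to be:
    extremely negative, negative, neutral, positive, extremely positive.
    """
    raw = [ratio_mid_ext if i in mid_indices else 1.0 for i in range(num_classes)]
    scale = num_classes / sum(raw)  # renormalize so the loss scale stays comparable
    return [w * scale for w in raw]


weights = tier_weights(1.6)
print([round(w, 3) for w in weights])  # [0.735, 1.176, 1.176, 1.176, 0.735]
```

Because of the renormalization, only the ratio between mid and extreme weights changes with `ratio_mid_ext`; the average weight, and therefore the rough magnitude of the loss, stays fixed.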

🔍 **Outcome:**  
- The results did **not show consistent improvements** beyond our fixed-weight run.  
- Best trials stayed around the same macro F1 (~0.86–0.87), without a clear advantage over the non-weighted baseline.  
- Some mid-class F1 gains were offset by drops in other classes, leaving overall performance unchanged.
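
For reference, the mid-class F1 discussed here is a macro F1 restricted to the negative, neutral, and positive classes. The notebook computes it with scikit-learn (`precision_recall_fscore_support` with `labels=MIDS`); the sketch below reproduces the same quantity with the standard library only, on made-up toy predictions. The helper names and the toy data are assumptions for illustration.

```python
MID_CLASSES = (1, 2, 3)  # negative, neutral, positive in the notebook's label order


def class_f1(y_true, y_pred, cls):
    """F1 for one class from true/false positive and false negative counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # 0.0 when the class never appears


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 over the given classes."""
    classes = list(classes)
    return sum(class_f1(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels and predictions (not real model output).
y_true = [0, 1, 1, 2, 2, 3, 3, 4]
y_pred = [0, 1, 2, 2, 2, 3, 1, 4]
print("macro F1 (all classes):", round(macro_f1(y_true, y_pred, range(5)), 3))
print("macro F1 (mid classes):", round(macro_f1(y_true, y_pred, MID_CLASSES), 3))
```

Since the extremes are excluded from the average, a model can raise mid-class F1 while overall macro F1 stays flat if the extreme classes lose a similar amount.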

---

⏳ **Decision**: Because time is limited and the weighted-loss models showed no clear benefit, we will **not include weighted-loss variants in the final ensemble**.  
We will continue with the **standard models**, which gave stronger and more stable results in Ex.4/Ex.5.  


In [None]:
# # =========================
# # ADV DL – Part B: Monolingual baseline (DeBERTa) – Exercise-4 style
# # Custom loop + early stopping + W&B + Optuna; freeze base, unfreeze last k layers
# # Focus: improve mid-class (negative/neutral/positive) via tiered class-weighted loss
# # Uses df_train / df_test with columns: OriginalTweet (str), Sentiment (str)
# # =========================

# import os, math, random, time, json
# from typing import Dict, List, Tuple

# import numpy as np
# import pandas as pd
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report, confusion_matrix
# import torch
# from torch.utils.data import Dataset, DataLoader
# from torch.cuda.amp import autocast, GradScaler
# import torch.nn.functional as F

# # ---- deps ----
# # !pip -q install transformers==4.43.3 optuna==3.6.1 wandb==0.17.5 >/dev/null

# import transformers
# from transformers import (
#     AutoTokenizer, AutoModelForSequenceClassification,
#     DataCollatorWithPadding, get_linear_schedule_with_warmup
# )

# os.environ["TRANSFORMERS_NO_TF"] = "1"
# os.environ["TRANSFORMERS_NO_FLAX"] = "1"

# import optuna
# import wandb

# # -------------------------
# # Constants (no CFG, Optuna-only workflow)
# # -------------------------
# MODEL_NAME = "microsoft/mdeberta-v3-base"
# MAX_LEN = 512
# BATCH_SIZE_DEFAULT = 16
# WARMUP_RATIO_DEFAULT = 0.06
# GRAD_CLIP_DEFAULT = 1.0
# USE_AMP = True
# FIXED_EPOCHS = 8
# FIXED_PATIENCE = 3
# # New W&B project & distinct run base (keeps things separate)
# PROJECT = "adv-dl-p2-deberta-midf1_new_study_2"
# BASE_RUN_NAME = "microsoft/mdeberta-v3-base_full_ex_4_midf1"

# TRIALS = 30
# SEED = 42
# DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# EARLY_ABORT_F1_E1 = 0.20  # if val macro-F1 after epoch 1 is below this → stop the run

# def set_seed(seed=42):
#     random.seed(seed); np.random.seed(seed)
#     torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
#     torch.backends.cudnn.deterministic = True
#     torch.backends.cudnn.benchmark = False

# set_seed(SEED)

# # ---- GPU perf toggles (Windows-safe) ----
# torch.backends.cuda.matmul.allow_tf32 = True
# torch.backends.cudnn.allow_tf32 = True
# try:
#     torch.set_float32_matmul_precision("high")
# except Exception:
#     pass

# # -------------------------
# # Label mapping (5-way sentiment)
# # -------------------------
# CANON = {
#     "extremely negative": "extremely negative",
#     "negative": "negative",
#     "neutral": "neutral",
#     "positive": "positive",
#     "extremely positive": "extremely positive",
# }
# ORDER = ["extremely negative","negative","neutral","positive","extremely positive"]
# LABEL2ID = {lab: i for i, lab in enumerate(ORDER)}
# ID2LABEL = {i: lab for lab, i in LABEL2ID.items()}

# def normalize_label(s: str) -> str:
#     s = str(s).strip().lower()
#     s = s.replace("very negative", "extremely negative")
#     s = s.replace("very positive", "extremely positive")
#     s = s.replace("extreme negative", "extremely negative")
#     s = s.replace("extreme positive", "extremely positive")
#     return CANON.get(s, s)

# # -------------------------
# # Expect df_train, df_test in memory
# # -------------------------
# assert "OriginalTweet" in df_train.columns and "Sentiment" in df_train.columns, "df_train missing required columns"
# assert "OriginalTweet" in df_test.columns and "Sentiment" in df_test.columns, "df_test missing required columns"

# def prep_df(df: pd.DataFrame) -> pd.DataFrame:
#     df = df.copy()
#     df = df.dropna(subset=["OriginalTweet", "Sentiment"])
#     df["text"] = df["OriginalTweet"].astype(str).str.strip()
#     df["label_name"] = df["Sentiment"].apply(normalize_label)
#     df = df[df["label_name"].isin(ORDER)].reset_index(drop=True)
#     df["label"] = df["label_name"].map(LABEL2ID)
#     return df[["text", "label", "label_name"]]

# dftrain_ = prep_df(df_train)
# dftest_  = prep_df(df_test)

# train_df, val_df = train_test_split(
#     dftrain_, test_size=0.1, stratify=dftrain_["label"], random_state=SEED
# )
# print(f"Train/Val/Test sizes: {len(train_df)}/{len(val_df)}/{len(dftest_)}")

# # -------------------------
# # Dataset & Collator
# # -------------------------
# class TweetDataset(Dataset):
#     def __init__(self, df: pd.DataFrame, tokenizer: transformers.PreTrainedTokenizerBase, max_len: int):
#         self.texts = df["text"].tolist()
#         self.labels = df["label"].tolist()
#         self.tok = tokenizer
#         self.max_len = max_len
#     def __len__(self): return len(self.texts)
#     def __getitem__(self, idx):
#         enc = self.tok(self.texts[idx], truncation=True, max_length=self.max_len, padding=False)
#         enc["labels"] = self.labels[idx]
#         return {k: torch.tensor(v) for k, v in enc.items()}

# tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
# train_ds = TweetDataset(train_df, tokenizer, MAX_LEN)
# val_ds   = TweetDataset(val_df, tokenizer, MAX_LEN)
# test_ds  = TweetDataset(dftest_, tokenizer, MAX_LEN)

# # -------------------------
# # Model & Freeze/Unfreeze strategy
# # -------------------------
# def build_model(num_unfreeze_last_layers: int = 4, dropout: float = 0.1):
#     # Dropout is passed at load time: the Dropout modules are built from the config in __init__,
#     # so editing model.config after loading would not change them.
#     model = AutoModelForSequenceClassification.from_pretrained(
#         MODEL_NAME, num_labels=len(ORDER), id2label=ID2LABEL, label2id=LABEL2ID,
#         hidden_dropout_prob=float(dropout), attention_probs_dropout_prob=float(dropout),
#     )

#     base = getattr(model, "roberta", None) or getattr(model, "bert", None) or getattr(model, "deberta", None)
#     if base is not None:
#         for p in base.parameters(): p.requires_grad = False
#         if hasattr(base, "encoder") and hasattr(base.encoder, "layer"):
#             k = int(num_unfreeze_last_layers)
#             if k > 0:
#                 for layer in base.encoder.layer[-k:]:
#                     for p in layer.parameters(): p.requires_grad = True
#     for p in model.classifier.parameters(): p.requires_grad = True
#     return model.to(DEVICE)

# # -------------------------
# # Train / Eval utilities
# # -------------------------
# def get_optimizer_scheduler(model, num_training_steps: int, lr: float, weight_decay: float, warmup_ratio: float):
#     no_decay = ["bias", "LayerNorm.weight"]
#     optimizer_grouped_parameters = [
#         {"params": [p for n, p in model.named_parameters() if p.requires_grad and not any(nd in n for nd in no_decay)], "weight_decay": weight_decay},
#         {"params": [p for n, p in model.named_parameters() if p.requires_grad and any(nd in n for nd in no_decay)],  "weight_decay": 0.0},
#     ]
#     # fused AdamW on CUDA if available
#     try:
#         optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=lr, weight_decay=weight_decay, fused=(DEVICE=="cuda"))
#     except TypeError:
#         optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=lr, weight_decay=weight_decay)
#     num_warmup = int(num_training_steps * warmup_ratio)
#     scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup, num_training_steps=num_training_steps)
#     return optimizer, scheduler

# # indices and mid set
# IDX_EXT_NEG = LABEL2ID["extremely negative"]
# IDX_NEG     = LABEL2ID["negative"]
# IDX_NEU     = LABEL2ID["neutral"]
# IDX_POS     = LABEL2ID["positive"]
# IDX_EXT_POS = LABEL2ID["extremely positive"]
# MIDS = [IDX_NEG, IDX_NEU, IDX_POS]

# def make_tier_weights(ratio_mid_ext: float) -> torch.Tensor:
#     """
#     Assigns the same weight to both extremes, and a higher weight to mid classes.
#     Renormalizes so the mean weight is ≈ 1.0 (keeps loss scale stable).
#     """
#     K = len(ORDER)
#     w = np.ones(K, dtype=np.float32)                        # extremes start at 1.0
#     w[[IDX_NEG, IDX_NEU, IDX_POS]] = float(ratio_mid_ext)   # mid classes boosted
#     w = w * (K / w.sum())                                   # mean ~ 1.0
#     return torch.tensor(w, dtype=torch.float, device=DEVICE)

# def evaluate(model, loader) -> Dict[str, float]:
#     model.eval()
#     preds, labels = [], []
#     with torch.no_grad():
#         for batch in loader:
#             batch = {k: v.to(DEVICE, non_blocking=True) for k, v in batch.items()}
#             with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),
#                           enabled=(DEVICE == "cuda" and USE_AMP)):
#                 logits = model(**batch).logits
#             preds.extend(torch.argmax(logits, dim=-1).detach().cpu().tolist())
#             labels.extend(batch["labels"].detach().cpu().tolist())
#     acc = accuracy_score(labels, preds)
#     p, r, f1, _ = precision_recall_fscore_support(labels, preds, average="macro", zero_division=0)
#     # mid-class F1
#     p_mid, r_mid, f1_mid, _ = precision_recall_fscore_support(labels, preds, labels=MIDS, average="macro", zero_division=0)
#     return {"acc": acc, "precision": p, "recall": r, "f1": f1, "f1_mid": f1_mid}

# def make_loaders(batch_size: int):
#     collate_fn = DataCollatorWithPadding(tokenizer=tokenizer, pad_to_multiple_of=8)
#     train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True,  collate_fn=collate_fn, num_workers=0, pin_memory=True, persistent_workers=False)
#     val_loader   = DataLoader(val_ds,   batch_size=batch_size, shuffle=False, collate_fn=collate_fn, num_workers=0, pin_memory=True, persistent_workers=False)
#     test_loader  = DataLoader(test_ds,  batch_size=batch_size, shuffle=False, collate_fn=collate_fn, num_workers=0, pin_memory=True, persistent_workers=False)
#     return train_loader, val_loader, test_loader

# def train_one_run(hp: Dict) -> Tuple[str, Dict[str, float]]:
#     """
#     hp keys: run_name, num_unfreeze_last_layers, lr, weight_decay, epochs, patience, trial_number,
#              ratio_mid_ext, label_smoothing, warmup_ratio, grad_clip, dropout, batch_size
#     """
#     run_name        = hp["run_name"]
#     num_unfreeze    = int(hp["num_unfreeze_last_layers"])
#     lr              = float(hp["lr"])
#     wd              = float(hp["weight_decay"])
#     epochs          = int(hp.get("epochs",   FIXED_EPOCHS))
#     patience        = int(hp.get("patience", FIXED_PATIENCE))
#     ratio_mid_ext   = float(hp.get("ratio_mid_ext", 1.6))
#     label_smoothing = float(hp.get("label_smoothing", 0.05))
#     warmup_ratio    = float(hp.get("warmup_ratio", WARMUP_RATIO_DEFAULT))
#     grad_clip       = float(hp.get("grad_clip", GRAD_CLIP_DEFAULT))
#     dropout         = float(hp.get("dropout", 0.1))
#     bs              = int(hp.get("batch_size", BATCH_SIZE_DEFAULT))

#     # model + loaders
#     model = build_model(num_unfreeze, dropout=dropout)
#     train_loader, val_loader, _ = make_loaders(bs)

#     total_steps = int(math.ceil(len(train_loader) * epochs))
#     optimizer, scheduler = get_optimizer_scheduler(model, total_steps, lr, wd, warmup_ratio)

#     scaler = GradScaler(enabled=(DEVICE == "cuda" and USE_AMP))
#     best_metric = -1.0
#     no_improve = 0

#     # save to a different folder + name to avoid collisions
#     safe_run_name = run_name.replace("/", "__").replace("\\", "__")
#     ckpt_dir = "checkpoints_midf1"
#     os.makedirs(ckpt_dir, exist_ok=True)
#     best_path = os.path.join(ckpt_dir, f"best_midf1_{safe_run_name}.pt")

#     wandb_run = wandb.init(
#         project=PROJECT,
#         name=run_name,
#         config={
#             "model": MODEL_NAME,
#             "max_len": MAX_LEN,
#             "batch_size": bs,
#             "epochs": epochs,
#             "lr": lr,
#             "weight_decay": wd,
#             "warmup_ratio": warmup_ratio,
#             "grad_clip": grad_clip,
#             "dropout": dropout,
#             "ratio_mid_ext": ratio_mid_ext,
#             "label_smoothing": label_smoothing,
#             "num_unfreeze_last_layers": num_unfreeze,
#             "trial_number": hp.get("trial_number", -1),
#         },
#         reinit=True,
#     )

#     # nicer W&B charts
#     wandb.define_metric("epoch")
#     wandb.define_metric("step")
#     wandb.define_metric("train/*", step_metric="step")
#     wandb.define_metric("val/*",   step_metric="epoch")

#     # print + log trainable params
#     total_params     = sum(p.numel() for p in model.parameters())
#     trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
#     print(f"Trainable params: {trainable_params:,} / {total_params:,} "
#           f"({100.0*trainable_params/total_params:.2f}%) ; unfreeze_last_k={num_unfreeze}")
#     wandb.log({"params/total": total_params,
#                "params/trainable": trainable_params,
#                "params/ratio": trainable_params/max(1,total_params)}, step=0)

#     # class weights for this run
#     class_weights = make_tier_weights(ratio_mid_ext)

#     for epoch in range(epochs):
#         model.train()
#         t0 = time.time()
#         running_loss = 0.0

#         for step, batch in enumerate(train_loader):
#             batch = {k: v.to(DEVICE, non_blocking=True) for k, v in batch.items()}
#             labels = batch.pop("labels")  # we compute weighted loss ourselves

#             optimizer.zero_grad(set_to_none=True)
#             with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),
#                           enabled=(DEVICE == "cuda" and USE_AMP)):
#                 outputs = model(**batch)
#                 logits = outputs.logits
#                 try:
#                     loss = F.cross_entropy(logits, labels, weight=class_weights, label_smoothing=label_smoothing)
#                 except TypeError:
#                     loss = F.cross_entropy(logits, labels, weight=class_weights)

#             scaler.scale(loss).backward()
#             if grad_clip is not None:
#                 scaler.unscale_(optimizer)
#                 torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
#             scaler.step(optimizer); scaler.update(); scheduler.step()
#             running_loss += loss.item()

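#             if step % 20 == 0:
#                 # log with the global step so the x-axis does not reset every epoch
#                 wandb.log({"train/loss": loss.item(),
#                            "step": (epoch * len(train_loader)) + (step + 1),
#                            "epoch": epoch + 1})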

#             # periodic console + throughput log (about 10x per epoch)
#             if step % max(1, len(train_loader)//10) == 0 or step == 1:
#                 avg_loss = running_loss / max(1, (step + 1))
#                 elapsed  = time.time() - t0
#                 items    = (step + 1) * bs
#                 itps     = items / max(elapsed, 1e-6)
#                 print(f"[e{epoch+1} b{step+1}/{len(train_loader)}] loss={loss.item():.4f} avg={avg_loss:.4f} it/s={itps:.1f}")
#                 wandb.log({"train/avg_loss_so_far": avg_loss,
#                            "train/items_per_sec": itps,
#                            "step": (epoch * len(train_loader)) + (step + 1),
#                            "epoch": epoch + 1})

#         # epoch-end validation
#         val_metrics = evaluate(model, val_loader)
#         elapsed = time.time() - t0

#         epoch_loss = running_loss / max(1, len(train_loader))
#         current_lr = scheduler.get_last_lr()[0]

#         wandb.log({
#             "train/epoch_loss": epoch_loss,
#             "val/acc": val_metrics["acc"],
#             "val/precision": val_metrics["precision"],
#             "val/recall": val_metrics["recall"],
#             "val/f1": val_metrics["f1"],
#             "val/mid_f1": val_metrics["f1_mid"],   # extra log
#             "lr": current_lr,
#             "time/epoch_sec": elapsed,
#             "epoch": epoch + 1,
#         })
#         # --- Early abort if epoch 1 macro-F1 is too low ---
#         if epoch == 0 and val_metrics["f1"] < EARLY_ABORT_F1_E1:
#             wandb.log({"early_stop/epoch1_low_f1": val_metrics["f1"], "epoch": epoch + 1})
#             print(f"[EARLY-EXIT] epoch=1 | val_f1={val_metrics['f1']:.4f} < {EARLY_ABORT_F1_E1:.2f} → stopping trial")
#             # ensure a checkpoint exists for downstream code
#             if not os.path.exists(best_path):
#                 torch.save(model.state_dict(), best_path)
#                 wandb_run.summary["best_checkpoint_path"] = best_path
#                 wandb_run.summary["best_val_f1"] = val_metrics["f1"]
#                 wandb_run.summary["best_val_mid_f1"] = val_metrics["f1_mid"]
#             break

#         # Early stopping on mid-class F1 (prints stay the same)
#         target_metric = val_metrics["f1_mid"]
#         if target_metric > best_metric:
#             best_metric = target_metric
#             torch.save(model.state_dict(), best_path)
#             no_improve = 0
#             wandb_run.summary["best_val_f1"] = best_metric   # kept same key for compatibility
#             wandb_run.summary["best_val_mid_f1"] = best_metric
#             wandb_run.summary["best_checkpoint_path"] = best_path
#             wandb.log({"val/best_f1_so_far": best_metric, "val/best_epoch": epoch + 1})
#         else:
#             no_improve += 1
#             if no_improve >= patience:
#                 print(f"Early stopping at epoch {epoch+1}")
#                 break

#         # console print line unchanged
#         print(f"Epoch {epoch+1}/{epochs} | "
#               f"loss={epoch_loss:.4f} | "
#               f"val_acc={val_metrics['acc']:.4f} | val_f1={val_metrics['f1']:.4f} | time={elapsed:.1f}s")

#     # Load best and return path + metrics on val for reference
#     model.load_state_dict(torch.load(best_path, map_location=DEVICE))
#     # rebuild val loader (for safety if using different bs later)
#     _, val_loader, _ = make_loaders(bs)
#     final_val = evaluate(model, val_loader)

#     # store final val in W&B summary for quick sorting; this has to happen before
#     # wandb.finish(), because wandb.run is None once the run is closed
#     wandb_run.summary["final_val_acc"] = final_val["acc"]
#     wandb_run.summary["final_val_precision"] = final_val["precision"]
#     wandb_run.summary["final_val_recall"] = final_val["recall"]
#     wandb_run.summary["final_val_f1"] = final_val["f1"]
#     wandb_run.summary["final_val_mid_f1"] = final_val["f1_mid"]
#     wandb.finish()

#     return best_path, final_val

# # -------------------------
# # Optuna hyperparameter tuning (ALWAYS ON)
# # -------------------------
# FIXED_EPOCHS = 8
# FIXED_PATIENCE = 3

# def objective(trial: optuna.trial.Trial):
#     params = {
#         "run_name": f"{BASE_RUN_NAME}_optuna_trial_{trial.number}",
#         "num_unfreeze_last_layers": trial.suggest_int("num_unfreeze_last_layers", 8, 12),
#         "lr": trial.suggest_float("lr", 1e-5, 1e-4, log=True),
#         "weight_decay": trial.suggest_float("weight_decay", 7e-8, 1e-5, log=True),
#         "batch_size": trial.suggest_categorical("batch_size", [4,8, 16, 32]),
#         "warmup_ratio": trial.suggest_float("warmup_ratio", 0.02, 0.12),
#         "grad_clip": trial.suggest_float("grad_clip", 0.5, 1.5),
#         "dropout": trial.suggest_float("dropout", 0.0, 0.2),
#         "ratio_mid_ext": trial.suggest_float("ratio_mid_ext", 1.2, 2.5),
#         "label_smoothing": trial.suggest_float("label_smoothing", 0.00, 0.12),
#         "epochs": FIXED_EPOCHS,
#         "patience": FIXED_PATIENCE,
#         "trial_number": trial.number,
#     }
#     path, val_metrics = train_one_run(params)
#     print(f"[Trial {trial.number}] f1={val_metrics['f1']:.4f} | "
#           f"unfreeze_k={params['num_unfreeze_last_layers']} lr={params['lr']:.2e} "
#           f"wd={params['weight_decay']:.1e} suggested_bs={params['batch_size']}")
#     trial.report(val_metrics["f1"], step=1)  # study objective stays macro F1
#     return val_metrics["f1"]

# study = optuna.create_study(direction="maximize")
# study.optimize(objective, n_trials=TRIALS, show_progress_bar=True)
# print("Best trial:", study.best_trial.number, "F1:", study.best_value)
# best_params = {"run_name": f"{BASE_RUN_NAME}_best_optuna", **study.best_trial.params}

# # persist best params to a separate folder/name
# os.makedirs("checkpoints_midf1", exist_ok=True)
# with open(os.path.join("checkpoints_midf1", "best_hparams_optuna_midf1_new_code.json"), "w") as f:
#     json.dump({**best_params, "epochs": FIXED_EPOCHS, "patience": FIXED_PATIENCE}, f, indent=2)

# # (Optional) retrain once on the best params to get a clean checkpoint:
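# # (study.best_trial.params holds only the suggested values, so epochs/patience are passed explicitly)
# # best_ckpt, _ = train_one_run({**best_params, "epochs": FIXED_EPOCHS, "patience": FIXED_PATIENCE})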
# # print("Best checkpoint saved to:", best_ckpt)
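
The search space above samples `lr` and `weight_decay` with `log=True`, meaning the range is explored on a logarithmic scale rather than a linear one. The sketch below illustrates that scale using only the standard library. It is not Optuna's sampler (the default sampler in `optuna.create_study` fits a model to past trials rather than drawing uniformly), and `sample_log_uniform` is a hypothetical helper introduced here for illustration.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Hypothetical helper: draw a value uniformly in log space between low and high."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)

# Same bounds as the "lr" entry in the search space above.
lr_samples = [sample_log_uniform(1e-5, 1e-4, rng) for _ in range(10_000)]

# On a log scale the midpoint of [1e-5, 1e-4] is the geometric mean (~3.16e-5),
# so roughly half of the draws should fall below it.
midpoint = math.sqrt(1e-5 * 1e-4)
frac_below = sum(s < midpoint for s in lr_samples) / len(lr_samples)

print(f"geometric midpoint: {midpoint:.2e}")
print(f"fraction of samples below midpoint: {frac_below:.3f}")
```

With linear sampling over the same range, only about 24% of the draws would fall below 3.16e-5, so small learning rates would be explored far less often.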


Train/Val/Test sizes: 37039/4116/3798


[I 2025-08-18 09:53:20,497] A new study created in memory with name: no-name-c77d1994-8070-4841-86c9-c2ee6d7644f4
  0%|          | 0/30 [00:00<?, ?it/s]
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/mdeberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  scaler = GradScaler(enabled=(DEVICE == "cuda" and USE_AMP))
[34m[1mwandb[0m: [32m[41mERROR[0m Failed to detect the name of this notebook. You can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33madishalit1[0m ([33madishalit1-tel-aviv-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Trainable params: 78,561,029 / 278,813,189 (28.18%) ; unfreeze_last_k=11


  with autocast(dtype=(torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16),


[e1 b1/1158] loss=1.6729 avg=1.6729 it/s=69.2
[e1 b2/1158] loss=1.6312 avg=1.6521 it/s=108.5
[e1 b116/1158] loss=1.5415 avg=1.6206 it/s=530.5
[e1 b231/1158] loss=1.3327 avg=1.5464 it/s=599.9
[e1 b346/1158] loss=1.3846 avg=1.5029 it/s=627.3
[e1 b461/1158] loss=1.1854 avg=1.4677 it/s=639.8
[e1 b576/1158] loss=1.3628 avg=1.4338 it/s=650.0
[e1 b691/1158] loss=1.1391 avg=1.3957 it/s=654.0
