<center><br><br>
<font size=6>🎓 <b>Advanced Deep Learning - NLP Final Project</b></font><br>
<font size=6>⚖️  <b>Training - microsoft/mdeberta-v3-base EX5</b></font><br>
<font size=5>👥 <b>Group W</b></font><br><br>
<b>Adi Shalit</b>, ID: <code>206628885</code><br>
<b>Gal Gussarsky</b>, ID: <code>206453540</code><br><br>
<font size=4>📘 Course ID: <code>05714184</code></font><br>
<font size=4>📅 Spring 2025</font>
<br><br>
<hr style="width:60%; border:1px solid gray;"></center>


# 📑 Table of Contents

- [Training](#Training)
- [ Load best Model & Test](#Load-Best-Model)




In [None]:
# # === Compatible versions ===
# !pip install --upgrade --force-reinstall --no-cache-dir \
#   transformers==4.43.3 \
#   accelerate==0.30.1 \
#   datasets==2.20.0 \
#   evaluate==0.4.2 \
#   optuna==3.6.1 \
#   wandb==0.17.5 \
#   fsspec==2024.5.0 \
#   gcsfs==2024.5.0


In [None]:
# !pip install transformers==4.44.2 accelerate==0.34.2

# from transformers import Trainer
# print("✅ Trainer import successful")


## Load Dataset & Imports

In [7]:


# # -------------------------
# # Colab Drive setup
# # -------------------------
# drive.mount("/content/drive")
# DRIVE_OUT_DIR = "/content/drive/MyDrive/adv_dl_models2"
DRIVE_OUT_DIR = "adv_dl_models2"
os.makedirs(DRIVE_OUT_DIR, exist_ok=True)

train_path = "Corona_NLP_train_cleaned_translated.csv"
test_path  = "Corona_NLP_test_cleaned_translated.csv"

df_train = pd.read_csv(train_path)
df_test  = pd.read_csv(test_path)

print(df_train.shape, df_test.shape)
df_train.head()


(41157, 7) (3798, 7)


Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment,DetectedLang
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv and and,Neutral,en
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive,en
2,3801,48753,Vagabonds,16-03-2020,"covid Australia: Woolworths to give elderly, d...",Positive,en
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive,en
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the covi...",Extremely Negative,en


# 📊 RoBERTa – Hyperparameter Exploration (Trainer + Optuna + W&B)

In this stage, we continue our experimentation with **RoBERTa-base** for 5-class sentiment classification.  
Based on our previous findings, **a higher number of unfrozen layers** tends to improve performance.  

We now design a structured **Optuna hyperparameter search** to explore:
- Learning rate (`1e-6 → 3e-5`)  
- Weight decay (`1e-6 → 1e-2`)  
- Batch sizes (`[4, 8, 16, 32]`)  
- Number of unfrozen layers (`5 → 12`)  

Training is managed through the **🤗 HuggingFace Trainer API**, with:  
- **W&B logging** for live monitoring  
- **Early stopping** (`patience=4`)  
- **Fixed epochs = 12**  
- Evaluation metric = **Macro F1**  

This approach allows us to balance **model expressiveness** (more unfrozen layers) with **computational efficiency**, while ensuring reproducibility through deterministic seeding.  


# Training

In [3]:
# =========================
# ADV DL – Part B: RoBERTa (Trainer API + Optuna + W&B)
# Freeze base, unfreeze last k; 5-class sentiment
# Expects df_train/df_test already loaded with columns: ["OriginalTweet","Sentiment"]
# =========================

import os, math, random, time, json
from typing import Dict, Any
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report, confusion_matrix

import torch
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    DataCollatorWithPadding, TrainingArguments, Trainer,
    EarlyStoppingCallback
)
import optuna
import wandb
from datasets import Dataset, DatasetDict
import numpy as np
import datasets

# Patch datasets' arrow → numpy conversion to be NumPy 2.x safe
old_arrow_array_to_numpy = datasets.formatting.formatting.NumpyArrowExtractor._arrow_array_to_numpy

def safe_arrow_array_to_numpy(self, pa_array):
    return np.asarray(pa_array)  # allows copy when needed

datasets.formatting.formatting.NumpyArrowExtractor._arrow_array_to_numpy = safe_arrow_array_to_numpy


# -------------------------
# Globals
# -------------------------
MODEL_NAME = "roberta-base"
MAX_LEN = 512
PROJECT = "adv-dl2-p1"
BASE_RUN_NAME = "roberta-base_trainer"
TRIALS = 12
SEED = 42
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def set_seed(seed=42):
    random.seed(seed); np.random.seed(seed)
    torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(SEED)

# -------------------------
# Labels (5 classes)
# -------------------------
CANON = {
    "extremely negative": "extremely negative",
    "negative": "negative",
    "neutral": "neutral",
    "positive": "positive",
    "extremely positive": "extremely positive",
}
ORDER = ["extremely negative","negative","neutral","positive","extremely positive"]
LABEL2ID = {lab: i for i, lab in enumerate(ORDER)}
ID2LABEL = {i: lab for lab, i in LABEL2ID.items()}

def normalize_label(s: str) -> str:
    s = str(s).strip().lower()
    s = s.replace("very negative", "extremely negative")
    s = s.replace("very positive", "extremely positive")
    s = s.replace("extreme negative", "extremely negative")
    s = s.replace("extreme positive", "extremely positive")
    return CANON.get(s, s)

# -------------------------
# Expect df_train/df_test preloaded
# -------------------------
assert "OriginalTweet" in df_train.columns and "Sentiment" in df_train.columns, "df_train missing required columns"
assert "OriginalTweet" in df_test.columns and "Sentiment" in df_test.columns, "df_test missing required columns"

def prep_df(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna(subset=["OriginalTweet", "Sentiment"]).copy()
    df["text"] = df["OriginalTweet"].astype(str).str.strip()
    df["label_name"] = df["Sentiment"].apply(normalize_label)
    df = df[df["label_name"].isin(ORDER)].reset_index(drop=True)
    df["label"] = df["label_name"].map(LABEL2ID)
    return df[["text","label","label_name"]]

dftrain_ = prep_df(df_train)
dftest_  = prep_df(df_test)

# Stratified split
from sklearn.model_selection import train_test_split
train_df, val_df = train_test_split(
    dftrain_, test_size=0.1, stratify=dftrain_["label"], random_state=SEED
)
print(f"Train/Val/Test sizes: {len(train_df)}/{len(val_df)}/{len(dftest_)}")

# -------------------------
# HF Datasets + Tokenizer
# -------------------------
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

def tokenize_batch(batch):
    return tokenizer(batch["text"], truncation=True, max_length=MAX_LEN)

ds_train = Dataset.from_pandas(train_df.reset_index(drop=True))
ds_val   = Dataset.from_pandas(val_df.reset_index(drop=True))
ds_test  = Dataset.from_pandas(dftest_.reset_index(drop=True))

ds = DatasetDict({"train": ds_train, "validation": ds_val, "test": ds_test})
ds = ds.remove_columns([c for c in ds["train"].column_names if c not in ["text","label"]])
ds_tok = ds.map(tokenize_batch, batched=True, remove_columns=["text"])
ds_tok = ds_tok.rename_column("label", "labels")   # Trainer expects 'labels'
ds_tok.set_format("torch")

collator = DataCollatorWithPadding(tokenizer=tokenizer)

# -------------------------
# Model + freezing policy
# -------------------------
def build_model(unfreeze_last_k: int = 4) -> AutoModelForSequenceClassification:
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=len(ORDER), id2label=ID2LABEL, label2id=LABEL2ID
    )
    base = getattr(model, "roberta", None) or getattr(model, "bert", None)
    if base is not None:
        for p in base.parameters(): p.requires_grad = False  # freeze all
        if hasattr(base, "encoder") and hasattr(base.encoder, "layer") and unfreeze_last_k > 0:
            for layer in base.encoder.layer[-unfreeze_last_k:]:
                for p in layer.parameters(): p.requires_grad = True
    if hasattr(model, "classifier"):
        for p in model.classifier.parameters(): p.requires_grad = True
    return model

# -------------------------
# Metrics (pure sklearn to avoid evaluate/datasets mismatch)
# -------------------------
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, preds)
    p, r, f1, _ = precision_recall_fscore_support(labels, preds, average="macro", zero_division=0)
    return {
        "accuracy": acc,
        "precision_macro": p,
        "recall_macro": r,
        "f1_macro": f1,
    }

# -------------------------
# W&B helper
# -------------------------
def log_conf_mat_to_wandb(y_true, y_pred, prefix="val"):
    table = wandb.plot.confusion_matrix(
        y_true=y_true,
        preds=y_pred,
        class_names=[ID2LABEL[i] for i in range(len(ID2LABEL))]
    )
    wandb.log({f"{prefix}/confusion_matrix": table})

# -------------------------
# Optuna objective (wraps Trainer)
# -------------------------
FIXED_EPOCHS = 12
PATIENCE = 4
LOG_STEPS = 20

def make_trainer(params: Dict[str, Any], output_dir: str, run_name: str) -> Trainer:
    model = build_model(unfreeze_last_k=int(params["unfreeze_last_k"]))
    args = TrainingArguments(
        output_dir=output_dir,
        evaluation_strategy="epoch",         # canonical name
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="eval_f1_macro",  # Trainer prefixes eval metrics
        greater_is_better=True,

        num_train_epochs=FIXED_EPOCHS,
        learning_rate=float(params["lr"]),
        weight_decay=float(params["weight_decay"]),
        warmup_ratio=float(params["warmup_ratio"]),
        per_device_train_batch_size=int(params["per_device_train_batch_size"]),
        per_device_eval_batch_size=int(params["per_device_eval_batch_size"]),
        gradient_accumulation_steps=int(params["gradient_accumulation_steps"]),
        fp16=torch.cuda.is_available(),

        logging_steps=LOG_STEPS,
        report_to=["wandb"],
        run_name=run_name,
        seed=SEED,
        save_total_limit=2,
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=ds_tok["train"],
        eval_dataset=ds_tok["validation"],
        tokenizer=tokenizer,
        data_collator=collator,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=PATIENCE)]
    )
    return trainer
def objective(trial: optuna.trial.Trial) -> float:
    params = {
        "lr": trial.suggest_float("lr", 1e-6, 3e-5, log=True),  # safer upper bound for more layers
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True),  # updated upper bound
        "warmup_ratio": 0.06,  # fixed
        "batch_size": trial.suggest_categorical("batch_size", [4, 8, 16, 32]),  # single batch size for both train & eval
        "gradient_accumulation_steps": 1,  # fixed
        "unfreeze_last_k": trial.suggest_int("unfreeze_last_k", 5, 12),  # between 5 and 12 layers unfrozen
    }

    run_name = f"{BASE_RUN_NAME}_trial_{trial.number}"
    output_dir = os.path.join(DRIVE_OUT_DIR, run_name)

    wandb.init(project=PROJECT, name=run_name, config=params, reinit=True)

    # pass the same batch_size for both train & eval
    trainer = make_trainer(
        {
            **params,
            "per_device_train_batch_size": params["batch_size"],
            "per_device_eval_batch_size": params["batch_size"],
        },
        output_dir,
        run_name
    )

    _ = trainer.train()  # early stopping inside

    # Evaluate (best model auto-loaded)
    metrics = trainer.evaluate()
    wandb.log({
        "val/accuracy":        metrics.get("eval_accuracy"),
        "val/precision_macro": metrics.get("eval_precision_macro"),
        "val/recall_macro":    metrics.get("eval_recall_macro"),
        "val/f1_macro":        metrics.get("eval_f1_macro"),
    })

    # Confusion matrix on validation
    preds = trainer.predict(ds_tok["validation"])
    y_true = preds.label_ids
    y_pred = np.argmax(preds.predictions, axis=-1)
    log_conf_mat_to_wandb(y_true, y_pred, prefix="val")

    # Save best checkpoint path to W&B summary
    best_dir = trainer.state.best_model_checkpoint
    wandb.summary["best_checkpoint_path"] = best_dir if best_dir else output_dir

    wandb.finish()

    # Return the metric Optuna maximizes
    return float(metrics.get("eval_f1_macro", 0.0))
# -------------------------
# Optuna search (RESUMABLE)
# -------------------------
storage_url = f"sqlite:///{os.path.join(DRIVE_OUT_DIR, f'{BASE_RUN_NAME}_optuna.db')}"
study_name = f"{BASE_RUN_NAME}_study"

try:
    # load_study only needs study_name and storage
    study = optuna.load_study(study_name=study_name, storage=storage_url)
    print(f"Loaded existing study: {study_name}")
except KeyError:
    study = optuna.create_study(
        study_name=study_name,
        storage=storage_url,
        direction="maximize"
    )
    print(f"Created new study: {study_name}")

completed = len([t for t in study.get_trials(deepcopy=False) if t.state == optuna.trial.TrialState.COMPLETE])
remaining = max(0, TRIALS - completed)
print(f"Trials completed: {completed} / {TRIALS} → remaining: {remaining}")

if remaining > 0:
    study.optimize(objective, n_trials=remaining, show_progress_bar=True)
else:
    print("Requested number of trials already completed; skipping optimization.")

print("Best trial:", study.best_trial.number, "F1_macro:", study.best_value)

best_params = study.best_params
best_params["per_device_train_batch_size"] = best_params["batch_size"]
best_params["per_device_eval_batch_size"] = best_params["batch_size"]



# -------------------------
# Retrain best config cleanly
# -------------------------
best_run_name = f"{BASE_RUN_NAME}_best"
best_out_dir = os.path.join(DRIVE_OUT_DIR, best_run_name)
wandb.init(project=PROJECT, name=best_run_name, config=best_params, reinit=True)

trainer = make_trainer(best_params, best_out_dir, best_run_name)
trainer.train()

best_ckpt_dir = trainer.state.best_model_checkpoint or best_out_dir
print("Best checkpoint directory:", best_ckpt_dir)
wandb.summary["best_checkpoint_path"] = best_ckpt_dir

# -------------------------
# # Final TEST evaluation
# # -------------------------
# test_metrics = trainer.evaluate(ds_tok["test"])

# preds_test = trainer.predict(ds_tok["test"])
# y_true_test = preds_test.label_ids
# y_pred_test = np.argmax(preds_test.predictions, axis=-1)

# acc = accuracy_score(y_true_test, y_pred_test)
# p, r, f1, _ = precision_recall_fscore_support(y_true_test, y_pred_test, average="macro", zero_division=0)
# print(f"\nTEST | acc={acc:.4f} | f1_macro={f1:.4f} | precision_macro={p:.4f} | recall_macro={r:.4f}\n")

# print("Per-class report:")
# print(classification_report(
#     y_true_test, y_pred_test,
#     target_names=[ID2LABEL[i] for i in range(len(ORDER))],
#     zero_division=0
# ))

# # Log to W&B
# report = classification_report(
#     y_true_test, y_pred_test,
#     target_names=[ID2LABEL[i] for i in range(len(ORDER))],
#     zero_division=0, output_dict=True
# )

# payload = {
#     "test/accuracy": acc,
#     "test/precision_macro": p,
#     "test/recall_macro": r,
#     "test/f1_macro": f1,
# }
# for cls_name in ORDER:
#     if cls_name in report:
#         payload[f"test/{cls_name}/precision"] = report[cls_name]["precision"]
#         payload[f"test/{cls_name}/recall"]    = report[cls_name]["recall"]
#         payload[f"test/{cls_name}/f1"]        = report[cls_name]["f1-score"]

# wandb.log(payload)
# table = wandb.plot.confusion_matrix(
#     y_true=y_true_test,
#     preds=y_pred_test,
#     class_names=[ID2LABEL[i] for i in range(len(ID2LABEL))]
# )
# wandb.log({"test/confusion_matrix": table})
# wandb.finish()

# # Save the final best model weights to Drive
# final_path = os.path.join(DRIVE_OUT_DIR, f"best_{BASE_RUN_NAME}_trainer.pt")
# torch.save(trainer.model.state_dict(), final_path)
# print("Saved best model state_dict to:", final_path)


Train/Val/Test sizes: 37039/4116/3798


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]



Map:   0%|          | 0/37039 [00:00<?, ? examples/s]

Map:   0%|          | 0/4116 [00:00<?, ? examples/s]

Map:   0%|          | 0/3798 [00:00<?, ? examples/s]

Loaded existing study: roberta-base_trainer_study
Trials completed: 0 / 12 → remaining: 12


  0%|          | 0/12 [00:00<?, ?it/s]

<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision Macro,Recall Macro,F1 Macro
1,0.972,1.078238,0.554665,0.566122,0.569002,0.566863
2,0.9244,0.950611,0.608844,0.630563,0.628744,0.621589
3,0.8799,0.907957,0.646501,0.666514,0.658884,0.659155
4,0.798,0.852881,0.680272,0.692792,0.694896,0.690852
5,0.8213,0.828949,0.686346,0.69731,0.701441,0.6961
6,0.5459,0.829353,0.702381,0.715731,0.710187,0.711925
7,0.6706,0.861672,0.693878,0.706778,0.712204,0.703988
8,0.6512,0.857896,0.68999,0.697104,0.708709,0.699799
9,0.5115,0.854962,0.705782,0.716467,0.71728,0.714701
10,0.5518,0.887606,0.701409,0.70737,0.7198,0.70946


VBox(children=(Label(value='0.005 MB of 0.005 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/accuracy,▁▃▅▇▇▇▇▇█▇███
eval/f1_macro,▁▃▅▇▇▇▇▇█▇███
eval/loss,█▄▃▂▁▁▂▂▂▃▂▂▂
eval/precision_macro,▁▄▆▇▇█▇▇█▇███
eval/recall_macro,▁▄▅▇▇▇▇▇█████
eval/runtime,▃▃▃█▅▃▄▃▁▃▃▄▆
eval/samples_per_second,▆▆▆▁▄▆▅▆█▆▆▄▃
eval/steps_per_second,▆▆▆▁▄▆▅▆█▆▆▄▃
test/accuracy,▁
test/f1_macro,▁

0,1
best_checkpoint_path,/content/drive/MyDri...
eval/accuracy,0.71404
eval/f1_macro,0.72309
eval/loss,0.86869
eval/precision_macro,0.72223
eval/recall_macro,0.72701
eval/runtime,9.0067
eval/samples_per_second,456.992
eval/steps_per_second,57.179
test/accuracy,0.71404


[34m[1mwandb[0m: Currently logged in as: [33mgal2361[0m ([33mgal2361-tel-aviv-university[0m). Use [1m`wandb login --relogin`[0m to force relogin


[I 2025-08-14 19:34:00,418] Trial 1 finished with value: 0.7230890768291834 and parameters: {'lr': 5.835883872710637e-06, 'weight_decay': 0.0008934494049267515, 'batch_size': 8, 'unfreeze_last_k': 5}. Best is trial 1 with value: 0.7230890768291834.


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision Macro,Recall Macro,F1 Macro
1,1.0469,1.145852,0.514577,0.522266,0.53718,0.527494
2,1.0507,0.988141,0.595238,0.621341,0.611203,0.610887
3,0.939,0.928167,0.618319,0.639117,0.635329,0.633951
4,0.9223,0.889277,0.645773,0.661989,0.668792,0.659935
5,0.9222,0.841363,0.669582,0.684143,0.686229,0.681654
6,0.6628,0.813497,0.686346,0.699177,0.697007,0.697096
7,0.8253,0.818073,0.687075,0.698877,0.704734,0.698326
8,0.7559,0.821033,0.68999,0.698864,0.709897,0.700721
9,0.7338,0.791103,0.703596,0.715944,0.714711,0.713306
10,0.6755,0.808498,0.694849,0.704098,0.713795,0.704608


VBox(children=(Label(value='0.005 MB of 0.005 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/accuracy,▁▄▅▆▇▇▇▇█████
eval/f1_macro,▁▄▅▆▇▇▇▇█████
eval/loss,█▅▄▃▂▂▂▂▁▂▁▁▁
eval/precision_macro,▁▅▅▆▇▇▇▇█████
eval/recall_macro,▁▄▅▆▇▇▇██████
eval/runtime,▁▁▅▇▃▅█▆▃▆▅▄█
eval/samples_per_second,██▄▂▆▄▁▃▆▃▄▅▁
eval/steps_per_second,██▄▂▆▄▁▃▆▃▄▅▁
test/accuracy,▁
test/f1_macro,▁

0,1
best_checkpoint_path,/content/drive/MyDri...
eval/accuracy,0.70554
eval/f1_macro,0.71525
eval/loss,0.78247
eval/precision_macro,0.71464
eval/recall_macro,0.72005
eval/runtime,9.2515
eval/samples_per_second,444.902
eval/steps_per_second,55.667
test/accuracy,0.70554


[I 2025-08-14 20:42:45,925] Trial 2 finished with value: 0.715253343829813 and parameters: {'lr': 1.6391467627791417e-06, 'weight_decay': 0.0011002739237243463, 'batch_size': 8, 'unfreeze_last_k': 9}. Best is trial 1 with value: 0.7230890768291834.


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision Macro,Recall Macro,F1 Macro
1,0.8671,0.875547,0.646501,0.674309,0.646659,0.658089


Epoch,Training Loss,Validation Loss,Accuracy,Precision Macro,Recall Macro,F1 Macro
1,0.8671,0.875547,0.646501,0.674309,0.646659,0.658089
2,0.6991,0.728902,0.714286,0.726535,0.729778,0.723987
3,0.6686,0.668399,0.758746,0.763314,0.774468,0.766116
4,0.6193,0.68984,0.757532,0.763579,0.782287,0.764567
5,0.6263,0.687948,0.770165,0.776087,0.791903,0.776212
6,0.3722,0.686736,0.803207,0.818717,0.805451,0.808361
7,0.4081,0.729888,0.801263,0.806671,0.814147,0.806222
8,0.4738,0.770124,0.815355,0.820916,0.824436,0.820624
9,0.357,0.794534,0.815355,0.82305,0.821505,0.820365
10,0.3666,0.912848,0.814383,0.815042,0.82964,0.818832


VBox(children=(Label(value='0.004 MB of 0.005 MB uploaded\r'), FloatProgress(value=0.7804061126229851, max=1.0…

0,1
eval/accuracy,▁▄▆▆▆▇▇██████
eval/f1_macro,▁▄▆▆▆▇▇██████
eval/loss,▇▃▁▂▂▂▃▄▄█▇██
eval/precision_macro,▁▃▅▅▆█▇██████
eval/recall_macro,▁▄▆▆▆▇▇██████
eval/runtime,▇█▃▁▄▄▆█▇▇▇▇▆
eval/samples_per_second,▂▁▆█▅▅▃▁▂▂▂▂▃
eval/steps_per_second,▂▁▆█▅▅▃▁▂▂▂▂▃
test/accuracy,▁
test/f1_macro,▁

0,1
best_checkpoint_path,/content/drive/MyDri...
eval/accuracy,0.81876
eval/f1_macro,0.82316
eval/loss,0.92253
eval/precision_macro,0.8199
eval/recall_macro,0.8318
eval/runtime,9.0757
eval/samples_per_second,453.517
eval/steps_per_second,56.745
test/accuracy,0.81876


[I 2025-08-14 22:01:22,270] Trial 3 finished with value: 0.8231579905512977 and parameters: {'lr': 5.123320842724201e-06, 'weight_decay': 9.231200936608533e-05, 'batch_size': 8, 'unfreeze_last_k': 11}. Best is trial 3 with value: 0.8231579905512977.


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision Macro,Recall Macro,F1 Macro
1,0.7434,0.792855,0.705782,0.717302,0.716255,0.713867
2,0.6522,0.588945,0.783285,0.79235,0.788532,0.7891
3,0.5201,0.583333,0.79932,0.804548,0.810992,0.805393
4,0.4332,0.557551,0.810496,0.814159,0.821272,0.815861
5,0.3288,0.55662,0.81171,0.817084,0.823239,0.817247
6,0.2138,0.692708,0.815841,0.823421,0.824121,0.821996
7,0.2756,0.776076,0.816084,0.818308,0.829803,0.821414
8,0.2591,0.817733,0.822157,0.824478,0.830605,0.826407
9,0.1586,0.906152,0.828717,0.825507,0.841344,0.831401
10,0.1289,1.060568,0.819242,0.818871,0.83136,0.822977


VBox(children=(Label(value='0.005 MB of 0.005 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/accuracy,▁▅▆▇▇▇▇██▇███
eval/f1_macro,▁▅▆▇▇▇▇██▇███
eval/loss,▄▁▁▁▁▃▄▄▅▇▇█▇
eval/precision_macro,▁▆▆▇▇█▇██▇███
eval/recall_macro,▁▅▆▇▇▇▇▇█▇███
eval/runtime,▅▆█▅▄▃▄▄▄▅▅▃▁
eval/samples_per_second,▄▃▁▄▅▆▅▅▅▄▄▆█
eval/steps_per_second,▄▃▁▄▅▆▅▅▅▄▄▆█
test/accuracy,▁
test/f1_macro,▁

0,1
best_checkpoint_path,/content/drive/MyDri...
eval/accuracy,0.8292
eval/f1_macro,0.83272
eval/loss,1.07449
eval/precision_macro,0.82917
eval/recall_macro,0.83722
eval/runtime,4.6656
eval/samples_per_second,882.198
eval/steps_per_second,55.298
test/accuracy,0.8292


[I 2025-08-14 22:39:10,471] Trial 4 finished with value: 0.8327161420298526 and parameters: {'lr': 2.116177565568772e-05, 'weight_decay': 0.0013266583844993018, 'batch_size': 16, 'unfreeze_last_k': 10}. Best is trial 4 with value: 0.8327161420298526.


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision Macro,Recall Macro,F1 Macro
1,1.0512,1.15756,0.512148,0.52423,0.534084,0.526643
2,1.0339,0.98672,0.597911,0.621484,0.613514,0.612937
3,0.9317,0.916884,0.623421,0.642529,0.643142,0.639151
4,0.9418,0.882688,0.646016,0.660888,0.671607,0.660333
5,0.9033,0.827769,0.6776,0.691127,0.694638,0.689558
6,0.6816,0.796118,0.694849,0.708611,0.704615,0.705379
7,0.871,0.791338,0.70311,0.714983,0.718725,0.713911
8,0.7918,0.803207,0.695335,0.705218,0.714801,0.706417
9,0.7483,0.77229,0.707483,0.719208,0.720044,0.717515
10,0.6714,0.786276,0.707969,0.718222,0.725588,0.718049


VBox(children=(Label(value='0.005 MB of 0.005 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/accuracy,▁▄▅▆▇▇█▇█████
eval/f1_macro,▁▄▅▆▇▇█▇█████
eval/loss,█▅▄▃▂▂▁▂▁▁▁▁▁
eval/precision_macro,▁▄▅▆▇▇█▇█████
eval/recall_macro,▁▄▅▆▇▇█▇█████
eval/runtime,▃▃▂▁▃▃▁▄▃▅▄█▅
eval/samples_per_second,▆▆▇█▆▆█▅▆▄▅▁▄
eval/steps_per_second,▆▆▇█▆▆█▅▆▄▅▁▄
test/accuracy,▁
test/f1_macro,▁

0,1
best_checkpoint_path,/content/drive/MyDri...
eval/accuracy,0.7155
eval/f1_macro,0.72521
eval/loss,0.76516
eval/precision_macro,0.72482
eval/recall_macro,0.73057
eval/runtime,9.0188
eval/samples_per_second,456.38
eval/steps_per_second,57.103
test/accuracy,0.7155


[I 2025-08-14 23:52:01,264] Trial 5 finished with value: 0.7252119789539709 and parameters: {'lr': 1.3832420864506064e-06, 'weight_decay': 2.5573086992434022e-05, 'batch_size': 8, 'unfreeze_last_k': 10}. Best is trial 4 with value: 0.8327161420298526.


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision Macro,Recall Macro,F1 Macro
1,1.0603,1.198467,0.488095,0.501751,0.50617,0.50021
2,1.0573,1.073816,0.555879,0.571923,0.576505,0.570088
3,1.0413,1.006433,0.58552,0.59773,0.608663,0.598998
4,0.9747,0.975373,0.603256,0.616588,0.626096,0.616554
5,0.9637,0.942176,0.616861,0.628516,0.63775,0.630158
6,0.7408,0.929921,0.626093,0.635668,0.644987,0.638502
7,0.9451,0.923141,0.630466,0.642027,0.65198,0.642969
8,0.9186,0.915006,0.633382,0.644586,0.656223,0.645835
9,0.8026,0.89459,0.644315,0.653327,0.662094,0.655723
10,0.8025,0.904813,0.642128,0.652021,0.663348,0.653646


VBox(children=(Label(value='0.005 MB of 0.005 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/accuracy,▁▄▅▆▇▇▇▇█████
eval/f1_macro,▁▄▅▆▇▇▇▇█████
eval/loss,█▅▄▃▂▂▂▁▁▁▁▁▁
eval/precision_macro,▁▄▅▆▇▇▇▇█████
eval/recall_macro,▁▄▅▆▇▇▇▇█████
eval/runtime,▃█▂▁▂▃▄▅▃▅█▄▄
eval/samples_per_second,▆▁▇█▇▆▅▄▆▃▁▅▅
eval/steps_per_second,▆▁▇█▇▆▅▄▆▃▁▅▅
test/accuracy,▁
test/f1_macro,▁

0,1
best_checkpoint_path,/content/drive/MyDri...
eval/accuracy,0.65015
eval/f1_macro,0.66169
eval/loss,0.89367
eval/precision_macro,0.66045
eval/recall_macro,0.6682
eval/runtime,8.8685
eval/samples_per_second,464.113
eval/steps_per_second,58.07
test/accuracy,0.65015


[I 2025-08-15 00:39:39,408] Trial 6 finished with value: 0.6616919266589616 and parameters: {'lr': 1.9943974355510454e-06, 'weight_decay': 4.239212460215961e-06, 'batch_size': 8, 'unfreeze_last_k': 5}. Best is trial 4 with value: 0.8327161420298526.


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision Macro,Recall Macro,F1 Macro
1,0.9204,0.919224,0.632896,0.656238,0.634036,0.640768
2,0.7009,0.75935,0.70967,0.717965,0.719458,0.717644
3,0.7019,0.809576,0.687561,0.700733,0.713801,0.698638
4,0.57,0.73124,0.730807,0.736123,0.74906,0.738482
5,0.4826,0.729205,0.748785,0.762394,0.760077,0.757755
6,0.3655,0.719578,0.762148,0.772645,0.769026,0.769253
7,0.3731,0.743991,0.764577,0.776139,0.771552,0.771904
8,0.269,0.832721,0.755831,0.761845,0.772411,0.763516
9,0.2525,0.849646,0.759232,0.768776,0.769447,0.766702
10,0.1887,0.898493,0.768222,0.772241,0.779673,0.774207


VBox(children=(Label(value='0.005 MB of 0.032 MB uploaded\r'), FloatProgress(value=0.14241882633303546, max=1.…

0,1
eval/accuracy,▁▅▄▆▇▇█▇▇████
eval/f1_macro,▁▅▄▆▇██▇▇████
eval/loss,█▂▄▁▁▁▂▅▅▇▇█▇
eval/precision_macro,▁▄▄▆▇██▇▇████
eval/recall_macro,▁▅▅▇▇▇██▇████
eval/runtime,▂▃▄▄▁▄▁▄█▆▆▁▅
eval/samples_per_second,▇▆▅▅█▅█▅▁▃▃█▄
eval/steps_per_second,▇▆▅▅█▅█▅▁▃▃█▄
test/accuracy,▁
test/f1_macro,▁

0,1
best_checkpoint_path,/content/drive/MyDri...
eval/accuracy,0.77235
eval/f1_macro,0.7785
eval/loss,0.89479
eval/precision_macro,0.78005
eval/recall_macro,0.77872
eval/runtime,2.915
eval/samples_per_second,1412.03
eval/steps_per_second,44.255
test/accuracy,0.77235


[I 2025-08-15 00:55:05,309] Trial 7 finished with value: 0.7785041186605548 and parameters: {'lr': 2.610632474755213e-05, 'weight_decay': 0.0006673848938398822, 'batch_size': 32, 'unfreeze_last_k': 6}. Best is trial 4 with value: 0.8327161420298526.


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision Macro,Recall Macro,F1 Macro
1,1.0991,1.139881,0.516035,0.544789,0.527688,0.528454
2,0.9294,0.955267,0.611516,0.640953,0.625935,0.62553
3,0.9181,0.883455,0.650875,0.667129,0.662939,0.664017
4,0.87,0.827566,0.682216,0.69913,0.69223,0.693861
5,0.7691,0.822723,0.694606,0.706135,0.707371,0.704896
6,0.567,0.824898,0.713071,0.726286,0.721222,0.722429
7,0.6346,0.82969,0.712828,0.7264,0.723155,0.721668
8,0.602,0.879458,0.710155,0.718267,0.728095,0.719649
9,0.6493,0.857505,0.728134,0.73897,0.735374,0.736339
10,0.8312,0.929778,0.722303,0.731385,0.73801,0.730637


VBox(children=(Label(value='0.005 MB of 0.005 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/accuracy,▁▄▅▆▇▇▇▇█████
eval/f1_macro,▁▄▆▇▇▇▇▇█████
eval/loss,█▄▂▁▁▁▁▂▂▃▃▃▃
eval/precision_macro,▁▄▅▇▇██▇█████
eval/recall_macro,▁▄▅▆▇▇▇██████
eval/runtime,▇▅█▅▂█▂▁▁▁▂▃▅
eval/samples_per_second,▁▄▁▄▇▁▇███▇▆▄
eval/steps_per_second,▁▄▁▄▇▁▇███▇▆▄
test/accuracy,▁
test/f1_macro,▁

0,1
best_checkpoint_path,/content/drive/MyDri...
eval/accuracy,0.72935
eval/f1_macro,0.73741
eval/loss,0.92492
eval/precision_macro,0.73713
eval/recall_macro,0.74231
eval/runtime,17.2946
eval/samples_per_second,237.993
eval/steps_per_second,59.498
test/accuracy,0.72935


[I 2025-08-15 03:08:04,260] Trial 8 finished with value: 0.7374090216508955 and parameters: {'lr': 1.7113059157800417e-06, 'weight_decay': 0.001555166316467969, 'batch_size': 4, 'unfreeze_last_k': 9}. Best is trial 4 with value: 0.8327161420298526.


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision Macro,Recall Macro,F1 Macro
1,1.0311,1.054422,0.601798,0.653249,0.590407,0.601471
2,0.6187,0.871067,0.677114,0.692597,0.697299,0.688367
3,0.7635,0.922238,0.713071,0.72821,0.724411,0.722532
4,0.839,0.913123,0.738824,0.753482,0.744495,0.746804
5,0.6302,1.049894,0.747328,0.754647,0.759797,0.753856
6,0.4784,1.128367,0.775996,0.790187,0.775669,0.781703
7,0.5815,1.340178,0.746356,0.756416,0.762277,0.754238
8,0.7171,1.419767,0.76725,0.773014,0.776471,0.773564
9,0.5238,1.395127,0.776239,0.781394,0.785156,0.782481
10,0.2988,1.568707,0.767979,0.772925,0.779721,0.774306


VBox(children=(Label(value='0.005 MB of 0.005 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/accuracy,▁▄▅▆▇█▇██████
eval/f1_macro,▁▄▆▇▇█▇██████
eval/loss,▃▁▁▁▃▃▅▆▆███▆
eval/precision_macro,▁▃▅▆▆█▆▇█▇█▇█
eval/recall_macro,▁▅▆▇▇█▇██████
eval/runtime,▁▃▃▂▂▄██▇▇▇▇▇
eval/samples_per_second,█▆▆▇▆▅▁▁▂▂▂▂▂
eval/steps_per_second,█▆▆▇▆▅▁▁▂▂▂▂▂
test/accuracy,▁
test/f1_macro,▁

0,1
best_checkpoint_path,/content/drive/MyDri...
eval/accuracy,0.77624
eval/f1_macro,0.78248
eval/loss,1.39513
eval/precision_macro,0.78139
eval/recall_macro,0.78516
eval/runtime,17.7644
eval/samples_per_second,231.699
eval/steps_per_second,57.925
test/accuracy,0.77624


[I 2025-08-15 04:52:02,621] Trial 9 finished with value: 0.7824809548211282 and parameters: {'lr': 1.1084018426692404e-05, 'weight_decay': 2.474978389432081e-06, 'batch_size': 4, 'unfreeze_last_k': 6}. Best is trial 4 with value: 0.8327161420298526.


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision Macro,Recall Macro,F1 Macro
1,1.0285,1.03125,0.608601,0.642721,0.603249,0.611468
2,0.6594,0.880433,0.672012,0.688605,0.689598,0.683577
3,0.733,0.962123,0.689504,0.709995,0.703495,0.701779
4,0.6553,0.93799,0.734694,0.761241,0.730076,0.740204
5,0.7811,1.119218,0.72449,0.731007,0.745081,0.732086
6,0.3297,1.279726,0.75413,0.765458,0.76024,0.761383
7,0.886,1.349179,0.738824,0.749063,0.754582,0.748237
8,0.7972,1.527275,0.745627,0.753706,0.761328,0.754478
9,0.6514,1.528935,0.759475,0.766999,0.768055,0.766608
10,0.4414,1.7371,0.743683,0.748094,0.760741,0.750883


VBox(children=(Label(value='0.005 MB of 0.032 MB uploaded\r'), FloatProgress(value=0.14247439866634912, max=1.…

0,1
eval/accuracy,▁▄▅▇▆█▇▇█▇███
eval/f1_macro,▁▄▅▇▆█▇▇█▇███
eval/loss,▂▁▂▁▃▄▅▆▆███▆
eval/precision_macro,▁▄▅█▆█▇▇█▇███
eval/recall_macro,▁▅▅▆▇█▇██████
eval/runtime,▂▂▁█▃▂▄▅▂▁▄▃▂
eval/samples_per_second,▇▇█▁▆▇▅▄▇█▄▆▇
eval/steps_per_second,▇▇█▁▆▇▅▄▇█▄▆▇
test/accuracy,▁
test/f1_macro,▁

0,1
best_checkpoint_path,/content/drive/MyDri...
eval/accuracy,0.75948
eval/f1_macro,0.76661
eval/loss,1.52893
eval/precision_macro,0.767
eval/recall_macro,0.76806
eval/runtime,17.6731
eval/samples_per_second,232.897
eval/steps_per_second,58.224
test/accuracy,0.75948


[I 2025-08-15 06:27:32,896] Trial 10 finished with value: 0.7666080469440132 and parameters: {'lr': 1.3091234039885813e-05, 'weight_decay': 2.5215428074811555e-05, 'batch_size': 4, 'unfreeze_last_k': 5}. Best is trial 4 with value: 0.8327161420298526.


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision Macro,Recall Macro,F1 Macro
1,0.7,0.656506,0.763848,0.777381,0.764887,0.770219
2,0.6155,0.52376,0.810253,0.81488,0.822592,0.816656
3,0.4447,0.535911,0.821914,0.837986,0.816906,0.825056
4,0.3545,0.557746,0.816812,0.816598,0.831938,0.822007
5,0.3398,0.584879,0.818999,0.817506,0.837052,0.823511
6,0.1834,0.693131,0.837707,0.837085,0.850727,0.841776
7,0.2064,0.741163,0.83965,0.836337,0.8557,0.842721
8,0.2021,0.794689,0.850826,0.848953,0.86154,0.8543
9,0.1292,0.757042,0.843052,0.841507,0.854244,0.846963
10,0.1189,0.845469,0.856657,0.856183,0.864783,0.860096


VBox(children=(Label(value='0.005 MB of 0.005 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/accuracy,▁▄▅▅▅▇▇█▇████
eval/f1_macro,▁▅▅▅▅▇▇█▇████
eval/loss,▃▁▁▂▂▄▅▆▅▆██▆
eval/precision_macro,▁▄▆▄▅▆▆▇▇█▇██
eval/recall_macro,▁▅▅▆▆▇▇█▇████
eval/runtime,▂▁▂▂▃▄▃▂▅▆█▄▃
eval/samples_per_second,▇█▇▇▅▅▆▇▃▃▁▅▆
eval/steps_per_second,▇█▇▇▅▅▆▇▃▃▁▅▆
test/accuracy,▁
test/f1_macro,▁

0,1
best_checkpoint_path,/content/drive/MyDri...
eval/accuracy,0.85666
eval/f1_macro,0.8601
eval/loss,0.84547
eval/precision_macro,0.85618
eval/recall_macro,0.86478
eval/runtime,4.9113
eval/samples_per_second,838.071
eval/steps_per_second,52.532
test/accuracy,0.85666


[I 2025-08-15 07:11:01,240] Trial 11 finished with value: 0.8600960158069137 and parameters: {'lr': 2.960627147251281e-05, 'weight_decay': 0.008806331558971375, 'batch_size': 16, 'unfreeze_last_k': 12}. Best is trial 11 with value: 0.8600960158069137.


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision Macro,Recall Macro,F1 Macro
1,0.6546,0.701455,0.749271,0.778796,0.739522,0.753022
2,0.6034,0.518303,0.81414,0.824495,0.816419,0.819664
3,0.3985,0.599315,0.810496,0.825688,0.818536,0.818176
4,0.3662,0.521449,0.828474,0.829179,0.840511,0.833833
5,0.2772,0.551775,0.826045,0.832819,0.835618,0.832056
6,0.1495,0.647307,0.833576,0.836618,0.84495,0.838848
7,0.192,0.829124,0.836249,0.83482,0.852045,0.840996
8,0.2424,0.841032,0.844266,0.843401,0.855675,0.848347
9,0.0818,0.881253,0.836735,0.840671,0.843932,0.840825
10,0.1036,0.956057,0.847182,0.846053,0.857261,0.85104


VBox(children=(Label(value='0.005 MB of 0.005 MB uploaded\r'), FloatProgress(value=1.0, max=1.0)))

0,1
eval/accuracy,▁▆▅▇▆▇▇█▇████
eval/f1_macro,▁▆▆▇▇▇▇█▇████
eval/loss,▄▁▂▁▁▃▅▅▆▇▇█▇
eval/precision_macro,▁▆▆▆▆▇▇▇▇████
eval/recall_macro,▁▆▆▇▇▇██▇████
eval/runtime,▁▃▂▃▆▆▄▂▃▃▄██
eval/samples_per_second,█▆▇▆▃▃▅▇▆▆▅▁▁
eval/steps_per_second,█▆▇▆▃▃▅▇▆▆▅▁▁
test/accuracy,▁
test/f1_macro,▁

0,1
best_checkpoint_path,/content/drive/MyDri...
eval/accuracy,0.84888
eval/f1_macro,0.85281
eval/loss,0.96977
eval/precision_macro,0.8487
eval/recall_macro,0.85854
eval/runtime,5.0738
eval/samples_per_second,811.226
eval/steps_per_second,50.849
test/accuracy,0.84888


[I 2025-08-15 07:54:33,513] Trial 12 finished with value: 0.852805045977085 and parameters: {'lr': 2.892824570696833e-05, 'weight_decay': 0.00979168138442126, 'batch_size': 16, 'unfreeze_last_k': 12}. Best is trial 11 with value: 0.8600960158069137.
Best trial: 11 F1_macro: 0.8600960158069137


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


KeyError: 'warmup_ratio'

# ✅ Final Training – Best Optuna Trials

We selected the **top-performing configurations** from our Optuna study for RoBERTa-base with high unfrozen layers.  

### 🔬 Best Trials and Hyperparameters
- **Trial 12** → `{'lr': 2.8928e-05, 'weight_decay': 9.79e-03, 'batch_size': 16, 'unfreeze_last_k': 12}`  
  → **F1_macro = 0.8528**  

- **Trial 11** → `{'lr': 2.9606e-05, 'weight_decay': 8.81e-03, 'batch_size': 16, 'unfreeze_last_k': 12}`  
  → **F1_macro = 0.8601**  🏆 **Best**  

- **Trial 4** → `{'lr': 2.1162e-05, 'weight_decay': 1.33e-03, 'batch_size': 16, 'unfreeze_last_k': 10}`  
  → **F1_macro = 0.8327**  

### 📈 Summary
- The **best-performing trial was #11**, achieving **F1_macro = 0.8601**,  
  confirming that **larger numbers of unfrozen layers (k=12)** and carefully tuned learning rates yield the strongest results.  
- Both Trial 11 and Trial 12 converge to high performance with similar hyperparameters,  
  while Trial 4 (with only 10 unfrozen layers) lagged behind.  

➡️ We now compare all 3 models with **final evaluation**.  


# Load Best Model

In [7]:
import os
import torch
import pandas as pd
from transformers import RobertaTokenizer, RobertaForSequenceClassification, Trainer, TrainingArguments, EarlyStoppingCallback
from datasets import Dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from google.colab import drive

# ====== DRIVE SETUP ======
# drive.mount("/content/drive")
# SAVE_DIR = "/content/drive/MyDrive/adv_dl_models_final2_best"
SAVE_DIR = "adv_dl_models_final2_best"
os.makedirs(SAVE_DIR, exist_ok=True)

# ====== CLEANING FUNCTION ======
def prep_df(df):
    # Drop NaNs in text and label column
    df = df.dropna(subset=["OriginalTweet", "Sentiment"]).copy()
    # Make sure text is string
    df["OriginalTweet"] = df["OriginalTweet"].astype(str).str.strip()
    return df

df_train = prep_df(df_train)
df_test = prep_df(df_test)

# ====== MAP SENTIMENT TO LABELS ======
label2id = {label: idx for idx, label in enumerate(df_train["Sentiment"].unique())}
id2label = {idx: label for label, idx in label2id.items()}

df_train["label"] = df_train["Sentiment"].map(label2id)
df_test["label"]  = df_test["Sentiment"].map(label2id)

# ====== SPLIT ======
df_train_split, df_val_split = train_test_split(
    df_train,
    test_size=0.1,
    random_state=42,
    stratify=df_train["label"]
)

# ====== TOKENIZER ======
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

def tokenize(batch):
    return tokenizer(batch["OriginalTweet"], padding="max_length", truncation=True)

train_dataset = Dataset.from_pandas(df_train_split).map(tokenize, batched=True)
val_dataset   = Dataset.from_pandas(df_val_split).map(tokenize, batched=True)
test_dataset  = Dataset.from_pandas(df_test).map(tokenize, batched=True)

for ds in [train_dataset, val_dataset, test_dataset]:
    ds.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

# ====== METRICS ======
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = torch.argmax(torch.tensor(logits), dim=-1)
    acc = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}

# ====== BEST PARAMS ======
best_params_list = [
    {'lr': 2.892824570696833e-05, 'weight_decay': 0.00979168138442126, 'batch_size': 16, 'unfreeze_last_k': 12},
    {'lr': 2.960627147251281e-05, 'weight_decay': 0.008806331558971375, 'batch_size': 16, 'unfreeze_last_k': 12},
    {'lr': 2.116177565568772e-05, 'weight_decay': 0.0013266583844993018, 'batch_size': 16, 'unfreeze_last_k': 10}
]

# ====== TRAIN/EVAL LOOP ======
for i, params in enumerate(best_params_list, start=1):
    print(f"\n=== Training model {i} with params: {params} ===")

    # Load base model
    model = RobertaForSequenceClassification.from_pretrained(
        "roberta-base",
        num_labels=len(label2id),
        id2label=id2label,
        label2id=label2id
    )

    # Freeze all except last k layers
    for param in model.roberta.parameters():
        param.requires_grad = False
    for layer in model.roberta.encoder.layer[-params["unfreeze_last_k"]:]:
        for param in layer.parameters():
            param.requires_grad = True

    # Training args
    training_args = TrainingArguments(
        output_dir=f"./results_model_{i}",
        evaluation_strategy="epoch",
        save_strategy="epoch",
        save_total_limit=1,
        learning_rate=params["lr"],
        per_device_train_batch_size=params["batch_size"],
        per_device_eval_batch_size=params["batch_size"],
        weight_decay=params["weight_decay"],
        num_train_epochs=12,
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        greater_is_better=True,
        logging_dir=f"./logs_model_{i}",
        logging_strategy="epoch",
        seed=42
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=4)]
    )

    # Train
    trainer.train()

    # Evaluate on test
    test_metrics = trainer.evaluate(test_dataset)
    print(f"Test metrics for model {i}: {test_metrics}")

    # Per-class classification report
    preds_output = trainer.predict(test_dataset)
    y_true = preds_output.label_ids
    y_pred = preds_output.predictions.argmax(axis=-1)
    print(f"\nClassification report for model {i}:\n")
    print(classification_report(
        y_true, y_pred,
        target_names=[id2label[idx] for idx in sorted(id2label.keys())],
        digits=4
    ))

    # Save model weights
    save_path = os.path.join(SAVE_DIR, f"roberta_base_best_set{i}.pt")
    torch.save(model.state_dict(), save_path)
    print(f"Model {i} saved to {save_path}")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).




Map:   0%|          | 0/37039 [00:00<?, ? examples/s]

Map:   0%|          | 0/4116 [00:00<?, ? examples/s]

Map:   0%|          | 0/3798 [00:00<?, ? examples/s]


=== Training model 1 with params: {'lr': 2.892824570696833e-05, 'weight_decay': 0.00979168138442126, 'batch_size': 16, 'unfreeze_last_k': 12} ===


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.8942,0.635982,0.757775,0.761935,0.757775,0.756523
2,0.5621,0.583375,0.780612,0.788107,0.780612,0.779218
3,0.4241,0.53206,0.813168,0.820613,0.813168,0.81279
4,0.3386,0.606171,0.826045,0.831789,0.826045,0.826484
5,0.2743,0.574704,0.842323,0.845523,0.842323,0.842639
6,0.2326,0.702069,0.838678,0.840968,0.838678,0.838295
7,0.2064,0.706372,0.836978,0.838296,0.836978,0.836685
8,0.1703,0.823484,0.837707,0.837741,0.837707,0.83716
9,0.1462,0.83045,0.849611,0.85127,0.849611,0.849209
10,0.1137,0.946641,0.84208,0.84428,0.84208,0.841716


Test metrics for model 1: {'eval_loss': 0.9277804493904114, 'eval_accuracy': 0.8267509215376514, 'eval_precision': 0.8278203596672534, 'eval_recall': 0.8267509215376514, 'eval_f1': 0.8260394900565873, 'eval_runtime': 30.0954, 'eval_samples_per_second': 126.199, 'eval_steps_per_second': 7.908, 'epoch': 12.0}

Classification report for model 1:

                    precision    recall  f1-score   support

           Neutral     0.8077    0.8481    0.8274       619
          Positive     0.8181    0.7930    0.8054       947
Extremely Negative     0.8050    0.9274    0.8619       592
          Negative     0.8456    0.7733    0.8078      1041
Extremely Positive     0.8557    0.8514    0.8536       599

          accuracy                         0.8268      3798
         macro avg     0.8264    0.8387    0.8312      3798
      weighted avg     0.8278    0.8268    0.8260      3798

Model 1 saved to /content/drive/MyDrive/adv_dl_models_final2_best/roberta_base_best_set1.pt

=== Training model

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.8929,0.62528,0.761905,0.768574,0.761905,0.761326
2,0.5638,0.57593,0.792517,0.798034,0.792517,0.791802
3,0.4259,0.472377,0.838678,0.841494,0.838678,0.838933
4,0.3299,0.536025,0.837949,0.841105,0.837949,0.837726
5,0.2758,0.579265,0.85277,0.855236,0.85277,0.853175
6,0.2366,0.669635,0.844509,0.845272,0.844509,0.844414
7,0.2024,0.731183,0.838921,0.839094,0.838921,0.838673
8,0.1715,0.816128,0.837707,0.838238,0.837707,0.837081
9,0.1398,0.835213,0.853013,0.855074,0.853013,0.8526


Test metrics for model 2: {'eval_loss': 0.6522133946418762, 'eval_accuracy': 0.8354397051079515, 'eval_precision': 0.839288564650419, 'eval_recall': 0.8354397051079515, 'eval_f1': 0.8358491465662637, 'eval_runtime': 30.0756, 'eval_samples_per_second': 126.282, 'eval_steps_per_second': 7.913, 'epoch': 9.0}

Classification report for model 2:

                    precision    recall  f1-score   support

           Neutral     0.8910    0.8320    0.8605       619
          Positive     0.7811    0.8627    0.8199       947
Extremely Negative     0.8278    0.8767    0.8515       592
          Negative     0.8223    0.8136    0.8180      1041
Extremely Positive     0.9188    0.7930    0.8513       599

          accuracy                         0.8354      3798
         macro avg     0.8482    0.8356    0.8402      3798
      weighted avg     0.8393    0.8354    0.8358      3798

Model 2 saved to /content/drive/MyDrive/adv_dl_models_final2_best/roberta_base_best_set2.pt

=== Training model 3

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.9501,0.858888,0.662779,0.689754,0.662779,0.657556
2,0.6451,0.633672,0.771866,0.780957,0.771866,0.770473
3,0.4904,0.588135,0.79033,0.800542,0.79033,0.791005
4,0.3921,0.520049,0.831633,0.834469,0.831633,0.832227
5,0.3169,0.57596,0.830661,0.832655,0.830661,0.830571
6,0.2597,0.613129,0.834791,0.837347,0.834791,0.835327
7,0.2228,0.797759,0.81414,0.820095,0.81414,0.813567
8,0.1943,0.805075,0.827745,0.828603,0.827745,0.827017
9,0.1682,0.922419,0.825316,0.826894,0.825316,0.824787
10,0.1386,1.022027,0.829689,0.831191,0.829689,0.829142


Test metrics for model 3: {'eval_loss': 0.7150322198867798, 'eval_accuracy': 0.8006845708267509, 'eval_precision': 0.8050816509396402, 'eval_recall': 0.8006845708267509, 'eval_f1': 0.8010498145434137, 'eval_runtime': 30.0834, 'eval_samples_per_second': 126.249, 'eval_steps_per_second': 7.911, 'epoch': 10.0}

Classification report for model 3:

                    precision    recall  f1-score   support

           Neutral     0.8664    0.7754    0.8184       619
          Positive     0.7483    0.8099    0.7779       947
Extremely Negative     0.8052    0.8868    0.8441       592
          Negative     0.7687    0.7791    0.7739      1041
Extremely Positive     0.8945    0.7646    0.8245       599

          accuracy                         0.8007      3798
         macro avg     0.8166    0.8032    0.8077      3798
      weighted avg     0.8051    0.8007    0.8010      3798

Model 3 saved to /content/drive/MyDrive/adv_dl_models_final2_best/roberta_base_best_set3.pt


# 📊 Final RoBERTa Model Comparison

| Model | Learning Rate | Weight Decay | Batch Size | Unfrozen Layers | Epochs | Accuracy | Precision | Recall | F1 Score |
|-------|---------------|--------------|------------|-----------------|--------|----------|-----------|--------|----------|
| **1** | 2.89e-05      | 9.79e-03     | 16         | 12              | 12     | 0.8268   | 0.8278    | 0.8268 | 0.8260   |
| **2 🏆** | 2.96e-05      | 8.81e-03     | 16         | 12              | 9      | **0.8354** | **0.8393** | **0.8354** | **0.8358** |
| **3** | 2.12e-05      | 1.33e-03     | 16         | 10              | 10     | 0.8007   | 0.8051    | 0.8007 | 0.8010   |

---

### 🔎 Key Insights
- **Model 2** is the best: highest **Accuracy (83.5%)** and **F1 (0.836)**.  
- **Model 1** is strong but slightly behind (82.6%).  
- **Model 3** lags (80.1%), confirming that **fewer unfrozen layers (10) reduce performance**.
