<center><br><br>
<font size=6>🎓 <b>Advanced Deep Learning - NLP Final Project</b></font><br>
<font size=6>⚖️  <b>Training - roberta-base EX4</b></font><br>
<font size=5>👥 <b>Group W</b></font><br><br>
<b>Adi Shalit</b>, ID: <code>206628885</code><br>
<b>Gal Gussarsky</b>, ID: <code>206453540</code><br><br>
<font size=4>📘 Course ID: <code>05714184</code></font><br>
<font size=4>📅 Spring 2025</font>
<br><br>
<hr style="width:60%; border:1px solid gray;"></center>


# 📑 Table of Contents

- [Training](#Training)
- [Load best Model & Test](#Load-Best-Model)




## Load Dataset

In [1]:
import pandas as pd


# Paths to your CSV files
train_path = "Corona_NLP_train_cleaned_translated.csv"
test_path  = "Corona_NLP_test_cleaned_translated.csv"

# Load into DataFrames
df_train = pd.read_csv(train_path)
df_test = pd.read_csv(test_path)

# Check first rows
print(df_train.head())
print(df_test.head())


   UserName  ScreenName   Location     TweetAt  \
0      3799       48751     London  16-03-2020   
1      3800       48752         UK  16-03-2020   
2      3801       48753  Vagabonds  16-03-2020   
3      3802       48754        NaN  16-03-2020   
4      3803       48755        NaN  16-03-2020   

                                       OriginalTweet           Sentiment  \
0            @MeNyrbie @Phil_Gahan @Chrisitv and and             Neutral   
1  advice Talk to your neighbours family to excha...            Positive   
2  covid Australia: Woolworths to give elderly, d...            Positive   
3  My food stock is not the only one which is emp...            Positive   
4  Me, ready to go at supermarket during the covi...  Extremely Negative   

  DetectedLang  
0           en  
1           en  
2           en  
3           en  
4           en  
   UserName  ScreenName             Location     TweetAt  \
0         1       44953                  NYC  02-03-2020   
1         2       44

# Training

<font size=6>📊 <b>Training — RoBERTa-base (Exercise‑4 Style)</b></font><br>

**Why RoBERTa?**  
RoBERTa (“Robustly Optimized BERT”) is a 12‑layer Transformer **encoder** (hidden size 768, 12 heads) trained with **dynamic masking**, **longer sequences**, **bigger batches**, and **no NSP**. In practice, it’s a strong monolingual baseline that usually converges fast and gives reliable sentiment results.

---

## 🎯 Goal
Build a **clean monolingual baseline** on our 5 sentiment labels, using a **custom training loop** so we can control freezing, logging, and early stopping just like Ex.4.

---

## 🧩 What we do 
- 🧠 **Backbone**: `roberta-base` + a linear **classification head** (5 classes).  
- 🧊 **Freeze → Unfreeze**: freeze the base and **unfreeze the last *k* encoder blocks** (k is a hyperparameter).  
- ⚙️ **Custom loop**: PyTorch + AMP (mixed precision), gradient clipping, linear LR warmup/decay.  
- 🛑 **Early stopping**: on **validation macro‑F1** to avoid overfitting.  
- 🔎 **Optuna search**: tune **LR**, **weight decay**, **epochs**, **patience**, and **unfreeze depth (k)**.  
- 📒 **W&B tracking**: log losses, metrics, best checkpoint path, and final test scores.  
- 🧪 **Data split**: stratified Train/Val from `df_train`, then evaluate once on `df_test`.
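The "linear LR warmup/decay" item above can be sketched in plain Python. This is a minimal illustration of the multiplier that a linear warmup schedule applies to the base learning rate, assuming the same 6% warmup ratio used in this notebook; the function name `linear_warmup_decay` and the step count of 1000 are hypothetical and chosen only for the example.

```python
def linear_warmup_decay(step: int, total_steps: int, warmup_ratio: float = 0.06) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp linearly from 0 up to 1 during warmup.
        return step / max(1, warmup_steps)
    # After warmup, decay linearly from 1 down to 0 at total_steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


total = 1000  # hypothetical number of optimizer steps
for s in (0, 30, 60, 530, 1000):
    print(s, round(linear_warmup_decay(s, total), 3))
```

With a base LR of 4e-5, for example, the effective LR at a given step would be 4e-5 times this multiplier.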

---

## ✅ What we expect
- Best LR in the classic **1e‑6 → 5e‑5** band.  
- **Deeper unfreezing** (larger *k*) tends to help once the head stabilizes.  
- Macro‑F1 is our north star; we’ll compare this RoBERTa baseline to mDeBERTa later.
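Since macro-F1 drives model selection here, a small worked example may help. The sketch below computes per-class F1 from true/false positive counts and averages them with equal weight per class. The helper `macro_f1` and the toy labels are illustrative assumptions, not part of the notebook. Unlike scikit-learn, it always averages over a fixed `num_classes`, so a class absent from both inputs contributes 0.

```python
from collections import Counter


def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores (a class with no counts scores 0)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in range(num_classes):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / num_classes


# Toy example with 3 classes: per-class F1 is 2/3, 2/3 and 1.
print(macro_f1([0, 0, 1, 2], [0, 1, 1, 2], num_classes=3))
```

Because every class counts equally, a model that neglects a small class is penalized more under macro-F1 than under accuracy.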




In [None]:
# =========================
# ADV DL – Part B: Monolingual baseline (RoBERTa) – Exercise-4 style
# Custom loop + early stopping + W&B + Optuna ONLY; freeze base, unfreeze last k layers
# Uses df_train / df_test with columns: OriginalTweet (str), Sentiment (str)
# =========================

import os, math, random, time, json
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report, confusion_matrix
import torch
from torch.utils.data import Dataset, DataLoader
from torch.cuda.amp import autocast, GradScaler

# ---- deps ----
!pip -q install transformers==4.43.3 optuna==3.6.1 wandb==0.17.5 >/dev/null

import transformers
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    DataCollatorWithPadding, get_linear_schedule_with_warmup
)

import optuna
import wandb

# from google.colab import drive
# drive.mount("/content/drive")
DRIVE_OUT_DIR = "adv_dl_models"
os.makedirs(DRIVE_OUT_DIR, exist_ok=True)

# -------------------------
# Constants (no CFG, Optuna-only workflow)
# -------------------------
MODEL_NAME = "roberta-base"
MAX_LEN = 512
BATCH_SIZE = 16
WARMUP_RATIO = 0.06
GRAD_CLIP = 1.0
USE_AMP = True
PROJECT = "adv-dl-p1"
BASE_RUN_NAME = "roberta-base_ex4_style"
TRIALS = 12
SEED = 42
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def set_seed(seed=42):
    random.seed(seed); np.random.seed(seed)
    torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(SEED)

# -------------------------
# Label mapping (5-way sentiment)
# -------------------------
CANON = {
    "extremely negative": "extremely negative",
    "negative": "negative",
    "neutral": "neutral",
    "positive": "positive",
    "extremely positive": "extremely positive",
}
ORDER = ["extremely negative","negative","neutral","positive","extremely positive"]
LABEL2ID = {lab: i for i, lab in enumerate(ORDER)}
ID2LABEL = {i: lab for lab, i in LABEL2ID.items()}

def normalize_label(s: str) -> str:
    s = str(s).strip().lower()
    s = s.replace("very negative", "extremely negative")
    s = s.replace("very positive", "extremely positive")
    s = s.replace("extreme negative", "extremely negative")
    s = s.replace("extreme positive", "extremely positive")
    return CANON.get(s, s)

# -------------------------
# Expect df_train, df_test in memory
# -------------------------
assert "OriginalTweet" in df_train.columns and "Sentiment" in df_train.columns, "df_train missing required columns"
assert "OriginalTweet" in df_test.columns and "Sentiment" in df_test.columns, "df_test missing required columns"

def prep_df(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df = df.dropna(subset=["OriginalTweet", "Sentiment"])
    df["text"] = df["OriginalTweet"].astype(str).str.strip()
    df["label_name"] = df["Sentiment"].apply(normalize_label)
    df = df[df["label_name"].isin(ORDER)].reset_index(drop=True)
    df["label"] = df["label_name"].map(LABEL2ID)
    return df[["text", "label", "label_name"]]

dftrain_ = prep_df(df_train)
dftest_  = prep_df(df_test)

train_df, val_df = train_test_split(
    dftrain_, test_size=0.1, stratify=dftrain_["label"], random_state=SEED
)
print(f"Train/Val/Test sizes: {len(train_df)}/{len(val_df)}/{len(dftest_)}")

# -------------------------
# Dataset & Collator
# -------------------------
class TweetDataset(Dataset):
    def __init__(self, df: pd.DataFrame, tokenizer: transformers.PreTrainedTokenizerBase, max_len: int):
        self.texts = df["text"].tolist()
        self.labels = df["label"].tolist()
        self.tok = tokenizer
        self.max_len = max_len
    def __len__(self): return len(self.texts)
    def __getitem__(self, idx):
        enc = self.tok(self.texts[idx], truncation=True, max_length=self.max_len, padding=False)
        enc["labels"] = self.labels[idx]
        return {k: torch.tensor(v) for k, v in enc.items()}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
train_ds = TweetDataset(train_df, tokenizer, MAX_LEN)
val_ds   = TweetDataset(val_df, tokenizer, MAX_LEN)
test_ds  = TweetDataset(dftest_, tokenizer, MAX_LEN)

collate_fn = DataCollatorWithPadding(tokenizer=tokenizer)
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True,  collate_fn=collate_fn, num_workers=2, pin_memory=True)
val_loader   = DataLoader(val_ds,   batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn, num_workers=2, pin_memory=True)
test_loader  = DataLoader(test_ds,  batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn, num_workers=2, pin_memory=True)

# -------------------------
# Model & Freeze/Unfreeze strategy
# -------------------------
def build_model(num_unfreeze_last_layers: int = 4):
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=len(ORDER), id2label=ID2LABEL, label2id=LABEL2ID
    )
    base = getattr(model, "roberta", None) or getattr(model, "bert", None)
    if base is not None:
        for p in base.parameters(): p.requires_grad = False
        if hasattr(base, "encoder") and hasattr(base.encoder, "layer"):
            k = num_unfreeze_last_layers
            if k > 0:
                for layer in base.encoder.layer[-k:]:
                    for p in layer.parameters(): p.requires_grad = True
    for p in model.classifier.parameters(): p.requires_grad = True
    return model.to(DEVICE)

# -------------------------
# Train / Eval utilities
# -------------------------
def get_optimizer_scheduler(model, num_training_steps: int, lr: float, weight_decay: float):
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {"params": [p for n, p in model.named_parameters() if p.requires_grad and not any(nd in n for nd in no_decay)], "weight_decay": weight_decay},
        {"params": [p for n, p in model.named_parameters() if p.requires_grad and any(nd in n for nd in no_decay)],  "weight_decay": 0.0},
    ]
    optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=lr)
    num_warmup = int(num_training_steps * WARMUP_RATIO)
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup, num_training_steps=num_training_steps)
    return optimizer, scheduler

def evaluate(model, loader) -> Dict[str, float]:
    model.eval()
    preds, labels = [], []
    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
            logits = model(**batch).logits
            preds.extend(torch.argmax(logits, dim=-1).detach().cpu().tolist())
            labels.extend(batch["labels"].detach().cpu().tolist())
    acc = accuracy_score(labels, preds)
    p, r, f1, _ = precision_recall_fscore_support(labels, preds, average="macro", zero_division=0)
    return {"acc": acc, "precision": p, "recall": r, "f1": f1}

def train_one_run(hp: Dict) -> Tuple[str, Dict[str, float]]:
    """
    hp keys: run_name, num_unfreeze_last_layers, lr, weight_decay, epochs, patience, trial_number
    """
    run_name = hp["run_name"]
    num_unfreeze = int(hp["num_unfreeze_last_layers"])
    lr = float(hp["lr"])
    wd = float(hp["weight_decay"])
    epochs = int(hp["epochs"])
    patience = int(hp["patience"])

    model = build_model(num_unfreeze)
    total_steps = int(math.ceil(len(train_loader) * epochs))
    optimizer, scheduler = get_optimizer_scheduler(model, total_steps, lr, wd)

    scaler = GradScaler(enabled=(DEVICE == "cuda" and USE_AMP))
    best_metric = -1.0
    best_path = os.path.join(DRIVE_OUT_DIR, f"best_{run_name}.pt")
    no_improve = 0

    wandb_run = wandb.init(
        project=PROJECT,
        name=run_name,
        config={
            "model": MODEL_NAME,
            "max_len": MAX_LEN,
            "batch_size": BATCH_SIZE,
            "epochs": epochs,
            "lr": lr,
            "weight_decay": wd,
            "warmup_ratio": WARMUP_RATIO,
            "grad_clip": GRAD_CLIP,
            "num_unfreeze_last_layers": num_unfreeze,
        },
        reinit=True,
    )

    for epoch in range(epochs):
        model.train()
        t0 = time.time()
        running_loss = 0.0
        for step, batch in enumerate(train_loader):
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
            optimizer.zero_grad(set_to_none=True)
            with autocast(enabled=(DEVICE == "cuda" and USE_AMP)):
                outputs = model(**batch)
                loss = outputs.loss
            scaler.scale(loss).backward()
            if GRAD_CLIP is not None:
                scaler.unscale_(optimizer)
                torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
            scaler.step(optimizer); scaler.update(); scheduler.step()
            running_loss += loss.item()

            if step % 20 == 0:
                wandb.log({"train/loss": loss.item(), "step": step + 1, "epoch": epoch + 1})

        # epoch-end validation
        val_metrics = evaluate(model, val_loader)
        elapsed = time.time() - t0

        epoch_loss = running_loss / max(1, len(train_loader))
        current_lr = scheduler.get_last_lr()[0]
        wandb.log({
            "train/epoch_loss": epoch_loss,
            "val/acc": val_metrics["acc"],
            "val/precision": val_metrics["precision"],
            "val/recall": val_metrics["recall"],
            "val/f1": val_metrics["f1"],
            "lr": current_lr,
            "time/epoch_sec": elapsed,
            "epoch": epoch + 1,
        })

        # Early stopping on val f1
        if val_metrics["f1"] > best_metric:
            best_metric = val_metrics["f1"]
            torch.save(model.state_dict(), best_path)
            no_improve = 0
            wandb_run.summary["best_val_f1"] = best_metric
            wandb_run.summary["best_checkpoint_path"] = best_path
        else:
            no_improve += 1
            if no_improve >= patience:
                print(f"Early stopping at epoch {epoch+1}")
                break

        print(f"Epoch {epoch+1}/{epochs} | "
              f"loss={epoch_loss:.4f} | "
              f"val_acc={val_metrics['acc']:.4f} | val_f1={val_metrics['f1']:.4f} | time={elapsed:.1f}s")

    wandb.finish()

    # Load best and return path + metrics on val for reference
    model.load_state_dict(torch.load(best_path, map_location=DEVICE))
    final_val = evaluate(model, val_loader)
    return best_path, final_val

# -------------------------
# Optuna hyperparameter tuning (ALWAYS ON)
# -------------------------
def objective(trial: optuna.trial.Trial):
    params = {
        "run_name": f"{BASE_RUN_NAME}_optuna_trial_{trial.number}",
        "num_unfreeze_last_layers": trial.suggest_int("num_unfreeze_last_layers", 1, 6),
        "lr": trial.suggest_float("lr", 1e-6, 5e-5, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-1, log=True),
        "epochs": trial.suggest_int("epochs", 4, 12),
        "patience": trial.suggest_int("patience", 1, 4),
        "trial_number": trial.number,
    }
    path, val_metrics = train_one_run(params)
    # report intermediate value for pruning if enabled
    trial.report(val_metrics["f1"], step=1)
    return val_metrics["f1"]

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=TRIALS, show_progress_bar=True)
print("Best trial:", study.best_trial.number, "F1:", study.best_value)
best_params = {"run_name": f"{BASE_RUN_NAME}_best_optuna", **study.best_trial.params}

# Retrain best config to get a clean checkpoint
best_ckpt, _ = train_one_run(best_params)
best_path = best_ckpt

# -------------------------
# Final evaluation on TEST (+ W&B logging)
# -------------------------
model = build_model(best_params["num_unfreeze_last_layers"])
model.load_state_dict(torch.load(best_path, map_location=DEVICE))
model.eval()

all_preds, all_labels = [], []
with torch.no_grad():
    for batch in test_loader:
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        logits = model(**batch).logits
        all_preds.extend(torch.argmax(logits, dim=-1).detach().cpu().tolist())
        all_labels.extend(batch["labels"].detach().cpu().tolist())

acc = accuracy_score(all_labels, all_preds)
p, r, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="macro", zero_division=0)
print(f"\nTEST | acc={acc:.4f} | f1_macro={f1:.4f} | precision_macro={p:.4f} | recall_macro={r:.4f}\n")

print("Per-class report (ids map to labels):")
print(ID2LABEL)
report = classification_report(
    all_labels, all_preds,
    target_names=[ID2LABEL[i] for i in range(len(ORDER))],
    zero_division=0, output_dict=True
)
print(classification_report(
    all_labels, all_preds,
    target_names=[ID2LABEL[i] for i in range(len(ORDER))],
    zero_division=0
))

# # ---- W&B: log test metrics, per-class scores, and confusion matrix ----
# test_run = wandb.init(project=PROJECT, name=f"{BASE_RUN_NAME}_test", resume="allow", reinit=True)
# log_payload = {
#     "test/acc": acc,
#     "test/precision_macro": p,
#     "test/recall_macro": r,
#     "test/f1_macro": f1,
# }
# for cls_name in ORDER:
#     if cls_name in report:
#         log_payload[f"test/{cls_name}/precision"] = report[cls_name]["precision"]
#         log_payload[f"test/{cls_name}/recall"]    = report[cls_name]["recall"]
#         log_payload[f"test/{cls_name}/f1"]        = report[cls_name]["f1-score"]

# wandb.log(log_payload)

# cm = confusion_matrix(all_labels, all_preds, labels=list(range(len(ORDER))))
# wandb.log({
#     "test/confusion_matrix": wandb.plot.confusion_matrix(
#         y_true=all_labels,
#         preds=all_preds,
#         class_names=[ID2LABEL[i] for i in range(len(ID2LABEL))]
#     )
# })
# test_run.summary["best_checkpoint_path"] = best_path
# test_run.summary["test_f1_macro"] = f1
# wandb.finish()


# 🔧 Refined Training Study – RoBERTa (EX4 Style)

From the previous runs we saw that **unfreezing more layers improved performance** (easy to observe from the W&B hyperparameter graphs 📈).  

In this refined study we make two important adjustments:  
- ⏳ **Fixed Epochs & Patience**: to keep results consistent and reduce randomness from early stopping.  
- 📉 **Expanded Learning Rate Range**: Optuna now searches up to **7e-5**, giving a wider window to explore.  

Additional details:  
- 🧊 **Unfrozen Layers**: the number of unfrozen layers is restricted to the [4–6] range.  
- ⚖️ **Weight Decay**: still optimized in [1e-6, 1e-1].  
- 📦 **Batch Size**: multiple options considered [4, 8, 16, 32, 64].  

This setup should allow us to better capture the trade-off between **learning rate**, **frozen layers**, and **batch size**, while keeping training length controlled.
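With epochs fixed, batch size still changes how many optimizer steps (and therefore warmup steps) a run takes. The sketch below shows that arithmetic. The helper `schedule_sizes` and the training-set size `N_TRAIN = 10_000` are hypothetical values for illustration, not figures from this dataset.

```python
import math


def schedule_sizes(n_examples, batch_size, epochs, warmup_ratio=0.06):
    """Return (steps_per_epoch, total_steps, warmup_steps) for one run."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = int(total_steps * warmup_ratio)
    return steps_per_epoch, total_steps, warmup_steps


N_TRAIN = 10_000  # hypothetical training-set size
for bs in (4, 8, 16, 32, 64):
    print(bs, schedule_sizes(N_TRAIN, bs, epochs=12))
```

A batch size of 4 takes 16 times as many updates per epoch as a batch size of 64, so the two settings see very different numbers of gradient steps at the same learning rate.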


In [None]:

PROJECT = "adv-dl-p2"


# -------------------------
# Optuna hyperparameter tuning (ALWAYS ON)
# -------------------------

# Constants
FIXED_EPOCHS = 12
FIXED_PATIENCE = 4


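def objective(trial: optuna.trial.Trial):
    # train_one_run reads the module-level train_loader and BATCH_SIZE, so both
    # are rebound here; otherwise the sampled batch size would never be applied.
    global train_loader, BATCH_SIZE
    params = {
        "run_name": f"{BASE_RUN_NAME}_optuna_trial_{trial.number}",
        "num_unfreeze_last_layers": trial.suggest_int("num_unfreeze_last_layers", 4, 6),
        "lr": trial.suggest_float("lr", 1e-6, 7e-5, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-1, log=True),
        "batch_size": trial.suggest_categorical("batch_size", [4, 8, 16, 32, 64]),
        "epochs": FIXED_EPOCHS,
        "patience": FIXED_PATIENCE,
        "trial_number": trial.number,
    }
    BATCH_SIZE = params["batch_size"]
    train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True,
                              collate_fn=collate_fn, num_workers=2, pin_memory=True)
    path, val_metrics = train_one_run(params)
    # report intermediate value for pruning if enabled
    trial.report(val_metrics["f1"], step=1)
    return val_metrics["f1"]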

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=TRIALS, show_progress_bar=True)
print("Best trial:", study.best_trial.number, "F1:", study.best_value)
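# study.best_trial.params holds only the sampled values, so the fixed
# epochs/patience are added back (train_one_run expects both keys).
best_params = {
    "run_name": f"{BASE_RUN_NAME}_best_optuna",
    **study.best_trial.params,
    "epochs": FIXED_EPOCHS,
    "patience": FIXED_PATIENCE,
}

# Retrain best config to get a clean checkpoint, using the best batch size
BATCH_SIZE = best_params["batch_size"]
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True,
                          collate_fn=collate_fn, num_workers=2, pin_memory=True)
best_ckpt, _ = train_one_run(best_params)
best_path = best_ckpt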

# -------------------------
# Final evaluation on TEST (+ W&B logging)
# -------------------------
model = build_model(best_params["num_unfreeze_last_layers"])
model.load_state_dict(torch.load(best_path, map_location=DEVICE))
model.eval()

all_preds, all_labels = [], []
with torch.no_grad():
    for batch in test_loader:
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        logits = model(**batch).logits
        all_preds.extend(torch.argmax(logits, dim=-1).detach().cpu().tolist())
        all_labels.extend(batch["labels"].detach().cpu().tolist())

acc = accuracy_score(all_labels, all_preds)
p, r, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="macro", zero_division=0)
print(f"\nTEST | acc={acc:.4f} | f1_macro={f1:.4f} | precision_macro={p:.4f} | recall_macro={r:.4f}\n")

print("Per-class report (ids map to labels):")
print(ID2LABEL)
report = classification_report(
    all_labels, all_preds,
    target_names=[ID2LABEL[i] for i in range(len(ORDER))],
    zero_division=0, output_dict=True
)
print(classification_report(
    all_labels, all_preds,
    target_names=[ID2LABEL[i] for i in range(len(ORDER))],
    zero_division=0
))

# # ---- W&B: log test metrics, per-class scores, and confusion matrix ----
# test_run = wandb.init(project=PROJECT, name=f"{BASE_RUN_NAME}_test", resume="allow", reinit=True)
# log_payload = {
#     "test/acc": acc,
#     "test/precision_macro": p,
#     "test/recall_macro": r,
#     "test/f1_macro": f1,
# }
# for cls_name in ORDER:
#     if cls_name in report:
#         log_payload[f"test/{cls_name}/precision"] = report[cls_name]["precision"]
#         log_payload[f"test/{cls_name}/recall"]    = report[cls_name]["recall"]
#         log_payload[f"test/{cls_name}/f1"]        = report[cls_name]["f1-score"]

# wandb.log(log_payload)

# cm = confusion_matrix(all_labels, all_preds, labels=list(range(len(ORDER))))
# wandb.log({
#     "test/confusion_matrix": wandb.plot.confusion_matrix(
#         y_true=all_labels,
#         preds=all_preds,
#         class_names=[ID2LABEL[i] for i in range(len(ID2LABEL))]
#     )
# })
# test_run.summary["best_checkpoint_path"] = best_path
# test_run.summary["test_f1_macro"] = f1
# wandb.finish()

# 📊 Training Results – Refined Study (RoBERTa EX4 Style)

✅ The results indicate that **small batch sizes** and a **higher number of unfrozen layers** lead to better performance.  
It is possible that unfreezing even more layers could further improve results, but we decided to **stick with 6 layers** to balance the **performance–computation trade-off**.  

---

### 🏆 Best Trial Hyperparameters
```json
{
  "num_unfreeze_last_layers": 6,
  "lr": 4.2813e-5,
  "weight_decay": 6.2665e-6,
  "batch_size": 4,
  "epochs": 12,
  "patience": 4
}
```



# 🏁 Final Training – RoBERTa-base (Best Optuna HP)

After completing our hyperparameter search, we now perform a **one-shot full training** run using the **best trial parameters**.  
All results are logged to **Weights & Biases (W&B)** for tracking.

---

## ⚙️ Final Hyperparameters
- **Model**: `roberta-base`  
- **Max length**: 512  
- **Batch size**: 4  
- **Learning rate**: 4.28e-5  
- **Weight decay**: 6.27e-6  
- **Unfrozen layers**: last 6  
- **Epochs**: 12 (with early stopping, patience = 4)  
- **Optimizer**: AdamW with linear warmup over the first 6% of steps, then linear decay  
- **Mixed Precision**: ✅ (AMP)  
- **Grad Clipping**: 1.0  
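The weight-decay value above applies only to part of the model: the training code exempts biases and LayerNorm weights by matching substrings in parameter names. A minimal sketch of that split on plain name strings follows. The helper `split_decay_groups` is hypothetical, and the names listed are a small illustrative sample of RoBERTa-style parameter names rather than the full model.

```python
def split_decay_groups(param_names, no_decay=("bias", "LayerNorm.weight")):
    """Partition parameter names into (weight-decayed, exempt) lists."""
    decayed, exempt = [], []
    for name in param_names:
        # A parameter is exempt if any no-decay marker appears in its name.
        if any(marker in name for marker in no_decay):
            exempt.append(name)
        else:
            decayed.append(name)
    return decayed, exempt


names = [  # illustrative sample of parameter names
    "roberta.encoder.layer.11.attention.self.query.weight",
    "roberta.encoder.layer.11.attention.self.query.bias",
    "roberta.encoder.layer.11.output.LayerNorm.weight",
    "classifier.dense.weight",
    "classifier.out_proj.bias",
]
decayed, exempt = split_decay_groups(names)
print("decay:", decayed)
print("no decay:", exempt)
```

In the actual optimizer setup the same test is applied to `model.named_parameters()`, restricted to parameters with `requires_grad=True`.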

---

## 📊 Training Setup
- 🔒 Freeze most RoBERTa layers, unfreeze **last 6** for fine-tuning.  
- 🏷 5-way sentiment classification head.  
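- ⏳ Early stopping with patience = 4. This run trains on all of `df_train` with no held-out validation split, so the monitored metric is macro-F1 on the training set.  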
- 📈 W&B logs include **loss curves, F1/accuracy per epoch, and confusion matrix**.  
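The patience rule used for early stopping can be isolated into a few lines. This is a sketch under stated assumptions: the `EarlyStopper` class and the per-epoch F1 scores are hypothetical, and the logic mirrors the strict "greater than best" comparison used in the training loops.

```python
class EarlyStopper:
    """Track the best score and count consecutive epochs without improvement."""

    def __init__(self, patience: int):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def update(self, metric: float):
        """Return (is_new_best, should_stop) for this epoch's metric."""
        if metric > self.best:  # ties do not count as improvement
            self.best = metric
            self.bad_epochs = 0
            return True, False
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience


scores = [0.60, 0.65, 0.64, 0.65, 0.63, 0.62, 0.70]  # hypothetical per-epoch F1
stopper = EarlyStopper(patience=4)
stopped_at = None
for epoch, f1 in enumerate(scores, start=1):
    is_best, stop = stopper.update(f1)
    if is_best:
        print(f"epoch {epoch}: new best {f1:.2f} (checkpoint would be saved here)")
    if stop:
        stopped_at = epoch
        break
print("stopped at epoch", stopped_at, "with best", stopper.best)
```

In this toy sequence the later score of 0.70 is never reached, which shows the cost of a short patience window.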

---

## ✅ Final Test Evaluation
At the end of training, we load the **best checkpoint** and evaluate on the test set:



# Load Best Model

In [2]:
# =========================
# ADV DL – One-shot RoBERTa training using best Optuna params
# Logs everything to W&B
# =========================

import os, math, time, random
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report, confusion_matrix


DRIVE_OUT_DIR = "adv_dl_models_final"
os.makedirs(DRIVE_OUT_DIR, exist_ok=True)

# -------------------------
# Constants / HPs
# -------------------------
MODEL_NAME = "roberta-base"
PROJECT = "adv-dl-p3"
RUN_NAME = "roberta_base_best_manual"

MAX_LEN = 512
BATCH_SIZE = 4
LR = 4.2813e-5
WEIGHT_DECAY = 6.2665e-6
NUM_UNFREEZE = 6
EPOCHS = 12
PATIENCE = 4
GRAD_CLIP = 1.0
WARMUP_RATIO = 0.06
USE_AMP = True
SEED = 42
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def set_seed(seed=42):
    random.seed(seed); np.random.seed(seed)
    torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(SEED)

# -------------------------
# Labels
# -------------------------
ORDER = ["extremely negative", "negative", "neutral", "positive", "extremely positive"]
LABEL2ID = {lab: i for i, lab in enumerate(ORDER)}
ID2LABEL = {i: lab for lab, i in LABEL2ID.items()}
CANON = {k: k for k in ORDER}


train_df = prep_df(df_train)
test_df  = prep_df(df_test)

# -------------------------
# Dataset & Tokenization
# -------------------------
class TweetDataset(Dataset):
    def __init__(self, df, tokenizer, max_len):
        self.texts = df["text"].tolist()
        self.labels = df["label"].tolist()
        self.tokenizer = tokenizer
        self.max_len = max_len
    def __len__(self): return len(self.texts)
    def __getitem__(self, idx):
        enc = self.tokenizer(self.texts[idx], truncation=True, max_length=self.max_len, padding=False)
        enc["labels"] = self.labels[idx]
        return {k: torch.tensor(v) for k, v in enc.items()}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
collate_fn = DataCollatorWithPadding(tokenizer)

train_ds = TweetDataset(train_df, tokenizer, MAX_LEN)
test_ds  = TweetDataset(test_df, tokenizer, MAX_LEN)

train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn, num_workers=2, pin_memory=True)
test_loader  = DataLoader(test_ds, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn, num_workers=2, pin_memory=True)

# -------------------------
# Model Setup (Freeze + Unfreeze last k)
# -------------------------
def build_model(num_unfreeze):
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=len(ORDER), id2label=ID2LABEL, label2id=LABEL2ID
    )
    base = getattr(model, "roberta", None) or getattr(model, "bert", None)
    if base:
        for p in base.parameters(): p.requires_grad = False
        if hasattr(base, "encoder") and hasattr(base.encoder, "layer"):
            for layer in base.encoder.layer[-num_unfreeze:]:
                for p in layer.parameters(): p.requires_grad = True
    for p in model.classifier.parameters(): p.requires_grad = True
    return model.to(DEVICE)

# -------------------------
# Optimizer + Scheduler
# -------------------------
def get_optimizer_scheduler(model, total_steps, lr, weight_decay):
    no_decay = ["bias", "LayerNorm.weight"]
    grouped = [
        {"params": [p for n, p in model.named_parameters() if p.requires_grad and not any(nd in n for nd in no_decay)], "weight_decay": weight_decay},
        {"params": [p for n, p in model.named_parameters() if p.requires_grad and any(nd in n for nd in no_decay)],  "weight_decay": 0.0},
    ]
    optimizer = torch.optim.AdamW(grouped, lr=lr)
    warmup_steps = int(WARMUP_RATIO * total_steps)
    scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)
    return optimizer, scheduler

# -------------------------
# Evaluation
# -------------------------
def evaluate(model, loader):
    model.eval()
    preds, labels = [], []
    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
            logits = model(**batch).logits
            preds.extend(logits.argmax(dim=-1).cpu().tolist())
            labels.extend(batch["labels"].cpu().tolist())
    acc = accuracy_score(labels, preds)
    p, r, f1, _ = precision_recall_fscore_support(labels, preds, average="macro", zero_division=0)
    return acc, p, r, f1, labels, preds

# -------------------------
# Train + Save Best
# -------------------------
model = build_model(NUM_UNFREEZE)
steps_per_epoch = math.ceil(len(train_loader))
total_steps = steps_per_epoch * EPOCHS
optimizer, scheduler = get_optimizer_scheduler(model, total_steps, LR, WEIGHT_DECAY)
scaler = GradScaler(enabled=(DEVICE == "cuda" and USE_AMP))

best_f1, no_improve = -1, 0
best_path = os.path.join(DRIVE_OUT_DIR, f"{RUN_NAME}.pt")

wandb.init(project=PROJECT, name=RUN_NAME, reinit=True, config={
    "lr": LR,
    "weight_decay": WEIGHT_DECAY,
    "batch_size": BATCH_SIZE,
    "epochs": EPOCHS,
    "patience": PATIENCE,
    "num_unfreeze": NUM_UNFREEZE
})

for epoch in range(EPOCHS):
    model.train()
    t0 = time.time()
    losses = []

    for step, batch in enumerate(train_loader):
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        optimizer.zero_grad(set_to_none=True)
        with autocast(enabled=(DEVICE == "cuda" and USE_AMP)):
            outputs = model(**batch)
            loss = outputs.loss
        scaler.scale(loss).backward()
        if GRAD_CLIP:
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
        scaler.step(optimizer); scaler.update(); scheduler.step()
        losses.append(loss.item())

    avg_loss = np.mean(losses)
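    # No validation split exists in this run, so checkpoint selection and
    # early stopping below are driven by macro-F1 on the training set.
    acc, p, r, f1, _, _ = evaluate(model, train_loader)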
    wandb.log({
        "train/loss": avg_loss,
        "train/acc": acc,
        "train/precision": p,
        "train/recall": r,
        "train/f1": f1,
        "epoch": epoch + 1
    })

    if f1 > best_f1:
        best_f1 = f1
        no_improve = 0
        torch.save(model.state_dict(), best_path)
    else:
        no_improve += 1
        if no_improve >= PATIENCE:
            print(f"Early stopping at epoch {epoch+1}")
            break

# -------------------------
# Final Test Evaluation + W&B Logging
# -------------------------
model.load_state_dict(torch.load(best_path, map_location=DEVICE))
acc, p, r, f1, all_labels, all_preds = evaluate(model, test_loader)

print(f"\nTEST METRICS:")
print(f"Accuracy: {acc:.4f}")
print(f"F1 Macro: {f1:.4f}")
print(f"Precision: {p:.4f}")
print(f"Recall: {r:.4f}")

report = classification_report(
    all_labels, all_preds,
    target_names=[ID2LABEL[i] for i in range(len(ORDER))],
    zero_division=0, output_dict=True
)

wandb.log({
    "test/acc": acc,
    "test/precision_macro": p,
    "test/recall_macro": r,
    "test/f1_macro": f1,
    "test/confusion_matrix": wandb.plot.confusion_matrix(
        y_true=all_labels,
        preds=all_preds,
        class_names=[ID2LABEL[i] for i in range(len(ID2LABEL))]
    )
})
for cls in ORDER:
    if cls in report:
        wandb.log({
            f"test/{cls}/precision": report[cls]["precision"],
            f"test/{cls}/recall": report[cls]["recall"],
            f"test/{cls}/f1": report[cls]["f1-score"]
        })

wandb.finish()



Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  scaler = GradScaler(enabled=(DEVICE == "cuda" and USE_AMP))




  with autocast(enabled=(DEVICE == "cuda" and USE_AMP)):



TEST METRICS:
Accuracy: 0.7725
F1 Macro: 0.7794
Precision: 0.7786
Recall: 0.7805


Run history:
epoch: ▁▂▂▃▄▄▅▅▆▇▇█
test/acc: ▁
test/extremely negative/f1: ▁
test/extremely negative/precision: ▁
test/extremely negative/recall: ▁
test/extremely positive/f1: ▁
test/extremely positive/precision: ▁
test/extremely positive/recall: ▁
test/f1_macro: ▁
test/negative/f1: ▁

Run summary:
epoch: 12.0
test/acc: 0.77251
test/extremely negative/f1: 0.80336
test/extremely negative/precision: 0.79933
test/extremely negative/recall: 0.80743
test/extremely positive/f1: 0.81657
test/extremely positive/precision: 0.82705
test/extremely positive/recall: 0.80634
test/f1_macro: 0.77942
test/negative/f1: 0.74709


Creating a detailed classification report

In [2]:
import torch
import pandas as pd
from sklearn.metrics import classification_report
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
)
# from google.colab import drive  # only needed when running on Colab

# -----------------------------
# Mount Google Drive
# -----------------------------
# drive.mount("/content/drive")
# -------------------------
# Constants
# -------------------------
MODEL_NAME = "roberta-base"
MODEL_PATH = "adv_dl_models_final/roberta_base_best_manual.pt"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MAX_LEN = 512
BATCH_SIZE = 4

ORDER = ["extremely negative", "negative", "neutral", "positive", "extremely positive"]
LABEL2ID = {lab: i for i, lab in enumerate(ORDER)}
ID2LABEL = {i: lab for i, lab in enumerate(ORDER)}  # enumerate yields (index, label)

# -------------------------
# Load Tokenizer + Model
# -------------------------
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=len(ORDER))
model.load_state_dict(torch.load(MODEL_PATH, map_location=DEVICE))
model.to(DEVICE)
model.eval()

# -------------------------
# Data Preparation
# -------------------------
def normalize_label(s: str) -> str:
    s = str(s).strip().lower()
    s = s.replace("very negative", "extremely negative")
    s = s.replace("very positive", "extremely positive")
    s = s.replace("extreme negative", "extremely negative")
    s = s.replace("extreme positive", "extremely positive")
    return s

def prep_df(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df = df.dropna(subset=["OriginalTweet", "Sentiment"])
    df["text"] = df["OriginalTweet"].astype(str).str.strip()
    df["label_name"] = df["Sentiment"].apply(normalize_label)
    df = df[df["label_name"].isin(ORDER)].reset_index(drop=True)
    df["label"] = df["label_name"].map(LABEL2ID)
    return df[["text", "label", "label_name"]]

# Load your test dataframe
df_test = pd.read_csv("Corona_NLP_test_cleaned_translated.csv")
test_df = prep_df(df_test)

class TweetDataset(Dataset):
    def __init__(self, df, tokenizer, max_len):
        self.texts = df["text"].tolist()
        self.labels = df["label"].tolist()
        self.tokenizer = tokenizer
        self.max_len = max_len
    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Tokenize without padding; DataCollatorWithPadding pads each batch dynamically.
        enc = self.tokenizer(self.texts[idx], truncation=True, max_length=self.max_len, padding=False)
        enc["labels"] = self.labels[idx]
        return {k: torch.tensor(v) for k, v in enc.items()}

test_ds = TweetDataset(test_df, tokenizer, MAX_LEN)
collate_fn = DataCollatorWithPadding(tokenizer)
test_loader = DataLoader(test_ds, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)

# -------------------------
# Evaluation
# -------------------------
all_preds, all_labels = [], []
with torch.no_grad():
    for batch in test_loader:
        input_ids = batch["input_ids"].to(DEVICE)
        attention_mask = batch["attention_mask"].to(DEVICE)
        labels = batch["labels"].to(DEVICE)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        preds = outputs.logits.argmax(dim=-1)

        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

# -------------------------
# Report
# -------------------------
print(classification_report(
    all_labels, all_preds,
    target_names=ORDER,
    zero_division=0
))


  from .autonotebook import tqdm as notebook_tqdm
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


                    precision    recall  f1-score   support

extremely negative       0.80      0.81      0.80       592
          negative       0.75      0.74      0.75      1041
           neutral       0.77      0.80      0.79       619
          positive       0.75      0.74      0.74       947
extremely positive       0.83      0.81      0.82       599

          accuracy                           0.77      3798
         macro avg       0.78      0.78      0.78      3798
      weighted avg       0.77      0.77      0.77      3798



# 📊 Final Test Results – RoBERTa Run

---

## ✅ TEST METRICS
- **Accuracy**: 0.7725  
- **F1 Macro**: 0.7794  
- **Precision Macro**: 0.7786  
- **Recall Macro**: 0.7805  

---

## 📑 Per-class Report
| Sentiment            | Precision | Recall | F1   | Support |
|-----------------------|-----------|--------|------|---------|
| Extremely Negative    | 0.80      | 0.81   | 0.80 | 592     |
| Negative              | 0.75      | 0.74   | 0.75 | 1041    |
| Neutral               | 0.77      | 0.80   | 0.79 | 619     |
| Positive              | 0.75      | 0.74   | 0.74 | 947     |
| Extremely Positive    | 0.83      | 0.81   | 0.82 | 599     |

**Accuracy**: 0.77 | **Macro Avg F1**: 0.78 | **Weighted Avg F1**: 0.77  
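
As a quick sanity check, the macro and weighted F1 averages in the summary line can be recomputed from the per-class table. The sketch below uses only the rounded values shown in the table, so it can differ in the third decimal from what `classification_report` computes on the raw predictions; the variable names are illustrative and not part of the notebook.

```python
# Per-class (F1, support) pairs copied from the rounded report table above.
per_class = {
    "extremely negative": (0.80, 592),
    "negative": (0.75, 1041),
    "neutral": (0.79, 619),
    "positive": (0.74, 947),
    "extremely positive": (0.82, 599),
}

total_support = sum(n for _, n in per_class.values())

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)

# Weighted average: each class contributes in proportion to its support.
weighted_f1 = sum(f1 * n for f1, n in per_class.values()) / total_support

print(f"total support: {total_support}")
print(f"macro F1:      {macro_f1:.2f}")
print(f"weighted F1:   {weighted_f1:.2f}")
```

The weighted average sits slightly below the macro average because the two largest classes (`negative` and `positive`) are also the two with the lowest F1.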

---

## 📈 Run Summary
- **Epochs**: 12  
- **Train Acc**: 0.9843  
- **Train F1**: 0.9844  
- **Train Loss**: 0.1739  

**Train vs Test Gap**: Training accuracy (~98%) is about 21 points higher than test accuracy (~77%), which points to **overfitting**.  

---

## 🧭 Key Observations
- `negative` (0.75 F1) and `positive` (0.74 F1) trail the extreme classes, while `neutral` (0.79 F1) is nearly on par with them.  
- **Extremely Negative / Extremely Positive** reach **0.80 and 0.82 F1**, suggesting the model separates strongly polarized tweets more easily.  
- Further regularization or balanced class weighting might help narrow the train/test gap.  
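
To make the class-weighting idea concrete, here is a minimal sketch of the common "balanced" heuristic, where each class weight is `n_samples / (n_classes * class_count)`. This was not part of the training run reported here. The helper name `balanced_class_weights` and the toy label list are hypothetical; in practice the weights would be computed from the training split's label ids.

```python
from collections import Counter

ORDER = ["extremely negative", "negative", "neutral", "positive", "extremely positive"]


def balanced_class_weights(labels, num_classes):
    """Return one weight per class id: n_samples / (num_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    return [
        n_samples / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


# Hypothetical toy label ids, for illustration only (not the real training data).
toy_labels = [0] * 2 + [1] * 4 + [2] * 2 + [3] * 3 + [4] * 1

weights = balanced_class_weights(toy_labels, len(ORDER))
for name, w in zip(ORDER, weights):
    print(f"{name:>18}: {w:.2f}")
```

Rarer classes receive larger weights. A list like this could then be converted to a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`, assuming the training loop computes the loss itself rather than relying on the loss returned by the model.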


Next, we evaluate the checkpoint from the best Optuna trial on the same test set to see whether it outperforms the model above.

In [6]:
# import torch
# from transformers import RobertaTokenizer, RobertaForSequenceClassification
# from torch.utils.data import DataLoader
# import pandas as pd
# from google.colab import drive
# from sklearn.metrics import classification_report


# # -----------------------------
# # PATHS & CONSTANTS
# # -----------------------------
# MODEL_PATH = "adv_dl_models/best_roberta-base_ex4_style_2_optuna_trial_4.pt"
# MODEL_NAME = "roberta-base"
# BATCH_SIZE = 4
# NUM_LABELS = 5

# # -----------------------------
# # Load Tokenizer and Model
# # -----------------------------
# tokenizer = RobertaTokenizer.from_pretrained(MODEL_NAME)
# model = RobertaForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)
# model.load_state_dict(torch.load(MODEL_PATH, map_location="cuda" if torch.cuda.is_available() else "cpu"))
# model.eval().to("cuda" if torch.cuda.is_available() else "cpu")

# # -----------------------------
# # Load Test Data
# # -----------------------------
# df_test = pd.read_csv("Corona_NLP_test_cleaned_translated.csv")

# label_map = {
#     "Extremely Negative": 0,
#     "Negative": 1,
#     "Neutral": 2,
#     "Positive": 3,
#     "Extremely Positive": 4,
# }

# texts = df_test["OriginalTweet"].tolist()
# labels = df_test["Sentiment"].map(label_map).tolist()

# # -----------------------------
# # Tokenize
# # -----------------------------
# encodings = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
# input_ids = encodings["input_ids"]
# attention_mask = encodings["attention_mask"]

# # -----------------------------
# # Dataset & Dataloader
# # -----------------------------
# class TestDataset(torch.utils.data.Dataset):
#     def __init__(self, input_ids, attention_mask, labels):
#         self.input_ids = input_ids
#         self.attention_mask = attention_mask
#         self.labels = labels

#     def __len__(self):
#         return len(self.labels)

#     def __getitem__(self, idx):
#         return {
#             "input_ids": self.input_ids[idx],
#             "attention_mask": self.attention_mask[idx],
#             "labels": torch.tensor(self.labels[idx]),
#         }

# dataset = TestDataset(input_ids, attention_mask, labels)
# loader = DataLoader(dataset, batch_size=BATCH_SIZE)

# # -----------------------------
# # Evaluation Loop
# # -----------------------------
# all_preds = []
# all_labels = []

# device = "cuda" if torch.cuda.is_available() else "cpu"
# with torch.no_grad():
#     for batch in loader:
#         input_ids = batch["input_ids"].to(device)
#         attention_mask = batch["attention_mask"].to(device)
#         labels = batch["labels"].to(device)

#         outputs = model(input_ids=input_ids, attention_mask=attention_mask)
#         preds = torch.argmax(outputs.logits, dim=1)

#         all_preds.extend(preds.cpu().numpy())
#         all_labels.extend(labels.cpu().numpy())

# # -----------------------------
# # Classification Report
# # -----------------------------
# print(classification_report(all_labels, all_preds, target_names=[
#     "Extremely Negative", "Negative", "Neutral", "Positive", "Extremely Positive"
# ]))


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


                    precision    recall  f1-score   support

Extremely Negative       0.81      0.75      0.78       592
          Negative       0.70      0.76      0.73      1041
           Neutral       0.85      0.78      0.81       619
          Positive       0.72      0.74      0.73       947
Extremely Positive       0.81      0.80      0.80       599

          accuracy                           0.76      3798
         macro avg       0.78      0.76      0.77      3798
      weighted avg       0.77      0.76      0.76      3798



# 📊 Manual Best vs Optuna Best – RoBERTa on the Test Set

We compare the **manually trained best checkpoint** (`roberta_base_best_manual.pt`) against the **best Optuna trial checkpoint** (`best_roberta-base_ex4_style_2_optuna_trial_4.pt`). Both reports below were produced on the same test set (3,798 tweets, identical per-class support).

---

## ✅ Manual Best Checkpoint (Test)
| Class                | Precision | Recall | F1-score | Support |
|-----------------------|-----------|--------|----------|---------|
| Extremely Negative    | 0.80      | 0.81   | 0.80     | 592     |
| Negative              | 0.75      | 0.74   | 0.75     | 1041    |
| Neutral               | 0.77      | 0.80   | 0.79     | 619     |
| Positive              | 0.75      | 0.74   | 0.74     | 947     |
| Extremely Positive    | 0.83      | 0.81   | 0.82     | 599     |

**Overall:**  
- Accuracy = **0.77**  
- Macro F1 = **0.78**  
- Weighted F1 = **0.77**

---

## 🔎 Optuna Best Trial (selected on validation, evaluated on test)
| Class                | Precision | Recall | F1-score | Support |
|-----------------------|-----------|--------|----------|---------|
| Extremely Negative    | 0.81      | 0.75   | 0.78     | 592     |
| Negative              | 0.70      | 0.76   | 0.73     | 1041    |
| Neutral               | 0.85      | 0.78   | 0.81     | 619     |
| Positive              | 0.72      | 0.74   | 0.73     | 947     |
| Extremely Positive    | 0.81      | 0.80   | 0.80     | 599     |

**Overall:**  
- Accuracy = **0.76**  
- Macro F1 = **0.77**  
- Weighted F1 = **0.76**

---

## 📌 Observation
- Both models are **very close** in macro F1 (0.78 vs 0.77).  
- The manual checkpoint is slightly **more balanced**, with higher F1 on *Negative* (0.75 vs 0.73) and *Positive* (0.74 vs 0.73).  
- The Optuna checkpoint is stronger on *Neutral* (0.81 vs 0.79 F1), driven by higher precision (0.85 vs 0.77).  
- Overall, the Optuna search did not yield a meaningful improvement: both checkpoints land at roughly **0.77–0.78 macro F1** on this test set.
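
The per-class comparison can be reproduced directly from the two tables above. This is a small sketch using the rounded F1 scores as printed; the dictionary names are illustrative and do not appear in the notebook.

```python
# Rounded per-class F1 scores copied from the two tables above.
final_f1 = {  # first table
    "extremely negative": 0.80,
    "negative": 0.75,
    "neutral": 0.79,
    "positive": 0.74,
    "extremely positive": 0.82,
}
optuna_f1 = {  # second table (Optuna best trial)
    "extremely negative": 0.78,
    "negative": 0.73,
    "neutral": 0.81,
    "positive": 0.73,
    "extremely positive": 0.80,
}

# Positive delta: the first model is better; negative delta: the Optuna model is better.
deltas = {cls: round(final_f1[cls] - optuna_f1[cls], 2) for cls in final_f1}

for cls, d in deltas.items():
    print(f"{cls:>18}: {d:+.2f}")
```

With differences of at most 0.02 F1 per class on rounded values, the gap between the two checkpoints is small enough that it may not be significant without repeated runs.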

---

