### Multimodal Model: Early Fusion (CLIP + DeBERTa)

This notebook establishes our baseline multimodal model for stance classification.

Architecture:
  - Text Branch:  DeBERTa
  - Image Branch: CLIP
  - Fusion:       Early Fusion (embedding-level: concat, mean, gated, or proj_concat)
  - Classifier:   Linear layer with dropout

Strategy:
  - Trained on the augmented training data (produced in previous notebooks)
  - No gating mechanism (all images used)
  - Simple embedding-level fusion (the reported run uses mean fusion)
  - Standard hyperparameters
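
As a shape-level illustration of embedding-level fusion, the sketch below fuses a dummy 768-dimensional text embedding with a dummy 512-dimensional image embedding, once by concatenation and once by projecting the image vector and averaging. The sizes match DeBERTa-v3-base and the CLIP ViT-B/32 projection used later in this notebook; the random tensors and layer names are placeholders, not the trained model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, TEXT_DIM, IMG_DIM = 4, 768, 512            # batch size and embedding sizes (illustrative)
text_emb = torch.randn(B, TEXT_DIM)            # stand-in for the DeBERTa [CLS] embedding
img_emb = torch.randn(B, IMG_DIM)              # stand-in for the projected CLIP image embedding

# Concatenation: keep both vectors side by side.
fused_concat = torch.cat([text_emb, img_emb], dim=1)       # [B, 1280]

# Mean: project the image vector to the text size, then average.
img_proj = nn.Linear(IMG_DIM, TEXT_DIM)
fused_mean = (text_emb + img_proj(img_emb)) / 2             # [B, 768]

# A single linear head over the fused vector (two stance classes).
classifier = nn.Linear(fused_mean.size(1), 2)
logits = classifier(fused_mean)

print(fused_concat.shape, fused_mean.shape, logits.shape)
```

Concatenation keeps the two modalities separate and lets the classifier weight them, while mean fusion forces both into the same 768-dimensional space before the classifier sees them.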

In [1]:
# Libraries
import os
import sys
import random
import warnings
warnings.filterwarnings("ignore")

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score,
    confusion_matrix, classification_report
)

from tqdm.auto import tqdm
from PIL import Image

from transformers import (
    AutoTokenizer, AutoModel,
    CLIPModel, CLIPProcessor,
    get_linear_schedule_with_warmup
)

# --- Optional: avoid HF tokenizers fork warnings + improve stability
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# -------------------------
# Reproducibility
# -------------------------
SEED = 42

def seed_everything(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)

    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)

    # Determinism (may slightly reduce speed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # Stronger determinism (PyTorch >= 1.8)
    # Note: some ops may throw if non-deterministic; if that happens, set to False.
    try:
        torch.use_deterministic_algorithms(True)
        os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    except Exception as e:
        print(f"[WARN] Deterministic algorithms not fully enabled: {e}")

seed_everything(SEED)

# -------------------------
# Device configuration
# -------------------------
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Seed:  {SEED}")
print(f"Using device: {DEVICE}")
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))


Seed:  42
Using device: cuda
GPU: NVIDIA H100 NVL


In [2]:
#Paths
DATA_PATH = "../../data/"
IMG_PATH = "../../data/images"
OUTPUT_DIR = "../../results/multimodal/baseline_multimodal/"
os.makedirs(OUTPUT_DIR, exist_ok=True)

train_path = os.path.join(DATA_PATH,"train_augmented.csv")
dev_path   = os.path.join(DATA_PATH,"dev.csv")
test_path  = os.path.join(DATA_PATH,"test.csv")

#Load Data
df_train = pd.read_csv(train_path)
df_dev   = pd.read_csv(dev_path)
df_test  = pd.read_csv(test_path)

# Map labels to ints
stance_2id = {"oppose": 0, "support": 1}
pers_2id = {"no": 0, "yes": 1}

for df in [df_train, df_dev, df_test]:
    df["label"] = df["stance"].map(stance_2id)
    df["persuasiveness_label"] = df["persuasiveness"].map(pers_2id)


print(f"\n Train label distribution:")
print(f"\n Stance: \n Oppose: {(df_train['label']==0).sum()}\n Support: {(df_train['label']==1).sum()}")
print(f"\n\n  Persuasiveness \n No: {(df_train['persuasiveness_label']==0).sum()}\n Yes: {(df_train['persuasiveness_label']==1).sum()}")


df_train.head()


 Train label distribution:

 Stance: 
 Oppose: 1095
 Support: 1095


  Persuasiveness 
 No: 1548
 Yes: 642


Unnamed: 0,tweet_id,tweet_url,tweet_text,stance,persuasiveness,split,label,persuasiveness_label
0,1148501065308004357,https://t.co/VQP1FHaWAg,Let's McGyver some Sanity in America!\n\nYou a...,support,no,train,1,0
1,1103872992537276417,https://t.co/zsyXYSeBkp,A child deserves a chance at life. A child des...,oppose,no,train,0,0
2,1151528583623585794_aug,https://t.co/qSWvDX5MnM,"Dear prolifers: girls as young as 10, 11, 12 a...",support,no,train,1,0
3,1100166844026109953,https://t.co/hxH8tFIHUu,The many States will attempt to amend their co...,support,no,train,1,0
4,1021830413550067713,https://t.co/5whvEEtoQR,"Every #abortion is wrong, no matter what metho...",oppose,yes,train,0,1


In [3]:
#Models
TEXT_MODEL_NAME = "microsoft/deberta-v3-base"
VISION_MODEL_NAME = "openai/clip-vit-base-patch32" 

In [4]:
# Training hyperparameters
BATCH_SIZE = 16
NUM_EPOCHS = 15
LEARNING_RATE = 2e-5
WEIGHT_DECAY = 1e-4
WARMUP_RATIO = 0.1

# Early stopping
PATIENCE = 5

# Text preprocessing
MAX_TEXT_LENGTH = 105

# Other
NUM_WORKERS = 0
PIN_MEMORY = torch.cuda.is_available()

print("Config:")
print("  TEXT_MODEL_NAME: ", TEXT_MODEL_NAME)
print("  VISION_MODEL_NAME:", VISION_MODEL_NAME)
print("  BATCH_SIZE:", BATCH_SIZE)
print("  NUM_EPOCHS:", NUM_EPOCHS)
print("  LR:", LEARNING_RATE)
print("  WD:", WEIGHT_DECAY)
print("  WARMUP_RATIO:", WARMUP_RATIO)
print("  PATIENCE:", PATIENCE)
print("  MAX_TEXT_LENGTH:", MAX_TEXT_LENGTH)
print("  NUM_WORKERS:", NUM_WORKERS)

Config:
  TEXT_MODEL_NAME:  microsoft/deberta-v3-base
  VISION_MODEL_NAME: openai/clip-vit-base-patch32
  BATCH_SIZE: 16
  NUM_EPOCHS: 15
  LR: 2e-05
  WD: 0.0001
  WARMUP_RATIO: 0.1
  PATIENCE: 5
  MAX_TEXT_LENGTH: 105
  NUM_WORKERS: 0
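
To make the warmup settings concrete, the sketch below reimplements the linear warmup and decay rule in plain Python. It is an illustration of what `get_linear_schedule_with_warmup` computes, not the library code. It assumes 137 batches per epoch, i.e. ceil(2190 / 16) for the 2,190-sample training set loaded above, and the values printed in the config (15 epochs, learning rate 2e-5, warmup ratio 0.1). The function name is hypothetical.

```python
import math

def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate after `step` optimizer updates: linear ramp-up, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

steps_per_epoch = math.ceil(2190 / 16)        # 137 batches per epoch
total_steps = steps_per_epoch * 15            # 2055 optimizer steps
warmup_steps = int(total_steps * 0.1)         # 205 warmup steps

for epoch in (1, 2, 3):
    lr = linear_warmup_lr(epoch * steps_per_epoch, 2e-5, warmup_steps, total_steps)
    print(f"after epoch {epoch}: lr = {lr:.2e}")
```

The first epoch ends while the schedule is still warming up (137 of 205 warmup steps), so the peak learning rate is only reached partway through the second epoch.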


###  Multimodal Dataset
We create a MultimodalDatasetCLIP that returns:
- tokenized text (input_ids, attention_mask)
- image (a raw PIL image under the pixel_values key; it is converted to a tensor later, in the collate function)
- label (stance)


Missing or corrupted images are handled safely by substituting a black 224x224 placeholder image.
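
A minimal standalone sketch of such a fallback is shown below. The helper name `load_image_or_placeholder` and the file path are hypothetical; the helper also returns a flag so that the number of placeholder images can be counted, which the dataset class in this notebook does not do.

```python
import os
import tempfile

from PIL import Image

def load_image_or_placeholder(path, size=(224, 224), fill=(0, 0, 0)):
    """Open `path` as RGB; on failure return a solid placeholder image and ok=False."""
    try:
        with Image.open(path) as img:
            return img.convert("RGB"), True
    except (OSError, ValueError):
        # Missing, truncated, or unidentifiable files end up here.
        return Image.new("RGB", size, color=fill), False

# Hypothetical path that is assumed not to exist.
missing = os.path.join(tempfile.gettempdir(), "no_such_tweet_image_12345.jpg")
image, ok = load_image_or_placeholder(missing)
print(ok, image.size, image.getpixel((0, 0)))
```

Counting the `ok=False` cases per split is a cheap way to check how many samples are effectively text-only.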

In [5]:
class MultimodalDatasetCLIP(Dataset):
    """
    Dataset for CLIP + DeBERTa multimodal learning.
    Returns tokenized text + processed image + label.
    """
    def __init__(self, dataframe: pd.DataFrame, img_dir: str, tokenizer, processor, max_length: int = 128):

        self.df = dataframe.reset_index(drop=True)
        self.img_dir = img_dir
        self.tokenizer = tokenizer
        self.processor = processor 
        self.max_length = max_length

        # Class distribution
        self.class_counts = self.df['label'].value_counts().to_dict()
        self.num_samples = len(self.df)
        print(f"  Dataset created: {self.num_samples} samples")
        print(f"    - Class 0 (oppose):  {self.class_counts.get(0, 0)}")
        print(f"    - Class 1 (support): {self.class_counts.get(1, 0)}")

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]

        # Load Image
        img_path = os.path.join(self.img_dir, str(row['tweet_id']) + ".jpg")
        try:
            image = Image.open(img_path).convert('RGB')
        except Exception:
            # Fall back to a black placeholder image if the file is missing or unreadable
            image = Image.new("RGB", (224, 224), color=(0, 0, 0))

        # Text
        text = str(row['tweet_text'])
        encoding = self.tokenizer(text,add_special_tokens=True,
                                  max_length=self.max_length,padding='max_length',
                                  truncation=True,return_tensors='pt')

        # Label
        label = row['label']

        return {
            'pixel_values': image,
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'label': torch.tensor(label, dtype=torch.long),
            'tweet_id': str(row['tweet_id']),
            'text': text
        }

    def get_class_weights(self, device):
        num_class_0 = self.class_counts.get(0, 0)
        num_class_1 = self.class_counts.get(1, 0)
        total = self.num_samples
        weights = torch.tensor([
            total / num_class_0 if num_class_0 > 0 else 1.0,
            total / num_class_1 if num_class_1 > 0 else 1.0], dtype=torch.float32).to(device)
        return weights
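
For reference, `get_class_weights` uses `total / class_count`. The plain-Python sketch below applies the same formula to two label distributions: the balanced 1095/1095 training set and, purely for comparison, the 127/73 dev split (the notebook itself only computes weights on the training set). The helper name is illustrative.

```python
def inverse_frequency_weights(counts):
    """Weight each class by total / count (same formula as get_class_weights)."""
    total = sum(counts.values())
    return {label: (total / n if n > 0 else 1.0) for label, n in counts.items()}

train_w = inverse_frequency_weights({"oppose": 1095, "support": 1095})
dev_w = inverse_frequency_weights({"oppose": 127, "support": 73})

print({k: round(v, 3) for k, v in train_w.items()})
print({k: round(v, 3) for k, v in dev_w.items()})
```

On the balanced training set both weights come out equal, so the weighted cross-entropy (with its default mean reduction) behaves like the unweighted loss; the weights would only matter for an imbalanced split.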

In [6]:
# Collate function for DataLoader
def collate_fn_clip(batch, processor):
    images = [item['pixel_values'] for item in batch]
    labels = torch.stack([item['label'] for item in batch])
    input_ids = torch.stack([item['input_ids'] for item in batch])
    attention_mask = torch.stack([item['attention_mask'] for item in batch])

    # Process images
    processed = processor(images=images, return_tensors="pt")
    
    return {
        'pixel_values': processed['pixel_values'],  # tensor [B, 3, 224, 224]
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'labels': labels}
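
One way to bind the processor to a collate function is `functools.partial`. A partial over a module-level function can be pickled, which matters if `num_workers` is raised above 0 on platforms that start workers with spawn. The sketch below uses a toy dataset and a stand-in `fake_processor` (both hypothetical) so that it runs without downloading CLIP.

```python
from functools import partial

import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Hypothetical dataset returning a raw 'image' (nested list) and a label."""
    def __len__(self):
        return 6

    def __getitem__(self, idx):
        return {"image": [[float(idx)] * 4] * 4, "label": torch.tensor(idx % 2)}

def fake_processor(images):
    """Stand-in for CLIPProcessor: turns a list of raw images into one batched tensor."""
    return {"pixel_values": torch.tensor(images)}

def collate(batch, processor):
    processed = processor([item["image"] for item in batch])
    labels = torch.stack([item["label"] for item in batch])
    return {"pixel_values": processed["pixel_values"], "labels": labels}

loader = DataLoader(ToyDataset(), batch_size=3,
                    collate_fn=partial(collate, processor=fake_processor))
batch = next(iter(loader))
print(batch["pixel_values"].shape, batch["labels"].tolist())
```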


In [7]:
# Tokenizer and Processor
tokenizer = AutoTokenizer.from_pretrained(TEXT_MODEL_NAME)
clip_processor = CLIPProcessor.from_pretrained(VISION_MODEL_NAME)
print("Tokenizer and Processor loaded.\n")

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


Tokenizer and Processor loaded.



In [8]:
train_dataset = MultimodalDatasetCLIP(df_train, IMG_PATH, tokenizer, clip_processor, MAX_TEXT_LENGTH)
dev_dataset   = MultimodalDatasetCLIP(df_dev, IMG_PATH, tokenizer, clip_processor, MAX_TEXT_LENGTH)
test_dataset  = MultimodalDatasetCLIP(df_test, IMG_PATH, tokenizer, clip_processor, MAX_TEXT_LENGTH)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=NUM_WORKERS,
                          pin_memory=PIN_MEMORY, collate_fn=lambda batch: collate_fn_clip(batch, clip_processor))
dev_loader = DataLoader(dev_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=NUM_WORKERS,
                        pin_memory=PIN_MEMORY, collate_fn=lambda batch: collate_fn_clip(batch, clip_processor))
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=NUM_WORKERS,
                         pin_memory=PIN_MEMORY, collate_fn=lambda batch: collate_fn_clip(batch, clip_processor))


  Dataset created: 2190 samples
    - Class 0 (oppose):  1095
    - Class 1 (support): 1095
  Dataset created: 200 samples
    - Class 0 (oppose):  127
    - Class 1 (support): 73
  Dataset created: 300 samples
    - Class 0 (oppose):  182
    - Class 1 (support): 118


In [9]:
# Quick sanity check batch
batch = next(iter(train_loader))
print("Batch keys:", batch.keys())
print("pixel_values:", batch["pixel_values"].shape)
print("input_ids:", batch["input_ids"].shape)
print("attention_mask:", batch["attention_mask"].shape)
print("labels:", batch["labels"].shape)

Batch keys: dict_keys(['pixel_values', 'input_ids', 'attention_mask', 'labels'])
pixel_values: torch.Size([16, 3, 224, 224])
input_ids: torch.Size([16, 105])
attention_mask: torch.Size([16, 105])
labels: torch.Size([16])


In [10]:
class MultimodalBaseline(nn.Module):
    def __init__(
        self,
        text_model_name="microsoft/deberta-v3-base",
        vision_model_name="openai/clip-vit-base-patch32",
        num_classes=2,
        freeze_text=False,
        freeze_vision=True,
        fusion_type="mean",
        dropout=0.1
    ):
        super().__init__()
        self.fusion_type = fusion_type


        # TEXT ENCODER
        print(f"Loading TEXT encoder: {text_model_name}")
        self.text_encoder = AutoModel.from_pretrained(text_model_name)
        self.text_hidden = self.text_encoder.config.hidden_size  # e.g., 768

        if freeze_text:
            for p in self.text_encoder.parameters():
                p.requires_grad = False
            print("Text encoder FROZEN")


        # VISION ENCODER (CLIP)
        print(f"Loading VISION encoder: {vision_model_name}")
        self.clip_model = CLIPModel.from_pretrained(vision_model_name)
        self.vision_model = self.clip_model.vision_model
        self.clip_hidden = self.clip_model.config.projection_dim  # 512

        if freeze_vision:
            for p in self.vision_model.parameters():
                p.requires_grad = False
            # Keep the last transformer layer (index 11) trainable
            for name, param in self.vision_model.named_parameters():
                if "encoder.layers.11" in name:
                    param.requires_grad = True
            print("CLIP encoder partially frozen (last layer trainable)")

        # Projections
        
        # Project vision embeddings to text_hidden (needed for mean & gated & proj_concat)
        self.vision_proj = nn.Linear(self.clip_hidden, self.text_hidden)

        # Gated fusion
        if fusion_type == "gated":
            self.gate = nn.Sequential(
                nn.Linear(self.text_hidden * 2, self.text_hidden),
                nn.Sigmoid()
            )

        # ----------------
        # CLASSIFIER
        # ----------------
        if fusion_type == "concat":
            fused_dim = self.text_hidden + self.clip_hidden
        elif fusion_type == "mean":
            fused_dim = self.text_hidden
        elif fusion_type == "gated":
            fused_dim = self.text_hidden
        elif fusion_type == "proj_concat":
            fused_dim = self.text_hidden * 2
        else:
            raise ValueError(f"Unknown fusion_type: {fusion_type}")

        self.classifier = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(fused_dim, num_classes))

        print(f"MultimodalBaseline initialized | Fusion={fusion_type} | fused_dim={fused_dim}")

    def forward(self, input_ids, attention_mask, images=None, mode="multimodal"):
 
        # TEXT EMBEDDINGS
        text_out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        text_emb = text_out.last_hidden_state[:, 0, :]  # CLS token [B, text_hidden]


        # VISION EMBEDDINGS
        if mode == "multimodal" and images is not None:
            vision_feat = self.vision_model(pixel_values=images)
            cls_embedding = vision_feat.last_hidden_state[:, 0, :]  # CLS token
            vision_emb = self.clip_model.visual_projection(cls_embedding)  # [B, clip_hidden]
        else:
            vision_emb = torch.zeros(text_emb.size(0), self.clip_hidden, device=text_emb.device)

       
        # EARLY FUSION
        if self.fusion_type == "concat":
            fused = torch.cat([text_emb, vision_emb], dim=1)
        elif self.fusion_type == "mean":
            vision_emb_proj = self.vision_proj(vision_emb)  # [B, text_hidden]
            fused = (text_emb + vision_emb_proj) / 2
        elif self.fusion_type == "gated":
            vision_emb_proj = self.vision_proj(vision_emb)  # [B, text_hidden]
            gate_input = torch.cat([text_emb, vision_emb_proj], dim=1)  # [B, 2*text_hidden]
            gate = self.gate(gate_input)  # [B, text_hidden], sigmoid outputs 0-1
            fused = gate * text_emb + (1 - gate) * vision_emb_proj  # [B, text_hidden]
        elif self.fusion_type == "proj_concat":
            fused = torch.cat([text_emb, self.vision_proj(vision_emb)], dim=1)
        else:
            raise ValueError(f"Unknown fusion_type: {self.fusion_type}")

        # CLASSIFIER
        logits = self.classifier(fused)
        return logits
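
A quick way to verify which parts of a model will receive gradients after freezing is to count trainable parameters per sub-module. The sketch below does this on a small stand-in network (the `TinyEncoder` layout and all names are hypothetical), using the same pattern as the class above: freeze everything, then re-enable the parameters of one named layer.

```python
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Hypothetical two-layer encoder used only to illustrate freezing."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(2)])

model = nn.ModuleDict({"encoder": TinyEncoder(), "classifier": nn.Linear(8, 2)})

# Freeze the whole encoder, then unfreeze only its last layer by name.
for p in model["encoder"].parameters():
    p.requires_grad = False
for name, p in model["encoder"].named_parameters():
    if name.startswith("layers.1."):
        p.requires_grad = True

def trainable_summary(module):
    """Return {child_name: (trainable_params, total_params)}."""
    return {
        name: (sum(p.numel() for p in child.parameters() if p.requires_grad),
               sum(p.numel() for p in child.parameters()))
        for name, child in module.named_children()
    }

print(trainable_summary(model))
```

Running the same kind of summary on the real model before training makes it easy to confirm that the `freeze_text` and `freeze_vision` flags had the intended effect.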


### Training 

In [11]:
def train_multimodal_model(
    model,
    train_loader,
    dev_loader,
    num_epochs=10,
    learning_rate=2e-5,
    weight_decay=1e-4,
    warmup_ratio=0.1,
    mode="multimodal",
    patience=3,
    device=DEVICE,
    save_path=None
):
    model = model.to(device)

    # -------------------------
    # CLASS WEIGHTS (IMBALANCE)
    # -------------------------
    class_weights = train_loader.dataset.get_class_weights(device)
    print(
        f"Class weights: "
        f"oppose={class_weights[0]:.3f}, "
        f"support={class_weights[1]:.3f}"
    )

    criterion = nn.CrossEntropyLoss(weight=class_weights)

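    # -------------------------
    # OPTIMIZER (PARAM GROUPS)
    # -------------------------
    # NOTE: only the text encoder, the CLIP vision tower and the classifier are
    # passed to the optimizer. model.vision_proj, model.clip_model.visual_projection
    # and (for fusion_type="gated") model.gate are in no param group, so they are
    # never updated and keep their initial weights.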

    optimizer = torch.optim.AdamW(
        [
            {
                "params": model.text_encoder.parameters(),
                "lr": learning_rate
            },
            {
                "params": model.vision_model.parameters(),
                "lr": learning_rate * 0.5
            },
            {
                "params": model.classifier.parameters(),
                "lr": learning_rate * 2
            }
        ],
        weight_decay=weight_decay
    )

    # -------------------------
    # SCHEDULER
    # -------------------------
    num_training_steps = len(train_loader) * num_epochs
    num_warmup_steps = int(num_training_steps * warmup_ratio)

    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps
    )

    print(f"Total training steps: {num_training_steps}")
    print(f"Warmup steps: {num_warmup_steps}")

    # -------------------------
    # HISTORY
    # -------------------------
    history = {
        "train_loss": [],
        "train_f1": [],
        "dev_loss": [],
        "dev_f1": [],
        "learning_rates": []
    }

    # -------------------------
    # EARLY STOPPING
    # -------------------------
    best_f1 = 0.0
    best_model_state = None
    epochs_without_improvement = 0

    # =========================
    # TRAINING LOOP
    # =========================
    for epoch in range(num_epochs):
        print(f"\n{'=' * 60}")
        print(f"Epoch {epoch + 1}/{num_epochs}")
        print(f"{'=' * 60}")

        # -------- TRAIN --------
        model.train()
        train_loss = 0.0
        train_preds, train_labels = [], []

        for batch in tqdm(train_loader, desc="Training"):
            optimizer.zero_grad()

            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            images = batch["pixel_values"].to(device)
            labels = batch["labels"].to(device)

            if mode == "text_only":
                logits = model(input_ids=input_ids, attention_mask=attention_mask, images=None, mode="text_only")
            else:
                logits = model(input_ids=input_ids, attention_mask=attention_mask, images=images, mode="multimodal")

            loss = criterion(logits, labels)
            loss.backward()

            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            optimizer.step()
            scheduler.step()

            train_loss += loss.item() * labels.size(0)
            preds = torch.argmax(logits, dim=1)

            train_preds.extend(preds.cpu().numpy())
            train_labels.extend(labels.cpu().numpy())

        train_loss /= len(train_loader.dataset)
        train_f1 = f1_score(train_labels,train_preds,average="binary",pos_label=1,zero_division=0)

        # -------- VALIDATION --------
        model.eval()
        dev_loss = 0.0
        dev_preds, dev_labels = [], []

        with torch.no_grad():
            for batch in tqdm(dev_loader, desc="Validation", leave=False):
                input_ids = batch["input_ids"].to(device)
                attention_mask = batch["attention_mask"].to(device)
                images = batch["pixel_values"].to(device)
                labels = batch["labels"].to(device)

                if mode == "text_only":
                    logits = model(input_ids=input_ids, attention_mask=attention_mask, images=None, mode="text_only")
                else:
                    logits = model(input_ids=input_ids, attention_mask=attention_mask, images=images, mode="multimodal")

                loss = criterion(logits, labels)
                dev_loss += loss.item() * labels.size(0)

                preds = torch.argmax(logits, dim=1)
                dev_preds.extend(preds.cpu().numpy())
                dev_labels.extend(labels.cpu().numpy())

        dev_loss /= len(dev_loader.dataset)
        dev_f1 = f1_score(dev_labels,dev_preds,average="binary",pos_label=1,zero_division=0)

        current_lr = scheduler.get_last_lr()[0]

        history["train_loss"].append(train_loss)
        history["train_f1"].append(train_f1)
        history["dev_loss"].append(dev_loss)
        history["dev_f1"].append(dev_f1)
        history["learning_rates"].append(current_lr)

        print(f"TRAIN LOSS: {train_loss:.4f} | F1: {train_f1:.4f}")
        print(f"DEV   LOSS: {dev_loss:.4f} | F1: {dev_f1:.4f}")
        print(f"LR: {current_lr:.2e}")

        # -------- EARLY STOPPING --------
        if dev_f1 > best_f1:
            best_f1 = dev_f1
            best_model_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
            epochs_without_improvement = 0
            print(f"  New best DEV F1: {best_f1:.4f}")

            if save_path is not None:
                torch.save(best_model_state, save_path)
                print(f"   Saved best checkpoint: {save_path}")

        else:
            epochs_without_improvement += 1
            print(f"   No improvement ({epochs_without_improvement}/{patience})")

            if epochs_without_improvement >= patience:
                print("  Early stopping triggered.")
                break

    if best_model_state is None and save_path is not None and os.path.exists(save_path):
        print("[WARN] best_model_state is None; loading from disk checkpoint.")
        best_model_state = torch.load(save_path, map_location="cpu")
    
    if best_model_state is not None:
        model.load_state_dict(best_model_state)
    
    model = model.to(device)
    print(f"\nBest DEV F1: {best_f1:.4f}")
    
    return model, history
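
The early-stopping rule in the loop above can be isolated into a few lines. The sketch below replays a list of per-epoch dev F1 scores (the values are the ones printed in the training log of this notebook) with a patience of 5; the function name is illustrative.

```python
def replay_early_stopping(scores, patience):
    """Return (best_score, best_epoch, stop_epoch) for strict-improvement early stopping."""
    best, best_epoch, waited = 0.0, None, 0
    for epoch, score in enumerate(scores, start=1):
        if score > best:
            # New best: remember it and reset the patience counter.
            best, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best, best_epoch, epoch
    return best, best_epoch, len(scores)

dev_f1 = [0.6486, 0.8372, 0.8741, 0.8676, 0.8857, 0.9167,
          0.8794, 0.8333, 0.8777, 0.8696, 0.8031]
print(replay_early_stopping(dev_f1, patience=5))
```

With these scores the best epoch is 6 and training stops after epoch 11, five epochs without improvement later. Because the comparison is strict (`>`), an epoch that merely ties the best score counts as no improvement.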

### Evaluation 

In [12]:
def evaluate_argmax(model, dataloader, mode="multimodal", device=DEVICE, verbose=True):
    model.eval()
    all_labels = []
    all_preds = []

    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating", disable=not verbose):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            images = batch["pixel_values"].to(device)
            labels = batch["labels"].to(device)

            if mode == "text_only":
                logits = model(input_ids=input_ids, attention_mask=attention_mask, images=None, mode="text_only")
            else:
                logits = model(input_ids=input_ids, attention_mask=attention_mask, images=images, mode="multimodal")

            preds = torch.argmax(logits, dim=1)

            all_labels.extend(labels.cpu().numpy())
            all_preds.extend(preds.cpu().numpy())

    y_true = np.array(all_labels)
    y_pred = np.array(all_preds)

    results = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="binary", pos_label=1, zero_division=0),
        "recall": recall_score(y_true, y_pred, average="binary", pos_label=1, zero_division=0),
        "f1": f1_score(y_true, y_pred, average="binary", pos_label=1, zero_division=0),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }
    return results

## Training & Testing

In [13]:
print("TRAINING (replication run) — DeBERTa + CLIP | fusion=mean | freeze_vision=False")

fusion = "mean"
model = MultimodalBaseline(
    text_model_name=TEXT_MODEL_NAME,
    vision_model_name=VISION_MODEL_NAME,
    num_classes=2,
    freeze_text=False,
    freeze_vision=False,
    fusion_type=fusion,
    dropout=0.1
).to(DEVICE)

ckpt_path = os.path.join(OUTPUT_DIR, f"best_deberta_clip_{fusion}.pth")

model, history = train_multimodal_model(
    model=model,
    train_loader=train_loader,
    dev_loader=dev_loader,
    num_epochs=NUM_EPOCHS,
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    warmup_ratio=WARMUP_RATIO,
    mode="multimodal",
    patience=PATIENCE,
    device=DEVICE,
    save_path=ckpt_path
)

print("\nEVALUATION ON TEST (argmax):")
test_res = evaluate_argmax(model, test_loader, mode="multimodal", device=DEVICE, verbose=False)
print(f"F1: {test_res['f1']:.4f} | P: {test_res['precision']:.4f} | R: {test_res['recall']:.4f} | Acc: {test_res['accuracy']:.4f}")

TRAINING (replication run) — DeBERTa + CLIP | fusion=mean | freeze_vision=True
Loading TEXT encoder: microsoft/deberta-v3-base
Loading VISION encoder: openai/clip-vit-base-patch32
MultimodalBaseline initialized | Fusion=mean | fused_dim=768
Class weights: oppose=2.000, support=2.000
Total training steps: 2055
Warmup steps: 205

Epoch 1/15



TRAIN LOSS: 0.6008 | F1: 0.5071
DEV   LOSS: 0.4879 | F1: 0.6486
LR: 1.34e-05
  New best DEV F1: 0.6486
   Saved best checkpoint: ../../results/multimodal/baseline_multimodal/best_deberta_clip_mean.pth

Epoch 2/15



TRAIN LOSS: 0.2753 | F1: 0.8938
DEV   LOSS: 0.2836 | F1: 0.8372
LR: 1.93e-05
  New best DEV F1: 0.8372
   Saved best checkpoint: ../../results/multimodal/baseline_multimodal/best_deberta_clip_mean.pth

Epoch 3/15



TRAIN LOSS: 0.1661 | F1: 0.9444
DEV   LOSS: 0.2596 | F1: 0.8741
LR: 1.78e-05
  New best DEV F1: 0.8741
   Saved best checkpoint: ../../results/multimodal/baseline_multimodal/best_deberta_clip_mean.pth

Epoch 4/15



TRAIN LOSS: 0.0631 | F1: 0.9804
DEV   LOSS: 0.4011 | F1: 0.8676
LR: 1.63e-05
   No improvement (1/5)

Epoch 5/15



TRAIN LOSS: 0.0200 | F1: 0.9936
DEV   LOSS: 0.4407 | F1: 0.8857
LR: 1.48e-05
  New best DEV F1: 0.8857
   Saved best checkpoint: ../../results/multimodal/baseline_multimodal/best_deberta_clip_mean.pth

Epoch 6/15



TRAIN LOSS: 0.0136 | F1: 0.9973
DEV   LOSS: 0.3669 | F1: 0.9167
LR: 1.33e-05
  New best DEV F1: 0.9167
   Saved best checkpoint: ../../results/multimodal/baseline_multimodal/best_deberta_clip_mean.pth

Epoch 7/15



TRAIN LOSS: 0.0055 | F1: 0.9991
DEV   LOSS: 0.4667 | F1: 0.8794
LR: 1.18e-05
   No improvement (1/5)

Epoch 8/15



TRAIN LOSS: 0.0060 | F1: 0.9982
DEV   LOSS: 0.5885 | F1: 0.8333
LR: 1.04e-05
   No improvement (2/5)

Epoch 9/15



TRAIN LOSS: 0.0012 | F1: 0.9995
DEV   LOSS: 0.4901 | F1: 0.8777
LR: 8.89e-06
   No improvement (3/5)

Epoch 10/15



TRAIN LOSS: 0.0026 | F1: 0.9991
DEV   LOSS: 0.5368 | F1: 0.8696
LR: 7.41e-06
   No improvement (4/5)

Epoch 11/15



TRAIN LOSS: 0.0012 | F1: 0.9991
DEV   LOSS: 0.7076 | F1: 0.8031
LR: 5.92e-06
   No improvement (5/5)
  Early stopping triggered.

Best DEV F1: 0.9167

EVALUATION ON TEST (argmax):
F1: 0.8178 | P: 0.7285 | R: 0.9322 | Acc: 0.8367
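
As a consistency check on the test metrics above, the sketch below recomputes them from confusion-matrix counts. The counts (TP=110, FP=41, FN=8, TN=141) are not printed by the notebook; they are inferred here from the logged precision and recall together with the 182/118 test class split, so treat them as an assumption.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 (positive class) and accuracy from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Inferred counts (assumption): 118 support tweets = 110 TP + 8 FN,
# 182 oppose tweets = 141 TN + 41 FP.
p, r, f1, acc = binary_metrics(tp=110, fp=41, fn=8, tn=141)
print(f"F1: {f1:.4f} | P: {p:.4f} | R: {r:.4f} | Acc: {acc:.4f}")
```

Under these counts the model predicts "support" 151 times against 118 true support tweets, which is what high recall combined with lower precision looks like in practice.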


Best Model:

- Fusion Mean

F1-Score:

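- 84.29 % (note: the replication run logged above reaches a test F1 of 81.78 %, with a best dev F1 of 91.67 %)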