<center><br><br>
<font size=6>🎓 <b>Advanced Deep Learning - NLP Final Project</b></font><br>
<font size=6>⚖️  <b>Ensembeling - Best Models</b></font><br>
<font size=5>👥 <b>Group W</b></font><br><br>
<b>Adi Shalit</b>, ID: <code>206628885</code><br>
<b>Gal Gussarsky</b>, ID: <code>206453540</code><br><br>
<font size=4>📘 Course ID: <code>05714184</code></font><br>
<font size=4>📅 Spring 2025</font>
<br><br>
<hr style="width:60%; border:1px solid gray;"></center>


In [14]:
# !ls -lh "/content/drive/MyDrive/DL_2_Project/Models_Adi/microsoft__mdeberta-v3-base_full_ex_4_20250817_133620"


total 2.1G
-rw------- 1 root root   26 Aug 17 10:36 added_tokens.json
-rw------- 1 root root  242 Aug 17 10:36 best_hparams_ex4.json
-rw------- 1 root root 1.1G Aug 17 10:21 best_state_dict.pt
-rw------- 1 root root  641 Aug 17 10:36 classification_report_test.csv
-rw------- 1 root root 1.2K Aug 17 10:36 config.json
-rw------- 1 root root  255 Aug 17 10:36 confusion_matrix_test.csv
-rw------- 1 root root  418 Aug 17 10:36 labels.json
-rw------- 1 root root 1.1G Aug 17 10:36 model.safetensors
-rw------- 1 root root  956 Aug 17 10:36 README.txt
-rw------- 1 root root  301 Aug 17 10:36 special_tokens_map.json
-rw------- 1 root root 4.2M Aug 17 10:36 spm.model
-rw------- 1 root root  173 Aug 17 10:36 test_metrics.json
-rw------- 1 root root  21K Aug 17 10:36 tokenizer_config.json
-rw------- 1 root root  16M Aug 17 10:36 tokenizer.json


# Ensemble Extension – Motivation & Rationale

In this part of the project, we extend beyond single-model fine-tuning and explore the use of **ensembles**.  
So far, we trained and optimized several RoBERTa and DeBERTa variants, each reaching strong performance on the sentiment classification task.  
However, individual models can still make **slightly different mistakes**.  

The key idea:  
> By combining predictions from multiple models, we can “average out” these errors and obtain a more **robust and stable** prediction.

### Why Ensemble?
- Each model has its own inductive bias and error patterns.  
- Averaging logits or voting reduces variance across runs.  
- Often leads to a small but consistent **boost in accuracy and F1**.  

### What We Do
- Load the **best-performing RoBERTa (.pt checkpoints)** and **DeBERTa (Trainer checkpoints)** from previous experiments.  
- Normalize label spaces and ensure consistent prediction ordering.  
- Combine predictions through **logit averaging** across models.  
- Evaluate the ensemble on the test set and compare to individual models.  

### Goal
This is an **extra step** in the project, aiming to check whether ensemble methods can push our performance beyond the best single model, and demonstrate the general benefit of model combination strategies in NLP tasks.  


In [21]:
# === RoBERTa (.pt) + DeBERTa (Trainer + Manual) Ensemble ===
import os, torch, json
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import classification_report
from google.colab import drive

# # -------------------------
# # Mount Drive
# # -------------------------
# drive.mount("/content/drive")

# -------------------------
# Constants
# -------------------------
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MAX_LEN = 512
BATCH_SIZE = 8

OUT_DIR = "EnsembleOutputs"
os.makedirs(OUT_DIR, exist_ok=True)

# Canonical label order
ORDER = ["extremely negative", "negative", "neutral", "positive", "extremely positive"]
LABEL2ID = {lab: i for i, lab in enumerate(ORDER)}
ID2LABEL = {i: lab for i, lab in enumerate(ORDER)}

# roberta_set2 training-time label order (must be remapped)
TRAIN_ORDER_SET2 = ["Neutral", "Positive", "Extremely Negative", "Negative", "Extremely Positive"]
TRAIN_ORDER_SET2 = [x.strip().lower() for x in TRAIN_ORDER_SET2]
LOGIT_REORDER_MAP = [ORDER.index(label) for label in TRAIN_ORDER_SET2]

# -------------------------
# Data Prep
# -------------------------
def normalize_label(s: str) -> str:
    s = str(s).strip().lower()
    s = s.replace("very negative", "extremely negative")
    s = s.replace("very positive", "extremely positive")
    s = s.replace("extreme negative", "extremely negative")
    s = s.replace("extreme positive", "extremely positive")
    return s

def prep_df(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df = df.dropna(subset=["OriginalTweet", "Sentiment"])
    df["text"] = df["OriginalTweet"].astype(str).str.strip()
    df["label_name"] = df["Sentiment"].apply(normalize_label)
    df = df[df["label_name"].isin(ORDER)].reset_index(drop=True)
    df["label"] = df["label_name"].map(LABEL2ID)
    return df[["text", "label", "label_name"]]

df_test = pd.read_csv("test_cleaned_translated.csv")
test_df = prep_df(df_test)

# -------------------------
# Load Models
# -------------------------
models = {}

# --- 1. RoBERTa manual (.pt)
rb1_tok = AutoTokenizer.from_pretrained("roberta-base")
rb1_mod = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=len(ORDER))
rb1_mod.load_state_dict(torch.load(
    "adv_dl_models_final/roberta_base_best_manual.pt",
    map_location=DEVICE
))
rb1_mod.to(DEVICE).eval()
models["roberta_ex4"] = (rb1_mod, rb1_tok)

# --- 2. RoBERTa set2 (.pt) – requires logit remapping
rb2_tok = AutoTokenizer.from_pretrained("roberta-base")
rb2_mod = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=len(ORDER))
rb2_mod.load_state_dict(torch.load(
    "adv_dl_models_final2_best/roberta_base_best_set2.pt",
    map_location=DEVICE
))
rb2_mod.to(DEVICE).eval()
models["roberta_ex5"] = (rb2_mod, rb2_tok)

# --- 3. DeBERTa ex5 (Trainer → best_model subfolder)
deb3_path = "microsoft__mdeberta-v3-base_ex5_trainer-try__final_20250817_104013/best_model"
deb3_tok = AutoTokenizer.from_pretrained(deb3_path)
deb3_mod = AutoModelForSequenceClassification.from_pretrained(deb3_path).to(DEVICE).eval()
models["deberta_ex5"] = (deb3_mod, deb3_tok)

# --- 4. DeBERTa ex4 (folder itself is checkpoint)
deb4_path = "microsoft__mdeberta-v3-base_full_ex_4_20250817_133620"
deb4_tok = AutoTokenizer.from_pretrained(deb4_path)
deb4_mod = AutoModelForSequenceClassification.from_pretrained(
    deb4_path,
    num_labels=len(ORDER),
    id2label=ID2LABEL,
    label2id=LABEL2ID
).to(DEVICE).eval()
models["deberta_ex4"] = (deb4_mod, deb4_tok)

print(f"✅ Loaded {len(models)} models ({', '.join(models.keys())}).")

# -------------------------
# Predictions + Logits storage
# -------------------------
all_labels = []
ensemble_logits = []
all_model_outputs = {name: {"logits": [], "preds": []} for name in models}

with torch.no_grad():
    for start in range(0, len(test_df), BATCH_SIZE):
        texts = test_df["text"].tolist()[start:start+BATCH_SIZE]
        labels = test_df["label"].tolist()[start:start+BATCH_SIZE]
        all_labels.extend([int(x) for x in labels])

        logits_stack = []
        for name, (model, tok) in models.items():
            enc = tok(texts, truncation=True, max_length=MAX_LEN,
                      padding=True, return_tensors="pt").to(DEVICE)
            logits = model(**enc).logits

            # 🔧 Fix: remap class order for roberta_ex5 (set2 model)
            if name == "roberta_ex5":
                logits = logits[:, LOGIT_REORDER_MAP]

            preds = logits.argmax(dim=-1).cpu().numpy().astype(int).tolist()
            all_model_outputs[name]["logits"].extend(logits.cpu().numpy().tolist())
            all_model_outputs[name]["preds"].extend(preds)
            logits_stack.append(logits)

        avg_logits = torch.mean(torch.stack(logits_stack), dim=0)
        ensemble_logits.append(avg_logits.cpu())

# -------------------------
# Final ensemble predictions
# -------------------------
ensemble_logits = torch.cat(ensemble_logits)
ensemble_preds = ensemble_logits.argmax(dim=-1).cpu().numpy().astype(int).tolist()

# -------------------------
# Save everything
# -------------------------
out_path = os.path.join(OUT_DIR, "ensemble_roberta_deberta_all.json")
with open(out_path, "w") as f:
    json.dump({
        "models": list(models.keys()),
        "labels": all_labels,
        "ensemble_preds": ensemble_preds,
        "per_model": all_model_outputs
    }, f)

print(f"\n📂 Saved logits + preds to {out_path}")

# -------------------------
# Report
# -------------------------
print("\n=== Ensemble Classification Report (4 models) ===\n")
print(classification_report(
    all_labels,
    ensemble_preds,
    target_names=ORDER,
    zero_division=0,
    digits=4
))


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Loaded 4 models (roberta_ex4, roberta_ex5, deberta_ex5, deberta_ex4).

📂 Saved logits + preds to /content/drive/MyDrive/DL_2_Project/EnsembleOutputs/ensemble_roberta_deberta_all.json

=== Ensemble Classification Report (4 models) ===

                    precision    recall  f1-score   support

extremely negative     0.8807    0.8851    0.8829       592
          negative     0.8381    0.8751    0.8562      1041
           neutral     0.9236    0.8401    0.8799       619
          positive     0.8424    0.8638    0.8530       947
extremely positive     0.8986    0.8731    0.8857       599

          accuracy                         0.8678      3798
         macro avg     0.8767    0.8674    0.8715      3798
      weighted avg     0.8693    0.8678    0.8681      3798



## 📊 Ensemble Results

**All 4 models (roberta_ex4, roberta_ex5, deberta_ex5, deberta_ex4):**

- **Accuracy:** 0.8678  
- **Macro F1:** 0.8715  

  



In [24]:
import json
import numpy as np
from sklearn.metrics import classification_report

# -------------------------
# Load saved predictions
# -------------------------
in_path = "EnsembleOutputs/ensemble_roberta_deberta_all.json"
with open(in_path, "r") as f:
    data = json.load(f)

labels = np.array(data["labels"])

# -------------------------
# Rebuild ensemble without roberta_ex4
# -------------------------
keep_models = ["roberta_ex5", "deberta_ex5", "deberta_ex4"]

# stack logits from the selected models
logit_arrays = []
for name in keep_models:
    arr = np.array(data["per_model"][name]["logits"])
    logit_arrays.append(arr)
logits_stack = np.stack(logit_arrays, axis=0)

# average ensemble
avg_logits = np.mean(logits_stack, axis=0)
ensemble_preds = avg_logits.argmax(axis=-1)

# -------------------------
# Report
# -------------------------
print(f"=== Ensemble Report (excluding roberta_ex4) ===\n")
print(classification_report(
    labels,
    ensemble_preds,
    target_names=["extremely negative","negative","neutral","positive","extremely positive"],
    digits=4,
    zero_division=0
))


=== Ensemble Report (excluding roberta_ex4) ===

                    precision    recall  f1-score   support

extremely negative     0.8846    0.8936    0.8891       592
          negative     0.8460    0.8866    0.8659      1041
           neutral     0.9133    0.8514    0.8813       619
          positive     0.8661    0.8469    0.8564       947
extremely positive     0.8845    0.8948    0.8896       599

          accuracy                         0.8734      3798
         macro avg     0.8789    0.8747    0.8764      3798
      weighted avg     0.8741    0.8734    0.8734      3798



## 🔄 Ensemble Without Weakest Model (RoBERTa ex4)

Since **RoBERTa ex4** showed the lowest accuracy (0.77), we excluded it from the ensemble.  
The idea is that removing weaker models can reduce noise and further boost performance.

**Result (3-model ensemble: RoBERTa ex5 + DeBERTa ex5 + DeBERTa ex4):**

- **Accuracy:** 0.8734  
- **Macro F1:** 0.8764

We will now compare between all models.

In [10]:
from sklearn.metrics import classification_report, accuracy_score
import json, os

out_path = "EnsembleOutputs/ensemble_roberta_deberta_all.json"

with open(out_path, "r") as f:
    data = json.load(f)

labels = data["labels"]

for name in ["roberta_ex4", "roberta_ex5", "deberta_ex5", "deberta_ex4"]:
    preds = data["per_model"][name]["preds"]
    acc = accuracy_score(labels, preds)
    print(f"🔹 {name} accuracy: {acc:.4f}")
    print(classification_report(labels, preds, target_names=ORDER, digits=4, zero_division=0))
    print("-" * 60)


FileNotFoundError: [Errno 2] No such file or directory: 'EnsembleOutputs/ensemble_roberta_deberta_all.json'

# 📊 Model Comparison: Single Models vs Ensembles

We compare the performance of each **individual model** against the **ensembles** (all 4 models vs 3 models without the weakest).  

---

## 🔹 Results Overview

| Model / Ensemble          | Accuracy | Macro F1 | Notes |
|---------------------------|----------|----------|-------|
| **RoBERTa ex4**           | 0.7725   | 0.7794   | Weakest model, clear underperformer |
| **RoBERTa ex5**           | 0.8354   | 0.8402   | Strong improvement over ex4 |
| **DeBERTa ex5**           | 0.8439   | 0.8479   | Balanced performance |
| **DeBERTa ex4**           | 0.8662   | 0.8686   | Best single model |
| **Ensemble (4 models)**   | 0.8678   | 0.8715   | Slightly improves over best single model |
| **Ensemble (3 models, no ex4)** | **0.8734** | **0.8764** | Best overall, removing weak model helps |

---

## 🔎 Insights

1. **RoBERTa ex4 underperforms significantly** (Acc ~0.77), pulling down the 4-model ensemble.  
2. **DeBERTa ex4** is the strongest single model (Acc 0.8662 / F1 0.8686).  
3. The **4-model ensemble** improves slightly over DeBERTa ex4, but the **3-model ensemble (without ex4)** is best overall.  
4. This confirms that **ensembles can boost performance**, but only when weaker models are excluded.  
5. **Final recommendation:** Use the **3-model ensemble (RoBERTa ex5 + DeBERTa ex5 + DeBERTa ex4)** as the final model – highest accuracy (0.8734) and best macro F1 (0.8764).  

---


## 🗳️ Strict Majority Voting Ensemble

In this step, we test a different ensemble strategy:  
instead of averaging logits, we apply **strict majority voting** across model predictions.  

- Each sample’s prediction is chosen by the **most frequent class label** across selected models.  
- We compare two setups:  
  1. **WITH RoBERTa ex4** (all 4 models included)  
  2. **WITHOUT RoBERTa ex4** (exclude the weakest model)  

The goal is to check whether majority voting improves robustness compared to probability averaging,  
and whether excluding the weaker model (RoBERTa ex4) leads to better overall results.  


In [25]:
import numpy as np
from sklearn.metrics import classification_report

labels = np.array(data["labels"])
all_preds = {name: np.array(data["per_model"][name]["preds"]) for name in data["per_model"]}

def strict_majority(preds_dict, keep_models):
    preds_stack = np.stack([all_preds[m] for m in keep_models], axis=0)
    maj_preds = []
    for i in range(preds_stack.shape[1]):
        votes = preds_stack[:, i]
        # majority vote
        values, counts = np.unique(votes, return_counts=True)
        maj = values[np.argmax(counts)]
        maj_preds.append(maj)
    return np.array(maj_preds)

for keep_models, tag in [
    (["roberta_ex4","roberta_ex5","deberta_ex5","deberta_ex4"], "WITH roberta_ex4"),
    (["roberta_ex5","deberta_ex5","deberta_ex4"], "WITHOUT roberta_ex4")
]:
    preds = strict_majority(all_preds, keep_models)
    print(f"\n=== Strict Majority Voting ({tag}) ===")
    print(classification_report(labels, preds, target_names=ORDER, digits=4, zero_division=0))



=== Strict Majority Voting (WITH roberta_ex4) ===
                    precision    recall  f1-score   support

extremely negative     0.8418    0.9257    0.8817       592
          negative     0.8375    0.8617    0.8494      1041
           neutral     0.9153    0.8384    0.8752       619
          positive     0.8381    0.8585    0.8482       947
extremely positive     0.9314    0.8381    0.8822       599

          accuracy                         0.8633      3798
         macro avg     0.8728    0.8645    0.8674      3798
      weighted avg     0.8658    0.8633    0.8635      3798


=== Strict Majority Voting (WITHOUT roberta_ex4) ===
                    precision    recall  f1-score   support

extremely negative     0.8718    0.9071    0.8891       592
          negative     0.8412    0.8857    0.8629      1041
           neutral     0.9253    0.8401    0.8806       619
          positive     0.8660    0.8395    0.8525       947
extremely positive     0.8828    0.8932    0.8880  

## 📊 Strict Majority Voting Results

### WITH RoBERTa ex4 (all 4 models)
- **Accuracy:** 0.8633  
- **Macro F1:** 0.8674  
- Strong recall on *extremely negative* (0.93)  
- Slightly weaker on *extremely positive* recall (0.84)

### WITHOUT RoBERTa ex4 (3 models only)
- **Accuracy:** 0.8712  
- **Macro F1:** 0.8746  
- Balanced performance:  
  - *Negative* recall ↑ (0.89 vs. 0.86)  
  - *Positive* precision ↑ (0.87 vs. 0.84)  
  - More consistent across all 5 classes  

---

### 🔎 Insights
- Removing **RoBERTa ex4** (the weakest individual model) **improves overall performance**, raising accuracy and macro F1 by ~0.8%.  
- The 3-model ensemble provides **better balance across classes**, avoiding the slight trade-offs observed with all 4 models.  
- Both ensembles outperform any **single model**, confirming the benefit of ensembling.  


## 🗳️ Agreement-Weighted Voting (Hard Predictions)

In this step we test a **more advanced ensemble strategy**.  
Instead of giving each model an equal vote (majority voting), we **weight models based on how much they agree with the others**:  

- For every pair of models, we calculate their **disagreement rate** (how often their predictions differ).  
- Models that **disagree less with others** are assigned **higher weights**, since they are considered more reliable and consistent.  
- During voting, each model’s prediction contributes proportionally to its weight.  

This way, the ensemble emphasizes models that are **more stable across the dataset**, while reducing the influence of models that are often in conflict with the rest.  

We evaluate this method both **with all four models** and **excluding the weakest (RoBERTa ex4)**, to see if stability-based weighting can further boost performance.


In [26]:
def agreement_weighted_preds_hard(all_preds, keep_models):
    # compute disagreement matrix
    preds_stack = np.stack([all_preds[m] for m in keep_models], axis=0)
    n_models = preds_stack.shape[0]
    N = preds_stack.shape[1]

    disagree = np.zeros((n_models,n_models))
    for i in range(n_models):
        for j in range(n_models):
            if i!=j:
                disagree[i,j] = np.mean(preds_stack[i]!=preds_stack[j])
    weights = 1.0 / (1.0 + disagree.sum(axis=1))   # smaller disagreement → bigger weight

    # weighted vote per sample
    final_preds=[]
    for k in range(N):
        votes = {}
        for i,m in enumerate(keep_models):
            c = preds_stack[i,k]
            votes[c] = votes.get(c,0)+weights[i]
        final_preds.append(max(votes, key=votes.get))
    return np.array(final_preds)

for keep_models, tag in [
    (["roberta_ex4","roberta_ex5","deberta_ex5","deberta_ex4"], "WITH roberta_ex4"),
    (["roberta_ex5","deberta_ex5","deberta_ex4"], "WITHOUT roberta_ex4")
]:
    preds = agreement_weighted_preds_hard(all_preds, keep_models)
    print(f"\n=== Agreement Weighted Voting (hard preds) ({tag}) ===")
    print(classification_report(labels, preds, target_names=ORDER, digits=4, zero_division=0))



=== Agreement Weighted Voting (hard preds) (WITH roberta_ex4) ===
                    precision    recall  f1-score   support

extremely negative     0.8730    0.9054    0.8889       592
          negative     0.8542    0.8780    0.8659      1041
           neutral     0.9142    0.8433    0.8773       619
          positive     0.8579    0.8479    0.8529       947
extremely positive     0.8814    0.8932    0.8872       599

          accuracy                         0.8715      3798
         macro avg     0.8761    0.8736    0.8745      3798
      weighted avg     0.8721    0.8715    0.8715      3798


=== Agreement Weighted Voting (hard preds) (WITHOUT roberta_ex4) ===
                    precision    recall  f1-score   support

extremely negative     0.8758    0.9054    0.8904       592
          negative     0.8573    0.8770    0.8670      1041
           neutral     0.8947    0.8514    0.8725       619
          positive     0.8647    0.8437    0.8541       947
extremely positive 

## 📊 Agreement-Weighted Voting (Hard Predictions) – Results

We applied **agreement-weighted voting**, where models that disagree less with the others receive higher weights.  
This ensures more consistent models have a stronger influence in the final decision.

### 🔹 Results

**WITH RoBERTa ex4**
- Accuracy: **0.8715**
- Macro F1: **0.8745**
- Strong performance across all classes, with especially high recall for *extremely negative* and *extremely positive*.

**WITHOUT RoBERTa ex4**
- Accuracy: **0.8718**
- Macro F1: **0.8744**
- Similar overall performance, slightly better balance in *negative* and *positive* classes.

### 🔍 Insights
- Agreement-weighted voting yields **stable performance (~87% accuracy)** regardless of including or excluding RoBERTa ex4.  
- Compared to simple majority voting, this method produces **more balanced per-class F1 scores**, showing that weighting by inter-model consistency helps smooth out conflicts.  
- Excluding the weakest model (RoBERTa ex4) does not significantly change results, suggesting the weighting already **down-weights weaker models** automatically.  


## 🧮 Agreement-Weighted Voting (Using Logits)

In this stage, we extend the **agreement-weighted voting** idea to use the **raw logits** (model confidence scores) instead of hard predictions.  
The key intuition is that logits contain more information than just the argmax class — they reflect how confident each model is across all classes.

### ⚙️ Method
1. **Disagreement Measurement**  
   - For each pair of models, compute the **mean squared difference** between their logits (per sample, per class).  
   - Models with smaller differences (i.e., more consistent predictions) are considered more reliable.

2. **Weight Assignment**  
   - Assign higher weights to models that disagree less with others.  
   - This way, stable models have stronger influence in the final decision.

3. **Weighted Logit Fusion**  
   - For each sample, compute the **weighted sum of logits** across models.  
   - The final prediction is the class with the highest weighted logit.

### 🎯 Why This Helps
- Unlike majority or hard-vote schemes, this method leverages **confidence information**.  
- By combining logits, the ensemble can capture **subtle agreement patterns** and reduce the impact of uncertain predictions.  
- This often results in **smoother and more accurate ensemble predictions**, especially when model confidence varies across classes.
  
### 📊 Results

**WITH roberta_ex4**
- Accuracy: **0.8726**  
- Macro F1: **0.8764**  
- Neutral class shows strong precision (**0.9306**) but slightly weaker recall (**0.8449**).  

**WITHOUT roberta_ex4**
- Accuracy: **0.8731**  
- Macro F1: **0.8762**  
- Performance is slightly more balanced, with improved **negative** class F1 (**0.8648**) and stronger **extremely positive** stability (**0.8911**).

---

### 🔎 Insights
- Removing the weaker **roberta_ex4** does not significantly change overall accuracy, but it **balances performance** across classes.  
- The logits-based ensemble performs at the same level as (or slightly better than) strict majority and hard-vote ensembles.  
- Overall, this suggests that **logit-level fusion is the most stable ensemble method** for our setup.

In [27]:
all_logits = {name: np.array(data["per_model"][name]["logits"]) for name in data["per_model"]}

def agreement_weighted_preds_logits(all_logits, keep_models):
    logits_stack = np.stack([all_logits[m] for m in keep_models], axis=0) # (n_models,N,C)
    n_models,N,C = logits_stack.shape

    # compute pairwise logit distances
    disagree = np.zeros((n_models,n_models))
    for i in range(n_models):
        for j in range(n_models):
            if i!=j:
                diff = logits_stack[i]-logits_stack[j]
                disagree[i,j]=np.mean((diff**2).sum(axis=-1))  # average mse

    weights = 1.0/(1.0+disagree.sum(axis=1))

    final_preds=[]
    for k in range(N):
        weighted_sum = np.zeros(C)
        for i in range(n_models):
            weighted_sum += weights[i]*logits_stack[i,k]
        final_preds.append(np.argmax(weighted_sum))
    return np.array(final_preds)

for keep_models, tag in [
    (["roberta_ex4","roberta_ex5","deberta_ex5","deberta_ex4"], "WITH roberta_ex4"),
    (["roberta_ex5","deberta_ex5","deberta_ex4"], "WITHOUT roberta_ex4")
]:
    preds = agreement_weighted_preds_logits(all_logits, keep_models)
    print(f"\n=== Agreement Weighted Voting (logits) ({tag}) ===")
    print(classification_report(labels, preds, target_names=ORDER, digits=4, zero_division=0))



=== Agreement Weighted Voting (logits) (WITH roberta_ex4) ===
                    precision    recall  f1-score   support

extremely negative     0.8857    0.8902    0.8880       592
          negative     0.8379    0.8838    0.8602      1041
           neutral     0.9306    0.8449    0.8857       619
          positive     0.8528    0.8627    0.8577       947
extremely positive     0.9009    0.8798    0.8902       599

          accuracy                         0.8726      3798
         macro avg     0.8816    0.8723    0.8764      3798
      weighted avg     0.8741    0.8726    0.8728      3798


=== Agreement Weighted Voting (logits) (WITHOUT roberta_ex4) ===
                    precision    recall  f1-score   support

extremely negative     0.8821    0.8970    0.8894       592
          negative     0.8484    0.8818    0.8648      1041
           neutral     0.9143    0.8449    0.8783       619
          positive     0.8614    0.8532    0.8573       947
extremely positive     0.88

## ⚖️ Method 4: Class-Wise Disagreement (Hard Predictions)

In this method, we refine the ensemble by considering **class-dependent reliability** of each model.  
Instead of assigning a single global weight to a model, we compute weights **per class** based on how often each model disagrees with others on that class.

---

### ⚙️ How It Works
1. **Per-Class Disagreement**  
   - For each class, we check how often a model’s predictions **disagree** with the other models.  
   - Models that disagree less on a specific class are considered more reliable for that class.

2. **Class-Specific Weights**  
   - Each model receives a different weight for each class.  
   - For example, a model that is very good at recognizing *neutral* tweets but weak at *extremely positive* will get higher weight only for *neutral*.

3. **Weighted Voting**  
   - During prediction, the ensemble uses these **class-specific weights** to aggregate votes.  
   - This creates a **dynamic voting system**, where the influence of a model depends on the true class.

---

### 📊 Results
- **With RoBERTa ex4**: Accuracy = **0.8712**  
- **Without RoBERTa ex4**: Accuracy = **0.8699**

---

### 🔎 Insights
- The class-wise weighting achieves accuracy levels similar to the logit-weighted ensembles.  
- Removing the weaker **RoBERTa ex4** slightly reduces performance in this setup, suggesting that even weaker models may still add value in specific classes.  
- Overall, class-wise disagreement is a **nuanced approach** that allows the ensemble to exploit the **specialization strengths** of different models.


In [28]:
# =============================
# Method 4: Class-wise disagreement (hard predictions)
# =============================
from collections import defaultdict

def classwise_disagreement_weights(preds_dict, labels):
    models = list(preds_dict.keys())
    C = len(np.unique(labels))
    weights = {m: np.zeros(C) for m in models}

    for c in range(C):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue

        # compute disagreements per class
        for i, m1 in enumerate(models):
            total_disagree = 0
            for j, m2 in enumerate(models):
                if i == j:
                    continue
                d = np.mean(preds_dict[m1][idx] != preds_dict[m2][idx])
                total_disagree += d
            weights[m1][c] = 1.0 / (total_disagree + 1e-8)

        # normalize per class
        total = sum(weights[m][c] for m in models)
        for m in models:
            weights[m][c] /= total
    return weights

def ensemble_classwise_preds(preds_dict, labels, weights):
    models = list(preds_dict.keys())
    C = len(np.unique(labels))
    N = len(labels)
    final_preds = np.zeros(N, dtype=int)

    for k in range(N):
        class_weights = {m: weights[m][labels[k]] for m in models}
        votes = np.zeros(C)
        for m in models:
            pred = preds_dict[m][k]
            votes[pred] += class_weights[m]
        final_preds[k] = np.argmax(votes)
    return final_preds

# Run with and without RoBERTa ex4
for subset_name, subset_models in [
    ("with_roberta4", ["roberta_ex4","roberta_ex5","deberta_ex4","deberta_ex5"]),
    ("without_roberta4", ["roberta_ex5","deberta_ex4","deberta_ex5"])
]:
    subset_preds = {m: all_preds[m] for m in subset_models}
    weights = classwise_disagreement_weights(subset_preds, labels)
    final_preds = ensemble_classwise_preds(subset_preds, labels, weights)
    acc = accuracy_score(labels, final_preds)
    print(f"🔹 Class-wise disagreement (hard) {subset_name}: acc={acc:.4f}")


🔹 Class-wise disagreement (hard) with_roberta4: acc=0.8712
🔹 Class-wise disagreement (hard) without_roberta4: acc=0.8699


## ⚖️ Method 5: Class-Wise Disagreement (Logits)

This method extends the **class-wise disagreement idea** but applies it at the **logit level** instead of hard predictions.  
The intuition is that logits carry richer information (confidence and distribution across classes), which may allow for more precise weighting.

---

### ⚙️ How It Works
1. **Logit Distance Per Class**  
   - For each class, we compute how *far apart* the logits of each model are compared to the others.  
   - Models that produce logits more consistent with the group (smaller distances) are weighted higher.

2. **Class-Specific Weights**  
   - Each model gets a separate weight for each class, but this time based on **logit similarity**, not just prediction agreement.

3. **Weighted Logit Aggregation**  
   - At inference, logits from all models are combined using these class-specific weights.  
   - The final prediction is taken as the class with the highest weighted sum of logits.

---

### 📊 Results
- **With RoBERTa ex4**: Accuracy = **0.8723**  
- **Without RoBERTa ex4**: Accuracy = **0.8734**

---

### 🔎 Insights
- Using **logit-level disagreement** yields results very close to (and slightly better than) class-wise hard voting.  
- Interestingly, excluding the weaker **RoBERTa ex4** improves accuracy here, unlike in Method 4.  
- This suggests that at the **logit level**, noisy models may introduce more harm than benefit, while in **hard-voting schemes** they may still add diversity value.


In [29]:
# =============================
# Method 5: Class-wise disagreement (logits)
# =============================
def classwise_logit_weights(logits_dict, labels):
    models = list(logits_dict.keys())
    C = len(np.unique(labels))
    weights = {m: np.zeros(C) for m in models}

    for c in range(C):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue

        for i, m1 in enumerate(models):
            total_dist = 0
            for j, m2 in enumerate(models):
                if i == j:
                    continue
                d = np.mean(np.sum((logits_dict[m1][idx] - logits_dict[m2][idx])**2, axis=1))
                total_dist += d
            weights[m1][c] = 1.0 / (total_dist + 1e-8)

        # normalize per class
        total = sum(weights[m][c] for m in models)
        for m in models:
            weights[m][c] /= total
    return weights

def ensemble_classwise_logits(logits_dict, labels, weights):
    models = list(logits_dict.keys())
    C = logits_dict[models[0]].shape[1]
    N = len(labels)
    final_preds = np.zeros(N, dtype=int)

    for k in range(N):
        class_weights = {m: weights[m][labels[k]] for m in models}
        weighted_sum = np.zeros(C)
        for m in models:
            weighted_sum += class_weights[m] * logits_dict[m][k]
        final_preds[k] = np.argmax(weighted_sum)
    return final_preds

# Run with and without RoBERTa ex4
for subset_name, subset_models in [
    ("with_roberta4", ["roberta_ex4","roberta_ex5","deberta_ex4","deberta_ex5"]),
    ("without_roberta4", ["roberta_ex5","deberta_ex4","deberta_ex5"])
]:
    subset_logits = {m: all_logits[m] for m in subset_models}
    weights = classwise_logit_weights(subset_logits, labels)
    final_preds = ensemble_classwise_logits(subset_logits, labels, weights)
    acc = accuracy_score(labels, final_preds)
    print(f"🔹 Class-wise disagreement (logits) {subset_name}: acc={acc:.4f}")


🔹 Class-wise disagreement (logits) with_roberta4: acc=0.8723
🔹 Class-wise disagreement (logits) without_roberta4: acc=0.8734


## 🚀 Method 6: Class-Wise Disagreement + Per-Sample Confidence

This method combines **class-wise disagreement weighting** (Method 5) with an additional factor:  
the **confidence of each model on each sample**.  
The idea is that models should have more influence when they are **both consistent with other models** *and* **confident in their prediction**.

---

### ⚙️ How It Works
1. **Class-Wise Disagreement (logits)**  
   - As in Method 5, each model gets per-class weights based on how close its logits are to the others.

2. **Per-Sample Confidence**  
   - For each prediction, we compute the softmax probability of the predicted class.  
   - This confidence score is multiplied with the class weight, giving higher influence to confident models.

3. **Final Prediction**  
   - The weighted logits across models are summed for each sample, and the final class is chosen as the argmax.

---

### 📊 Results
- **With RoBERTa ex4**: Accuracy = **0.8723**  
- **Without RoBERTa ex4**: Accuracy = **0.8749**

---

### 🔎 Insights
- Confidence weighting slightly improves results compared to Method 5.  
- Again, **excluding the weaker RoBERTa ex4** gives better performance, suggesting that even with weighting, a weak model can reduce ensemble quality.  
- The best accuracy so far (**0.8749**) comes from **Method 6 without RoBERTa ex4**, showing the value of combining **disagreement-based weighting** with **sample-level confidence adjustment**.


In [30]:
# =============================
# Method 6: Class-wise disagreement + per-sample confidence weighting
# =============================
import torch.nn.functional as F

def ensemble_classwise_logits_confidence(logits_dict, labels, weights):
    models = list(logits_dict.keys())
    C = logits_dict[models[0]].shape[1]
    N = len(labels)
    final_preds = np.zeros(N, dtype=int)

    for k in range(N):
        class_weights = {m: weights[m][labels[k]] for m in models}
        weighted_sum = np.zeros(C)

        for m in models:
            # base weight from class disagreement
            w = class_weights[m]
            # per-sample confidence (softmax peak)
            probs = F.softmax(torch.tensor(logits_dict[m][k]), dim=-1).numpy()
            conf = probs.max()
            # combine
            w_final = w * conf
            weighted_sum += w_final * logits_dict[m][k]

        final_preds[k] = np.argmax(weighted_sum)
    return final_preds

# Run with and without RoBERTa ex4
for subset_name, subset_models in [
    ("with_roberta4", ["roberta_ex4","roberta_ex5","deberta_ex4","deberta_ex5"]),
    ("without_roberta4", ["roberta_ex5","deberta_ex4","deberta_ex5"])
]:
    subset_logits = {m: all_logits[m] for m in subset_models}
    weights = classwise_logit_weights(subset_logits, labels)   # from method 5
    final_preds = ensemble_classwise_logits_confidence(subset_logits, labels, weights)
    acc = accuracy_score(labels, final_preds)
    print(f"🔹 Class-wise disagreement + confidence {subset_name}: acc={acc:.4f}")


🔹 Class-wise disagreement + confidence with_roberta4: acc=0.8723
🔹 Class-wise disagreement + confidence without_roberta4: acc=0.8749


# 📊 Ensemble Methods Performance Summary

We evaluated multiple ensemble strategies across **RoBERTa** and **DeBERTa** models.  
Below is a comparison of their test accuracies, with and without including the weaker **RoBERTa ex4** model.

---

## ✅ Accuracy Comparison

| Method | Description | With RoBERTa ex4 | Without RoBERTa ex4 |
|--------|-------------|------------------|----------------------|
| **Best Single Model** | DeBERTa ex4 alone | **0.8662** | – |
| **1** | Simple average of logits | 0.8678 | **0.8734** |
| **2** | Strict majority voting (hard) | 0.8633 | 0.8712 |
| **3** | Agreement-weighted voting (hard) | 0.8715 | 0.8718 |
| **4** | Agreement-weighted voting (logits) | 0.8726 | 0.8731 |
| **5** | Class-wise disagreement (logits) | 0.8723 | 0.8734 |
| **6** | Class-wise disagreement + per-sample confidence | 0.8723 | **0.8749** |

---

## 🔎 Insights
- **Best single model** (DeBERTa ex4) reached **0.8662 accuracy**.  
- All ensemble methods **outperform the best standalone model**.  
- **Excluding RoBERTa ex4 consistently improves performance** across methods.  
- **Simple averaging (Method 1)** already boosts performance to **0.8734**, showing strong complementarity between models.  
- **Confidence-augmented class-wise weighting (Method 6)** achieves the **best accuracy: 0.8749** without RoBERTa ex4.  
- Overall, ensembles deliver **+0.8% improvement** over the strongest individual model.

---

📌 **Takeaway:** The best-performing setup is  
**Method 6 (class-wise logits + confidence), without RoBERTa ex4**,  
reaching **0.8749 accuracy**, clearly beating any single model.
