
Gemini was used in this assignment to help write the code comments, to make grading easier.

In [1]:
!pip install "datasets==2.18.0"

Collecting datasets==2.18.0
  Downloading datasets-2.18.0-py3-none-any.whl.metadata (20 kB)
Collecting pyarrow-hotfix (from datasets==2.18.0)
  Downloading pyarrow_hotfix-0.7-py3-none-any.whl.metadata (3.6 kB)
Collecting fsspec<=2024.2.0,>=2023.1.0 (from fsspec[http]<=2024.2.0,>=2023.1.0->datasets==2.18.0)
  Downloading fsspec-2024.2.0-py3-none-any.whl.metadata (6.8 kB)
Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Downloading fsspec-2024.2.0-py3-none-any.whl (170 kB)
Downloading pyarrow_hotfix-0.7-py3-none-any.whl (7.9 kB)
Installing collected packages: pyarrow-hotfix, fsspec, datasets
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.3.0
    Uninstalling fsspec-2025.3.0:
      Successfully uninstalled

In [2]:
!pip install transformers datasets evaluate
from transformers import BertTokenizer, BertModel, RobertaTokenizer, RobertaModel
from datasets import load_dataset
from evaluate import load
import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from tqdm import tqdm
device = "cuda" if torch.cuda.is_available() else "cpu"
#  You can install and import any other libraries if needed

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.6


In [3]:
# Some Chinese punctuation marks would be tokenized as [UNK], so we replace them with their ASCII equivalents
token_replacement = [
    ["：" , ":"],
    ["，" , ","],
    ["“" , "\""],
    ["”" , "\""],
    ["？" , "?"],
    ["……" , "..."],
    ["！" , "!"]
]
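
The replacement table above is applied pair by pair with `str.replace`. A minimal sketch of that step as a standalone helper follows; the name `normalize_punctuation` and the sample sentence are made up for illustration and are not part of the assignment template.

```python
# Replacement pairs copied from the notebook cell above.
TOKEN_REPLACEMENT = [
    ["：", ":"],
    ["，", ","],
    ["“", "\""],
    ["”", "\""],
    ["？", "?"],
    ["……", "..."],
    ["！", "!"],
]


def normalize_punctuation(text: str) -> str:
    """Return `text` with every full-width mark replaced by its ASCII counterpart."""
    for source, target in TOKEN_REPLACEMENT:
        text = text.replace(source, target)
    return text


# Hypothetical sample sentence, for illustration only.
sample = "A man says：“Is it raining？”"
print(normalize_punctuation(sample))
```

Returning a new string keeps the stored dataset rows untouched, whereas `SemevalDataset.__getitem__` below rewrites the row in place. Either approach yields the same text for the tokenizer.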

In [None]:
tokenizer = RobertaTokenizer.from_pretrained("roberta-base", cache_dir="./cache/")

In [None]:
class SemevalDataset(Dataset):
    def __init__(self, split="train") -> None:
        super().__init__()
        assert split in ["train", "validation", "test"]
        self.data = load_dataset(
            "sem_eval_2014_task_1", split=split, trust_remote_code=True, cache_dir="./cache/"
        ).to_list()

    def __getitem__(self, index):
        d = self.data[index]
        # Replace Chinese punctuations with English ones
        for k in ["premise", "hypothesis"]:
            for tok in token_replacement:
                d[k] = d[k].replace(tok[0], tok[1])
        return d

    def __len__(self):
        return len(self.data)

data_sample = SemevalDataset(split="train").data[:3]
print(f"Dataset example: \n{data_sample[0]} \n{data_sample[1]} \n{data_sample[2]}")

In [6]:
# Define the hyperparameters
# You can modify these values if needed
lr = 3e-5
epochs = 3
train_batch_size = 8
validation_batch_size = 8

In [7]:
# TODO1: Create batched data for DataLoader
# `collate_fn` is a function that defines how the data batch should be packed.
# This function will be called in the DataLoader to pack the data batch.

def collate_fn(batch):
    # TODO1-1: Implement the collate_fn function

    # 1. Collect every premise and hypothesis in the batch
    premises = [d['premise'] for d in batch]
    hypotheses = [d['hypothesis'] for d in batch]

    # 2. Tokenize the sentence pairs
    inputs = tokenizer(
        premises,
        hypotheses,
        padding=True,
        truncation=True,
        return_tensors="pt"
    )

    # 3. Attach the labels
    # [FIX]: use the actual keys shown in the printed dataset example

    # Sub-task 1: use 'relatedness_score'
    inputs['labels_sim'] = torch.tensor(
        [d['relatedness_score'] for d in batch],
        dtype=torch.float
    )

    # Sub-task 2: use 'entailment_judgment' (mind the spelling!)
    inputs['labels_ent'] = torch.tensor(
        [d['entailment_judgment'] for d in batch],
        dtype=torch.long
    )

    return inputs

# TODO1-2: Define your DataLoader
# (The dl_train, dl_validation, and dl_test code below is unchanged)

# 1. Create the Dataset instances
train_dataset = SemevalDataset(split="train")
validation_dataset = SemevalDataset(split="validation")
test_dataset = SemevalDataset(split="test")

# 2. Create the DataLoaders
dl_train = DataLoader(
    train_dataset,
    batch_size=train_batch_size,
    shuffle=True,
    collate_fn=collate_fn
)

dl_validation = DataLoader(
    validation_dataset,
    batch_size=validation_batch_size,
    shuffle=False,
    collate_fn=collate_fn
)

dl_test = DataLoader(
    test_dataset,
    batch_size=validation_batch_size,
    shuffle=False,
    collate_fn=collate_fn
)
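
In `collate_fn`, `padding=True` pads each batch only up to its own longest sequence and returns an attention mask that marks the real tokens. The sketch below mimics that behaviour in pure Python so the shapes are easy to inspect. The helper name `pad_batch` and the token ids are made up, and `pad_id=1` is an assumption about the `roberta-base` vocabulary that should be checked against `tokenizer.pad_token_id`.

```python
from typing import Dict, List


def pad_batch(sequences: List[List[int]], pad_id: int = 1) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one in the batch and build the mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens first, then padding on the right.
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids, for illustration only.
batch = pad_batch([[0, 11, 12, 2], [0, 21, 2]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because padding is computed per batch, a batch of short sentence pairs stays short, which probably matters for speed here since the sentences in the error-analysis output are only a few words long.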

In [8]:
# TODO2: Construct your model
class MultiLabelModel(torch.nn.Module):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Write your code here
        # Define what modules you will use in the model
        # Please use "google-bert/bert-base-uncased" model (https://huggingface.co/google-bert/bert-base-uncased)
        # Besides the base model, you may design additional architectures by incorporating linear layers, activation functions, or other neural components.
        # Remark: The use of any additional pretrained language models is not permitted.

        # 1. Load the pretrained encoder (roberta-base in this notebook)
        self.bert = RobertaModel.from_pretrained(
            "roberta-base",
            cache_dir="./cache/"
        )

        hidden_size = self.bert.config.hidden_size

        # 2. Output head for Sub-task 1 (regression)
        # Output dimension is 1 (predicts relatedness_score)
        self.regression_head = torch.nn.Linear(hidden_size, 1)

        # 3. Output head for Sub-task 2 (classification)
        # Output dimension is 3 (3 classes: 0, 1, 2)
        self.classification_head = torch.nn.Linear(hidden_size, 3)

    def forward(self, **kwargs):
        # Write your code here
        # Forward pass

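        # 1. Pop the labels added by collate_fn so they are not passed to the encoder
        labels_sim = kwargs.pop("labels_sim", None)
        labels_ent = kwargs.pop("labels_ent", None)

        # kwargs now holds only encoder inputs (input_ids and attention_mask;
        # the RoBERTa tokenizer does not return token_type_ids by default)
        bert_output = self.bert(**kwargs)

        # 2. Take pooler_output: the hidden state of the first token (<s>, RoBERTa's
        # counterpart of [CLS]) passed through a dense layer and tanh.
        # It represents the whole input sequence (premise + hypothesis).
        # The roberta-base checkpoint has no pooler weights (see the warning printed
        # when the model is loaded), so this layer is trained from scratch here.
        pooled_output = bert_output.pooler_output

        # 3. Feed pooled_output to both heads
        logits_sim = self.regression_head(pooled_output)
        logits_ent = self.classification_head(pooled_output)

        # 4. Return the outputs of both heads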
        return logits_sim, logits_ent

In [None]:
# TODO3: Define your optimizer and loss function

model = MultiLabelModel().to(device)
# TODO3-1: Define your Optimizer
# We use AdamW, as recommended by the assignment PDF; it is the standard optimizer for fine-tuning Transformers.
optimizer = AdamW(model.parameters(), lr=lr)

# TODO3-2: Define your loss functions (you should have two)
# Use different loss functions for different types of tasks.

# Sub-task 1 (relatedness_score) is regression, so we use MSELoss.
loss_sim_fn = torch.nn.MSELoss()

# Sub-task 2 (entailment_judgment) is 3-class classification, so we use CrossEntropyLoss.
loss_ent_fn = torch.nn.CrossEntropyLoss()


# scoring functions
psr = load("pearsonr")
acc = load("accuracy")
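
To make the two loss terms concrete, here is a minimal pure-Python sketch of what `MSELoss` and `CrossEntropyLoss` compute with their default mean reduction, and how the training loop adds them with equal weight. The helper names and all numbers are made up for illustration. This is not the PyTorch implementation, only the same arithmetic.

```python
import math
from typing import List


def mse_loss(predictions: List[float], targets: List[float]) -> float:
    """Mean of squared differences, as in MSELoss with mean reduction."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def cross_entropy_loss(logits: List[List[float]], labels: List[int]) -> float:
    """Mean negative log-softmax of the true class, computed from raw logits."""
    total = 0.0
    for row, label in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[label]
    return total / len(labels)


# Made-up batch of two examples.
sim_preds = [3.0, 4.5]            # regression head outputs
sim_labels = [3.5, 4.5]           # relatedness scores
ent_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # classification head outputs
ent_labels = [0, 2]               # class indices

loss_sim = mse_loss(sim_preds, sim_labels)
loss_ent = cross_entropy_loss(ent_logits, ent_labels)
total_loss = loss_sim + loss_ent  # equal weighting, as in the training loop
print(loss_sim, loss_ent, total_loss)
```

The two terms live on different scales (squared score units versus nats), so an unweighted sum is a design choice rather than a necessity. A weighted sum would be a natural variation to try.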

In [10]:
best_score = 0.0
for ep in range(epochs):
    pbar = tqdm(dl_train)
    pbar.set_description(f"Training epoch [{ep+1}/{epochs}]")
    model.train()
    # TODO4: Write the training loop
    # Write your code here
    # train your model
    # clear gradient
    # forward pass
    # compute loss
    # back-propagation
    # model optimization

    # Running total of the loss, used to report this epoch's average loss
    total_train_loss = 0.0

    for batch in pbar:
        # 1. Move the batch to the device
        batch = {k: v.to(device) for k, v in batch.items()}

        # 2. Get the labels
        labels_sim = batch['labels_sim']
        labels_ent = batch['labels_ent']

        # 3. clear gradient
        optimizer.zero_grad()

        # 4. forward pass
        # The model's forward() pops the label keys from **kwargs, so the batch can be passed directly
        logits_sim, logits_ent = model(**batch)

        # 5. compute loss
        # Regression loss: drop the last dim of logits_sim so its shape matches the labels, (batch_size,).
        # squeeze(-1) rather than squeeze() keeps the batch dimension even for a batch of size 1.
        loss_sim = loss_sim_fn(logits_sim.squeeze(-1), labels_sim)
        # Classification loss
        loss_ent = loss_ent_fn(logits_ent, labels_ent)

        # Sum the two losses (equal weights)
        total_loss = loss_sim + loss_ent

        # 6. back-propagation
        total_loss.backward()

        # 7. model optimization
        optimizer.step()

        total_train_loss += total_loss.item()
        pbar.set_postfix({"loss": total_loss.item()})

    print(f"Epoch {ep+1} Average Train Loss: {total_train_loss / len(dl_train)}")

    pbar = tqdm(dl_validation)
    pbar.set_description(f"Validation epoch [{ep+1}/{epochs}]")
    model.eval()

    # TODO5: Write the evaluation loop
    # Write your code here
    # Evaluate your model
    # Output all the evaluation scores (PearsonCorr, Accuracy)

    # Lists that collect all predictions and labels
    all_preds_sim = []
    all_labels_sim = []
    all_preds_ent = []
    all_labels_ent = []

    with torch.no_grad():  # no gradients are needed during validation
        for batch in pbar:
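            # 1. Move the batch to the device
            batch = {k: v.to(device) for k, v in batch.items()}

            # 2. Get the labels
            labels_sim = batch['labels_sim']
            labels_ent = batch['labels_ent']

            # 3. forward pass
            logits_sim, logits_ent = model(**batch)

            # 4. Process the predictions
            # Regression prediction (squeeze(-1) keeps the batch dim even for a batch of size 1,
            # so .tolist() below always returns a list)
            preds_sim = logits_sim.squeeze(-1)
            # Classification prediction (argmax)
            preds_ent = torch.argmax(logits_ent, dim=1)

            # 5. Collect the results (move back to CPU)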
            all_preds_sim.extend(preds_sim.cpu().tolist())
            all_labels_sim.extend(labels_sim.cpu().tolist())
            all_preds_ent.extend(preds_ent.cpu().tolist())
            all_labels_ent.extend(labels_ent.cpu().tolist())

    # After the loop, compute the overall scores

    # PearsonCorr
    pearson_corr = psr.compute(
        predictions=all_preds_sim,
        references=all_labels_sim
    )['pearsonr']

    # Accuracy
    accuracy = acc.compute(
        predictions=all_preds_ent,
        references=all_labels_ent
    )['accuracy']

    print(f"Epoch {ep+1} Validation:")
    print(f"Pearson Correlation: {pearson_corr}")
    print(f"Accuracy: {accuracy}")

    # Save the best model
    # (Fix: the template's 'best' variable should be 'best_score')
    current_score = pearson_corr + accuracy
    if current_score > best_score:
        best_score = current_score
        print(f"New best score: {best_score}. Saving model...")
        # Make sure the saved_models directory exists
        import os
        os.makedirs("./saved_models", exist_ok=True)
        torch.save(model.state_dict(), './saved_models/best_model.ckpt')

Training epoch [1/3]: 100%|██████████| 563/563 [00:55<00:00, 10.06it/s, loss=1.05]


Epoch 1 Average Train Loss: 1.2791515068940118


Validation epoch [1/3]: 100%|██████████| 63/63 [00:01<00:00, 55.61it/s]


Epoch 1 Validation:
Pearson Correlation: 0.8663668483333626
Accuracy: 0.864
New best score: 1.7303668483333627. Saving model...


Training epoch [2/3]: 100%|██████████| 563/563 [00:56<00:00,  9.94it/s, loss=0.467]


Epoch 2 Average Train Loss: 0.6166293499974755


Validation epoch [2/3]: 100%|██████████| 63/63 [00:01<00:00, 55.21it/s]


Epoch 2 Validation:
Pearson Correlation: 0.8694778143956021
Accuracy: 0.85


Training epoch [3/3]: 100%|██████████| 563/563 [00:52<00:00, 10.76it/s, loss=0.303]


Epoch 3 Average Train Loss: 0.4929061570001326


Validation epoch [3/3]: 100%|██████████| 63/63 [00:01<00:00, 53.54it/s]


Epoch 3 Validation:
Pearson Correlation: 0.8785317529848184
Accuracy: 0.89
New best score: 1.7685317529848184. Saving model...
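
The checkpoint rule in the loop above keeps a model only when Pearson correlation plus accuracy exceeds the best sum seen so far. As a small sketch, replaying the three validation results printed in the log shows why epoch 2 was not saved. The variable names `validation_log` and `saved_epochs` are made up for this illustration.

```python
# Validation metrics copied from the training log above: (Pearson, accuracy).
validation_log = [
    (0.8663668483333626, 0.864),  # epoch 1
    (0.8694778143956021, 0.85),   # epoch 2
    (0.8785317529848184, 0.89),   # epoch 3
]

best_score = 0.0
saved_epochs = []
for epoch, (pearson_corr, accuracy) in enumerate(validation_log, start=1):
    current_score = pearson_corr + accuracy
    if current_score > best_score:
        # Same rule as the training loop: keep only strictly better sums.
        best_score = current_score
        saved_epochs.append(epoch)

print(saved_epochs)
```

In epoch 2 the Pearson correlation improved slightly, but accuracy fell from 0.864 to 0.85, so the sum dropped below the epoch 1 score and the checkpoint was kept. The weights loaded for testing are therefore those from epoch 3.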


In [11]:
# Load the model
model = MultiLabelModel().to(device)
# Load the best model weights we saved
model.load_state_dict(torch.load("./saved_models/best_model.ckpt", weights_only=True))

# Test Loop
pbar = tqdm(dl_test, desc="Test")
model.eval()

# TODO6: Write the test loop
# Write your code here
# We have loaded the best model with the highest evaluation score for you
# Please implement the test loop to evaluate the model on the test dataset
# We will have 10% of the total score for the test accuracy and pearson correlation

# Lists that collect all predictions and labels
all_preds_sim = []
all_labels_sim = []
all_preds_ent = []
all_labels_ent = []

with torch.no_grad():  # no gradients are needed at test time
    for batch in pbar:
        # 1. Move the batch to the device
        batch = {k: v.to(device) for k, v in batch.items()}

        # 2. Get the labels
        labels_sim = batch['labels_sim']
        labels_ent = batch['labels_ent']

        # 3. forward pass
        logits_sim, logits_ent = model(**batch)

        # 4. Process the predictions
        preds_sim = logits_sim.squeeze(-1)
        preds_ent = torch.argmax(logits_ent, dim=1)

        # 5. Collect the results (move back to CPU)
        all_preds_sim.extend(preds_sim.cpu().tolist())
        all_labels_sim.extend(labels_sim.cpu().tolist())
        all_preds_ent.extend(preds_ent.cpu().tolist())
        all_labels_ent.extend(labels_ent.cpu().tolist())

# After the loop, compute and print the final test scores

# PearsonCorr
test_pearson_corr = psr.compute(
    predictions=all_preds_sim,
    references=all_labels_sim
)['pearsonr']

# Accuracy
test_accuracy = acc.compute(
    predictions=all_preds_ent,
    references=all_labels_ent
)['accuracy']

print("\n--- Test Set Results ---")
print(f"Final Pearson Correlation: {test_pearson_corr}")
print(f"Final Accuracy: {test_accuracy}")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Test: 100%|██████████| 616/616 [00:10<00:00, 56.50it/s]


--- Test Set Results ---
Final Pearson Correlation: 0.8721033499048236
Final Accuracy: 0.8891820580474934
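
For reference, the two reported metrics can be written out by hand. The sketch below is a plain-Python version of Pearson correlation and accuracy, not the `evaluate` implementation. The function names and sample values are made up for illustration.

```python
import math
from typing import List, Sequence


def pearson_r(predictions: Sequence[float], references: Sequence[float]) -> float:
    """Covariance of the two series divided by the product of their spreads."""
    n = len(references)
    mean_p = sum(predictions) / n
    mean_r = sum(references) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predictions, references))
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    var_r = sum((r - mean_r) ** 2 for r in references)
    return cov / math.sqrt(var_p * var_r)


def accuracy(predictions: List[int], references: List[int]) -> float:
    """Fraction of predictions that equal the reference label."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Made-up values: predictions exactly half the reference scores.
print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(accuracy([0, 1, 2, 2], [0, 1, 1, 2]))
```

Pearson correlation is unchanged when the predictions are shifted or rescaled by a positive factor, so a high correlation does not by itself rule out large absolute errors on individual pairs.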





In [12]:
# --- Error Analysis Cell ---
#
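# This cell analyzes the best-performing model (RoBERTa).
# It finds examples in the validation set that the model predicts incorrectly.

from transformers import RobertaTokenizer, RobertaModel
import torch
from tqdm import tqdm

# (If the Colab session was disconnected, the imports below may be needed again)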
# from datasets import load_dataset
# from torch.utils.data import Dataset, DataLoader
# device = "cuda" if torch.cuda.is_available() else "cpu"

print("--- Starting Error Analysis for RoBERTa-base ---")

# --- 1. Reload the RoBERTa-base tokenizer ---
# (Make sure we are using the RoBERTa tokenizer)
print("Loading RoBERTa tokenizer...")
tokenizer = RobertaTokenizer.from_pretrained("roberta-base", cache_dir="./cache/")

# --- 2. Redefine MultiLabelModel (RoBERTa version) ---
# (The `MultiLabelModel` class must be the RoBERTa architecture, not GPT-2)
print("Defining RoBERTa model architecture...")
class MultiLabelModel(torch.nn.Module):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Load RoBERTa-base
        self.bert = RobertaModel.from_pretrained(
            "roberta-base",
            cache_dir="./cache/"
        )
        hidden_size = self.bert.config.hidden_size
        self.regression_head = torch.nn.Linear(hidden_size, 1)
        self.classification_head = torch.nn.Linear(hidden_size, 3)

    def forward(self, **kwargs):
        # Like BERT, RoBERTa exposes pooler_output
        labels_sim = kwargs.pop("labels_sim", None)
        labels_ent = kwargs.pop("labels_ent", None)
        bert_output = self.bert(**kwargs)
        pooled_output = bert_output.pooler_output  # use pooler_output
        logits_sim = self.regression_head(pooled_output)
        logits_ent = self.classification_head(pooled_output)
        return logits_sim, logits_ent

# --- 3. Load the model and the best weights ---
# (Assumes the best model was saved as 'best_model.ckpt')
print("Loading best RoBERTa model weights...")
analysis_model = MultiLabelModel().to(device)
analysis_model.load_state_dict(torch.load("./saved_models/best_model.ckpt", weights_only=True))
analysis_model.eval()  # switch to evaluation mode

# --- 4. Load the validation dataset (SemevalDataset should already be defined) ---
print("Loading validation dataset...")
validation_dataset = SemevalDataset(split="validation")

# --- 5. Iterate over the validation set and find the errors ---
print("\n--- Finding Errors in Validation Set ---")

# Build a (value -> name) lookup table for readability
# 0: NEUTRAL, 1: ENTAILMENT, 2: CONTRADICTION
entailment_map = {0: "NEUTRAL", 1: "ENTAILMENT", 2: "CONTRADICTION"}

# Maximum number of error examples to print per task
classification_errors_found = 0
regression_errors_found = 0
max_errors_to_show = 15

with torch.no_grad():
    # Analyze one example at a time (batch size 1), which is the easiest to read
    for data_point in tqdm(validation_dataset, desc="Analyzing..."):

        # 1. Get the fields and pack them manually (batch size 1)
        premise = data_point['premise']
        hypothesis = data_point['hypothesis']
        label_sim = data_point['relatedness_score']
        label_ent = data_point['entailment_judgment']

        # 2. Tokenize
        inputs = tokenizer(
            premise,
            hypothesis,
            padding=True,
            truncation=True,
            return_tensors="pt"
        ).to(device)

        # 3. Predict
        logits_sim, logits_ent = analysis_model(**inputs)

        # 4. Get the predictions
        pred_sim = logits_sim.squeeze().item()
        pred_ent = torch.argmax(logits_ent, dim=1).item()

        # --- 5. Check and print the errors ---

        # Check for a classification error
        if pred_ent != label_ent and classification_errors_found < max_errors_to_show:
            print(f"\n[CLASSIFICATION ERROR #{classification_errors_found + 1}]")
            print(f"  PREMISE:    {premise}")
            print(f"  HYPOTHESIS: {hypothesis}")
            print(f"  MODEL PREDICTED: {entailment_map[pred_ent]}")
            print(f"  TRUE LABEL:      {entailment_map[label_ent]}")
            classification_errors_found += 1

        # Check for a regression error (here: absolute error greater than 1.0)
        sim_error = abs(pred_sim - label_sim)
        if sim_error > 1.0 and regression_errors_found < max_errors_to_show:
            print(f"\n[REGRESSION ERROR #{regression_errors_found + 1}]")
            print(f"  PREMISE:    {premise}")
            print(f"  HYPOTHESIS: {hypothesis}")
            print(f"  MODEL PREDICTED: {pred_sim:.2f}")
            print(f"  TRUE LABEL:      {label_sim:.2f} (Error: {sim_error:.2f})")
            regression_errors_found += 1

        if classification_errors_found >= max_errors_to_show and regression_errors_found >= max_errors_to_show:
            print("\n--- Analysis complete: Reached max errors to show. ---")
            break

if not (classification_errors_found >= max_errors_to_show and regression_errors_found >= max_errors_to_show):
    print("\n--- Analysis complete: All validation data checked. ---")

--- Starting Error Analysis for RoBERTa-base ---
Loading RoBERTa tokenizer...
Defining RoBERTa model architecture...
Loading best RoBERTa model weights...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading validation dataset...

--- Finding Errors in Validation Set ---


Analyzing...:   2%|▏         | 10/500 [00:00<00:05, 95.21it/s]


[CLASSIFICATION ERROR #1]
  PREMISE:    Two dogs are playing by a tree
  HYPOTHESIS: Two dogs are playing by a plant
  MODEL PREDICTED: NEUTRAL
  TRUE LABEL:      ENTAILMENT


Analyzing...:   4%|▍         | 22/500 [00:00<00:04, 105.27it/s]


[CLASSIFICATION ERROR #2]
  PREMISE:    A group of scouts are hiking through the grass
  HYPOTHESIS: A group of explorers are walking through the grass
  MODEL PREDICTED: ENTAILMENT
  TRUE LABEL:      NEUTRAL

[REGRESSION ERROR #1]
  PREMISE:    A lady is surfing and riding a wave
  HYPOTHESIS: A blond girl is looking at the waves
  MODEL PREDICTED: 1.78
  TRUE LABEL:      3.30 (Error: 1.52)

[CLASSIFICATION ERROR #3]
  PREMISE:    The woman wearing silver pants, pink bellbottoms and a pink scarf is riding a bike
  HYPOTHESIS: Pink bellbottoms and a pink scarf aren't to be worn by women with silver pants or bike riding people
  MODEL PREDICTED: CONTRADICTION
  TRUE LABEL:      NEUTRAL

[CLASSIFICATION ERROR #4]
  PREMISE:    A woman is taking off a cloak, which is very large, and revealing an extravagant dress
  HYPOTHESIS: A woman is putting on a cloak, which is very large, and concealing an extravagant dress
  MODEL PREDICTED: NEUTRAL
  TRUE LABEL:      CONTRADICTION


Analyzing...:   7%|▋         | 33/500 [00:00<00:04, 104.29it/s]


[CLASSIFICATION ERROR #5]
  PREMISE:    A person is climbing a rock with a rope, which is pink
  HYPOTHESIS: One man is climbing a cliff with a rope
  MODEL PREDICTED: ENTAILMENT
  TRUE LABEL:      NEUTRAL

[CLASSIFICATION ERROR #6]
  PREMISE:    A few men in a competition are running outside
  HYPOTHESIS: A few men in a competition are running indoors
  MODEL PREDICTED: NEUTRAL
  TRUE LABEL:      CONTRADICTION

[CLASSIFICATION ERROR #7]
  PREMISE:    A few men in a competition are running outside
  HYPOTHESIS: A few men are running competitions outside
  MODEL PREDICTED: ENTAILMENT
  TRUE LABEL:      NEUTRAL


Analyzing...:  16%|█▋        | 82/500 [00:00<00:03, 115.51it/s]


[CLASSIFICATION ERROR #8]
  PREMISE:    A woman is dancing and singing alone
  HYPOTHESIS: A woman is dancing and singing with other women
  MODEL PREDICTED: NEUTRAL
  TRUE LABEL:      CONTRADICTION

[REGRESSION ERROR #2]
  PREMISE:    A man is playing flute
  HYPOTHESIS: A man is playing a game with a ball
  MODEL PREDICTED: 3.42
  TRUE LABEL:      1.50 (Error: 1.92)

[CLASSIFICATION ERROR #9]
  PREMISE:    A man is playing a guitar
  HYPOTHESIS: A man is strumming a guitar
  MODEL PREDICTED: NEUTRAL
  TRUE LABEL:      ENTAILMENT

[REGRESSION ERROR #3]
  PREMISE:    Three men are practicing karate outdoors
  HYPOTHESIS: Three boys in karate costumes are fighting
  MODEL PREDICTED: 2.35
  TRUE LABEL:      3.80 (Error: 1.45)

[CLASSIFICATION ERROR #10]
  PREMISE:    A hole is being burrowed by the badger
  HYPOTHESIS: A badger is shrewdly digging the earth
  MODEL PREDICTED: NEUTRAL
  TRUE LABEL:      ENTAILMENT

[CLASSIFICATION ERROR #11]
  PREMISE:    Some ingredients are being mixed

Analyzing...:  21%|██▏       | 107/500 [00:00<00:03, 117.50it/s]


[CLASSIFICATION ERROR #12]
  PREMISE:    The kittens on the trays are being eaten as food for an advertisement
  HYPOTHESIS: The food on the trays is being eaten by the kittens
  MODEL PREDICTED: ENTAILMENT
  TRUE LABEL:      NEUTRAL

[REGRESSION ERROR #4]
  PREMISE:    The kittens on the trays are being eaten as food for an advertisement
  HYPOTHESIS: The food on the trays is being eaten by the kittens
  MODEL PREDICTED: 4.60
  TRUE LABEL:      3.50 (Error: 1.10)

[CLASSIFICATION ERROR #13]
  PREMISE:    The food on the trays is being eaten by the kittens
  HYPOTHESIS: A few kittens are eating
  MODEL PREDICTED: NEUTRAL
  TRUE LABEL:      ENTAILMENT

[CLASSIFICATION ERROR #14]
  PREMISE:    A lady is cutting up some meat precisely
  HYPOTHESIS: Some meat is being cut into pieces by a woman
  MODEL PREDICTED: NEUTRAL
  TRUE LABEL:      ENTAILMENT

[CLASSIFICATION ERROR #15]
  PREMISE:    The man is denying an interview
  HYPOTHESIS: The man is granting an interview
  MODEL PREDICTED: 

Analyzing...:  29%|██▊       | 143/500 [00:01<00:03, 115.86it/s]


[REGRESSION ERROR #6]
  PREMISE:    There is no cat eating corn on the cob
  HYPOTHESIS: A cat is eating some corn
  MODEL PREDICTED: 4.04
  TRUE LABEL:      2.50 (Error: 1.54)

[REGRESSION ERROR #7]
  PREMISE:    The monkey is brushing a bull dog
  HYPOTHESIS: A bull dog is brushing the monkey
  MODEL PREDICTED: 4.99
  TRUE LABEL:      3.90 (Error: 1.09)

[REGRESSION ERROR #8]
  PREMISE:    The person is slicing a clove of garlic into pieces
  HYPOTHESIS: The person is not slicing a clove of garlic into pieces
  MODEL PREDICTED: 4.30
  TRUE LABEL:      3.00 (Error: 1.30)

[REGRESSION ERROR #9]
  PREMISE:    The man is kick boxing with a trainer
  HYPOTHESIS: A karate practitioner is kicking at another man who is wearing protective boxing gloves
  MODEL PREDICTED: 2.56
  TRUE LABEL:      4.00 (Error: 1.44)

[REGRESSION ERROR #10]
  PREMISE:    A woman on a rock is lying on a blanket and reading a book
  HYPOTHESIS: A woman is rocking over a blanket lying on someone reading a book
  MO

Analyzing...:  36%|███▌      | 181/500 [00:01<00:02, 118.98it/s]


[REGRESSION ERROR #12]
  PREMISE:    A monkey is kicking at a person's glove
  HYPOTHESIS: The monkey is practicing martial arts
  MODEL PREDICTED: 2.55
  TRUE LABEL:      3.60 (Error: 1.05)

[REGRESSION ERROR #13]
  PREMISE:    The man is not dancing
  HYPOTHESIS: A woman is dancing
  MODEL PREDICTED: 1.87
  TRUE LABEL:      3.30 (Error: 1.43)


Analyzing...:  38%|███▊      | 191/500 [00:01<00:02, 114.11it/s]



[REGRESSION ERROR #14]
  PREMISE:    A woman is collecting the water from a tap in a mug
  HYPOTHESIS: A boy is filling a pitcher with water
  MODEL PREDICTED: 1.66
  TRUE LABEL:      2.70 (Error: 1.04)

[REGRESSION ERROR #15]
  PREMISE:    It is raining on a walking man
  HYPOTHESIS: A man is walking in the rain
  MODEL PREDICTED: 3.65
  TRUE LABEL:      4.90 (Error: 1.25)

--- Analysis complete: Reached max errors to show. ---
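
One way to summarize the printed classification errors is to count each (predicted, true) pair. The sketch below tallies the thirteen errors whose labels are fully visible in the output above. Errors #11 and #15 are cut off in the output and are left out rather than guessed. The variable names are made up for illustration.

```python
from collections import Counter

# (predicted, true) pairs transcribed from the printed classification errors above.
# Errors #11 and #15 are truncated in the output and therefore omitted.
errors = [
    ("NEUTRAL", "ENTAILMENT"),     # error 1
    ("ENTAILMENT", "NEUTRAL"),     # error 2
    ("CONTRADICTION", "NEUTRAL"),  # error 3
    ("NEUTRAL", "CONTRADICTION"),  # error 4
    ("ENTAILMENT", "NEUTRAL"),     # error 5
    ("NEUTRAL", "CONTRADICTION"),  # error 6
    ("ENTAILMENT", "NEUTRAL"),     # error 7
    ("NEUTRAL", "CONTRADICTION"),  # error 8
    ("NEUTRAL", "ENTAILMENT"),     # error 9
    ("NEUTRAL", "ENTAILMENT"),     # error 10
    ("ENTAILMENT", "NEUTRAL"),     # error 12
    ("NEUTRAL", "ENTAILMENT"),     # error 13
    ("NEUTRAL", "ENTAILMENT"),     # error 14
]

confusions = Counter(errors)
for (predicted, true), count in confusions.most_common():
    print(f"predicted {predicted:<13} true {true:<13} x{count}")
```

In this small sample, 8 of the 13 complete examples are cases where the model predicted NEUTRAL for a pair labeled ENTAILMENT or CONTRADICTION. Only the first errors encountered are printed, so this may not reflect the whole validation set.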
