# Results

In [1]:
import warnings
import logging
import sys

import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence

from src.data.dataset import build_vocabulary, tokenize, tokens_to_indices
from src.models.net import NLIModel
from src.models.classifiers import Classifier
from src.models.encoders import BiLSTMEncoder

logging.disable(sys.maxsize)

warnings.filterwarnings("ignore")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


## Setup

First we load the vocabulary and the models.

In [2]:
token_to_idx, word_embeddings = build_vocabulary(split="train", glove_version="840B", word_embedding_dim=300)

In [3]:
encoder = BiLSTMEncoder(
    word_embeddings=word_embeddings,
    input_dim=300,
    output_dim=2048,
    max_pooling=True,
)
classifier = Classifier(input_dim=4096, num_classes=3)
model = NLIModel(encoder, classifier)

# Load the trained weights, then move the model to the target device
model.load_state_dict(torch.load("models/bilstm-max/best_model.pt", map_location=device))
model = model.to(device)

We create a prediction function that takes a premise and a hypothesis and returns the predicted entailment label.

In [4]:
def predict(
    model: nn.Module,
    token_to_idx: dict[str, int],
    device: torch.device,
    premise: str,
    hypothesis: str,
) -> str:
    """Predict the entailment label of the given premise and hypothesis.

    Args:
        model (nn.Module): The model (encoder + classifier).
        token_to_idx (dict[str, int]): The token to index mapping.
        device (torch.device): The device to use.
        premise (str): The premise.
        hypothesis (str): The hypothesis.

    Returns:
        str: The predicted entailment label.
    """
    # Set the model to evaluation mode
    model.eval()
    
    id_to_label = {
        0: "entailment",
        1: "neutral",
        2: "contradiction",
    }

    # Disable gradient computation
    with torch.no_grad():
        # Tokenize the premise and hypothesis
        premise_tokens = tokenize(premise)
        hypothesis_tokens = tokenize(hypothesis)
        
        # Convert tokens to indices
        premise_indices = tokens_to_indices(premise_tokens, token_to_idx)
        hypothesis_indices = tokens_to_indices(hypothesis_tokens, token_to_idx)

        # Convert indices to tensors and wrap them in a list
        premise_indices = [torch.tensor(premise_indices, dtype=torch.long)]
        hypothesis_indices = [torch.tensor(hypothesis_indices, dtype=torch.long)]

        # Pad sequences and compute lengths
        padded_premises = pad_sequence(premise_indices, batch_first=True, padding_value=1)
        premise_lengths = torch.tensor([len(premise_indices[0])], dtype=torch.long)
        padded_hypotheses = pad_sequence(hypothesis_indices, batch_first=True, padding_value=1)
        hypothesis_lengths = torch.tensor([len(hypothesis_indices[0])], dtype=torch.long)

        # Move the batch to the device
        padded_premises = padded_premises.to(device)
        premise_lengths = premise_lengths.to(device)
        padded_hypotheses = padded_hypotheses.to(device)
        hypothesis_lengths = hypothesis_lengths.to(device)

        # Compute the logits
        logits = model(padded_premises, premise_lengths, padded_hypotheses, hypothesis_lengths)

        # Get the predictions
        predictions = torch.argmax(logits, dim=-1)
        
    return id_to_label[int(predictions.item())]


## Predictions

Predictions using the BiLSTM model with max pooling.

In [5]:
premise_1 = "Two men sitting in the sun"
hypothesis_1 = "Nobody is sitting in the shade"
label_1 = predict(model, token_to_idx, device, premise_1, hypothesis_1)
print(f"Predicted label: '{label_1}', correct label: 'Neutral'")

Predicted label: 'contradiction', correct label: 'Neutral'


In [6]:
premise_2 = "A man is walking a dog"
hypothesis_2 = "No cat is outside"
label_2 = predict(model, token_to_idx, device, premise_2, hypothesis_2)
print(f"Predicted label: '{label_2}', correct label: 'Neutral'")

Predicted label: 'contradiction', correct label: 'Neutral'


A possible reason for these failures is the negation in the hypotheses, which might lead the model to focus on the apparent opposition between premise and hypothesis. The model may be overly sensitive to negation words such as "nobody" and "no", causing it to perceive a stronger contradiction than actually exists. In addition, the model might struggle to relate the entities across the two sentences, such as "men" and "nobody" or "dog" and "cat". Failing to capture these semantic relationships could lead it to misjudge the relation between the premise and the hypothesis.
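One way to probe the negation hypothesis is to compare predictions on minimal pairs that differ only in the negation word. The sketch below is illustrative only: `label_distribution`, `keyword_predictor`, and the non-negated hypotheses are hypothetical and not part of the original experiments. The stand-in predictor simply mimics the suspected shortcut so that the snippet runs without the trained model.

```python
from collections import Counter
from typing import Callable


def label_distribution(
    predict_fn: Callable[[str, str], str],
    pairs: list[tuple[str, str]],
) -> Counter:
    """Count how often each label is predicted over (premise, hypothesis) pairs."""
    return Counter(predict_fn(premise, hypothesis) for premise, hypothesis in pairs)


# Hypothetical stand-in for the real model: it answers "contradiction" whenever
# the hypothesis contains a negation word, mimicking the suspected shortcut.
NEGATION_WORDS = {"no", "nobody", "not", "never"}


def keyword_predictor(premise: str, hypothesis: str) -> str:
    tokens = set(hypothesis.lower().split())
    return "contradiction" if tokens & NEGATION_WORDS else "neutral"


# Minimal pairs: same premise, hypothesis with and without a negation word.
negated = [
    ("Two men sitting in the sun", "Nobody is sitting in the shade"),
    ("A man is walking a dog", "No cat is outside"),
]
plain = [
    ("Two men sitting in the sun", "Somebody is sitting in the shade"),
    ("A man is walking a dog", "A cat is outside"),
]

negated_counts = label_distribution(keyword_predictor, negated)
plain_counts = label_distribution(keyword_predictor, plain)
print("negated:", dict(negated_counts))
print("plain:  ", dict(plain_counts))
```

To run the same probe against the trained model, `predict_fn` could be `functools.partial(predict, model, token_to_idx, device)`. If the real model's predictions shift toward "contradiction" only on the negated pairs, that would support the explanation above.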

## Results

The following table reports accuracy on the SNLI dev and test sets (as fractions) and the micro- and macro-averaged scores on the SentEval tasks (in percent).

| **Model** | **SNLI Dev (acc.)** | **SNLI Test (acc.)** | **SentEval Micro (%)** | **SentEval Macro (%)** |
|---|---|---|---|---|
| Baseline | 0.671 | 0.672 | 80.611 | 79.123 |
| LSTM | 0.800 | 0.799 | 76.843 | 76.280 |
| BiLSTM | 0.795 | 0.796 | 80.516 | 80.019 |
| BiLSTM (max) | 0.836 | 0.836 | 82.556 | 81.764 |
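The difference between the two SentEval averages can be sketched as follows. This assumes the common convention that the macro average is the unweighted mean over tasks, while the micro average weights each task by its number of dev samples. The task names, accuracies, and sizes are made up for illustration and are not the values behind the table.

```python
def macro_average(accuracies: dict[str, float]) -> float:
    """Unweighted mean of the per-task accuracies."""
    return sum(accuracies.values()) / len(accuracies)


def micro_average(accuracies: dict[str, float], sizes: dict[str, int]) -> float:
    """Mean of the per-task accuracies, weighted by the number of samples per task."""
    total = sum(sizes[task] for task in accuracies)
    return sum(accuracies[task] * sizes[task] for task in accuracies) / total


# Hypothetical per-task accuracies (in percent) and dev-set sizes.
accuracies = {"task_a": 90.0, "task_b": 70.0, "task_c": 80.0}
sizes = {"task_a": 1000, "task_b": 3000, "task_c": 1000}

macro = macro_average(accuracies)
micro = micro_average(accuracies, sizes)
print(f"macro: {macro:.1f}, micro: {micro:.1f}")
```

Because the largest toy task has the lowest accuracy, the micro average falls below the macro average. A gap between the two columns therefore indicates that performance differs between large and small tasks.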

## Analysis

### Model performance

BiLSTM (max) outperforms the other models on both SNLI and the SentEval tasks. This can be attributed to its architecture: the bidirectional LSTM reads the sentence in both directions, which yields more contextualized representations, and max pooling over time keeps the most salient value of each feature across all positions, making the representation more robust to variations in sentence length and structure. The baseline, which only averages word embeddings, is by far the weakest model on SNLI (0.672 test accuracy). On SentEval, however, it beats the unidirectional LSTM (80.6 vs. 76.8 micro) and is roughly on par with the plain BiLSTM (80.6 vs. 80.5 micro, 79.1 vs. 80.0 macro). This suggests that, although simplistic, averaged embeddings still capture information that transfers well to other tasks. The unidirectional LSTM has the lowest SentEval scores, possibly because it reads the sentence in one direction only, which limits its ability to represent complex sentence structures. On SNLI it performs about as well as the plain BiLSTM (0.799 vs. 0.796 test accuracy).
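The max pooling step can be illustrated with a minimal pure-Python sketch. It is not the code of `BiLSTMEncoder`, whose internals are not shown here. In particular, whether that encoder masks padded positions is an assumption, and the hidden states below are toy values.

```python
def max_pool(hidden_states: list[list[float]], length: int) -> list[float]:
    """Element-wise maximum over the first `length` time steps."""
    valid = hidden_states[:length]
    return [max(column) for column in zip(*valid)]


# Toy hidden states: 4 time steps x 3 features; the last step is padding.
hidden_states = [
    [0.1, -0.5, 0.3],
    [0.9, -0.2, 0.0],
    [0.2, -0.4, 0.1],
    [0.0, 0.0, 0.0],  # padding
]

masked = max_pool(hidden_states, length=3)    # ignores the padded step
unmasked = max_pool(hidden_states, length=4)  # padding takes part in the max
print("masked:  ", masked)
print("unmasked:", unmasked)
```

Each output feature comes from whichever time step activates it most strongly, regardless of its position in the sentence. The second feature also shows why padding should be excluded: all real values are negative, so an unmasked maximum would return the padding value instead.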

### Model failures

All models can fail on complex sentence structures, negations, or dependencies that require a deeper understanding of language. The baseline will likely struggle the most in these cases, as it relies solely on averaged word embeddings and therefore has no access to word order or context. The unidirectional LSTM may also fail to capture dependencies that require context from both directions.
The BiLSTM and BiLSTM (max) models should be more robust to such issues thanks to their bidirectional nature, but they are not perfect: they can still fail when long-range dependencies or the semantic relationships between premise and hypothesis are intricate.

### Sentence representations

A sentence embedding is a fixed-size vector that aims to capture the meaning and structure of a sentence. In the baseline model, the embedding is simply the average of the word vectors, which discards word order and context. The LSTM, BiLSTM, and BiLSTM (max) models process the words in sequence and can therefore better capture word order and the context in which words appear. However, even these more complex models lose some information: compressing a sentence into a single vector can blur certain syntactic or semantic relationships, as the negation examples in the Predictions section illustrate.
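The loss of word order in the baseline can be made concrete with a small sketch. The two-dimensional `toy_embeddings` are invented for illustration (the actual model uses 300-dimensional GloVe vectors), and `average_embedding` is a hypothetical helper rather than the baseline's implementation.

```python
def average_embedding(tokens: list[str], embeddings: dict[str, list[float]]) -> list[float]:
    """Average the word vectors of the given tokens, dimension by dimension."""
    vectors = [embeddings[token] for token in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical 2-dimensional word vectors.
toy_embeddings = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [0.5, 0.5],
}

first = average_embedding("dog bites man".split(), toy_embeddings)
second = average_embedding("man bites dog".split(), toy_embeddings)
print(first, second, first == second)
```

Both sentences map to the same vector even though their meanings differ, so a classifier on top of averaged embeddings cannot tell them apart.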

## Additional experiment
### Research question: How do the models perform on sentences with different lengths?

In [7]:
from datasets import Dataset
from torch.utils.data import Subset, DataLoader
import numpy as np
from tqdm import tqdm
from functools import partial


from src.data.dataset import get_dataset, snli_collate_fn

In [8]:
def add_length_info(example: dict) -> dict:
    """Add the combined length of premise and hypothesis to the example.

    Args:
        example (dict): An example from the dataset

    Returns:
        dict: The example with an additional "length" field
    """
    example["length"] = len(example["premise"]) + len(example["hypothesis"])
    return example

def split_dataset_by_quantiles(dataset: Dataset, quantiles: list[float]) -> list[Dataset]:
    """Split the dataset into subsets based on sentence length quantiles.

    Args:
        dataset (Dataset): The dataset to split
        quantiles (List[float]): A list of quantiles to use as the split points

    Returns:
        List[Dataset]: A list of datasets containing examples grouped by sentence length
    """
    # Add length information to the dataset
    dataset = dataset.map(add_length_info)

    # Calculate the quantiles
    lengths = dataset["length"]
    quantile_values = np.quantile(lengths, quantiles)

    # Create a list of datasets by filtering based on the quantile values
    datasets = []
    for i, q in enumerate(quantile_values):
        if i == 0:
            lower_bound = 0
        else:
            lower_bound = quantile_values[i - 1]
        upper_bound = q
        subset = dataset.filter(lambda example: lower_bound <= example["length"] < upper_bound)
        datasets.append(subset)

    # Add the last subset containing examples with lengths greater than or equal to the last quantile value
    datasets.append(dataset.filter(lambda example: example["length"] >= quantile_values[-1]))

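    # The "length" column is kept on purpose: it is used later to report the
    # average length of each subset. (Dataset.remove_columns returns a new
    # dataset, so dropping the column would require rebuilding the list.)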

    return datasets

def evaluate(
    model: nn.Module,
    criterion: nn.Module,
    eval_data: DataLoader,
    device: torch.device,
) -> tuple[float, float]:
    """Evaluate the model on the given data.

    Args:
        model (nn.Module): The model (encoder + classifier).
        criterion (nn.Module): The loss function.
        eval_data (DataLoader): The evaluation data.
        device (torch.device): The device to use.

    Returns:
        tuple[float, float]: The average loss and accuracy on the evaluation data.
    """
    # Set the model to evaluation mode
    model.eval()

    # Keep track of the evaluation loss and correct predictions
    eval_loss = 0.0
    correct_predictions = 0

    # Disable gradient computation
    with torch.no_grad():
        for batch in tqdm(eval_data, desc="Evaluating"):

            # Unpack the batch
            (premise, hypothesis,
             premise_lengths, hypothesis_lengths,
             label) = batch

            # Move the batch to the device
            premise = premise.to(device)
            premise_lengths = premise_lengths.to(device)
            hypothesis = hypothesis.to(device)
            hypothesis_lengths = hypothesis_lengths.to(device)
            label = label.to(device)

            # Compute the logits
            logits = model(premise, premise_lengths, hypothesis, hypothesis_lengths)

            # Compute the loss
            loss = criterion(logits, label)

            eval_loss += loss.item()

            # Get the predictions
            predictions = torch.argmax(logits, dim=-1)

            # Count the correct predictions
            correct_predictions += (predictions == label).sum().item()

        # Compute the average evaluation loss
        eval_loss = eval_loss / len(eval_data)

        # Compute the accuracy
        accuracy = correct_predictions / len(eval_data.dataset)

    return eval_loss, accuracy
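A note on the reported loss: `evaluate` divides the summed batch losses by the number of batches, and `nn.CrossEntropyLoss` averages within each batch by default. The result is therefore a mean of batch means, which differs slightly from the per-example mean when the last batch is smaller. The toy sketch below, with made-up loss values, illustrates the effect. Accuracy is unaffected because it is divided by the number of examples.

```python
def mean(values: list[float]) -> float:
    return sum(values) / len(values)


# Hypothetical per-example losses; with batch_size=4 the last batch has one example.
per_example_losses = [0.5, 0.5, 0.5, 0.5, 2.0]
batch_size = 4
batches = [
    per_example_losses[i:i + batch_size]
    for i in range(0, len(per_example_losses), batch_size)
]

mean_of_batch_means = mean([mean(batch) for batch in batches])
per_example_mean = mean(per_example_losses)
print(f"mean of batch means: {mean_of_batch_means:.2f}")
print(f"per-example mean:    {per_example_mean:.2f}")
```

With a batch size of 32 and thousands of examples the discrepancy is small, but weighting each batch loss by its batch size would remove it entirely.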

In [10]:
# Get the test dataset
test_dataset = get_dataset("test")

# Split the dataset into subsets based on sentence length quantiles
short_dataset, medium_dataset, long_dataset = split_dataset_by_quantiles(test_dataset, quantiles=[0.33, 0.66])

print(f"Short dataset size: {len(short_dataset)}")
print(f"Medium dataset size: {len(medium_dataset)}")
print(f"Long dataset size: {len(long_dataset)}")

Short dataset size: 2843
Medium dataset size: 3543
Long dataset size: 3438
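The three subsets are not equally sized even though the cut points are the 33% and 66% quantiles. A likely explanation is that lengths are integers with many ties, and the half-open bins `[lower, upper)` send every example whose length equals a cut point to the upper bin. The sketch below reproduces that binning rule with the standard library. The lengths and cut points are toy values, and `assign_bucket` is a hypothetical helper.

```python
from bisect import bisect_right
from collections import Counter


def assign_bucket(length: int, cut_points: list[float]) -> int:
    """Index of the half-open bin [lower, upper) that contains `length`."""
    return bisect_right(cut_points, length)


# Toy combined lengths with several ties at the first cut point.
lengths = [10, 12, 12, 12, 12, 15, 18, 20, 25]
cut_points = [12, 18]  # hypothetical quantile values

bucket_sizes = Counter(assign_bucket(n, cut_points) for n in lengths)
print(dict(bucket_sizes))
```

All four examples of length 12 land in the middle bin, leaving the first bin with a single example. The same mechanism can make the "Short" subset smaller than the other two.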


In [11]:
# Get the average combined length (premise + hypothesis) for each subset, in the order short, medium, long
for dataset in [short_dataset, medium_dataset, long_dataset]:
    print(f"Average sentence length: {np.mean(dataset['length']):.2f}")

Average sentence length: 15.56
Average sentence length: 21.77
Average sentence length: 32.26


In [12]:
criterion = nn.CrossEntropyLoss()
for name, split in zip(["Short", "Medium", "Long"], [short_dataset, medium_dataset, long_dataset], strict=True):
    # Create a dataloader for the dataset
    eval_data = DataLoader(split, batch_size=32, collate_fn=partial(snli_collate_fn, token_to_idx))

    # Evaluate the model on the dataset
    print(f"Evaluating the model on the {name} dataset...")
    eval_loss, accuracy = evaluate(model, criterion, eval_data, device)

    print(f"{name} dataset evaluation loss: {eval_loss:.4f}, accuracy: {accuracy:.4f}")

Evaluating the model on the Short dataset...
Evaluating: 100%|██████████| 89/89 [00:05<00:00, 17.05it/s]
Short dataset evaluation loss: 0.4065, accuracy: 0.8505
Evaluating the model on the Medium dataset...
Evaluating: 100%|██████████| 111/111 [00:08<00:00, 13.18it/s]
Medium dataset evaluation loss: 0.4432, accuracy: 0.8335
Evaluating the model on the Long dataset...
Evaluating: 100%|██████████| 108/108 [00:12<00:00,  8.87it/s]
Long dataset evaluation loss: 0.4704, accuracy: 0.8255
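Accuracy decreases steadily as the combined length grows, from 0.8505 on the short subset to 0.8335 on the medium subset and 0.8255 on the long subset. As a sanity check, the size-weighted average of the three subset accuracies should recover the overall SNLI test accuracy. The sketch below only recombines the numbers printed above.

```python
# Subset sizes and accuracies copied from the outputs above.
subset_sizes = {"Short": 2843, "Medium": 3543, "Long": 3438}
subset_accuracy = {"Short": 0.8505, "Medium": 0.8335, "Long": 0.8255}

total = sum(subset_sizes.values())
overall = sum(subset_sizes[name] * subset_accuracy[name] for name in subset_sizes) / total

print(f"Total examples: {total}")
print(f"Size-weighted accuracy: {overall:.4f}")
```

The weighted average comes out at roughly 0.836, in line with the SNLI test accuracy reported for BiLSTM (max) in the results table. This indicates that the three subsets together cover the full test set.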



