# Practical 1

Repository: https://github.com/DylanAmadan/Bigold


The first practical of the Advanced Topics in Computational Semantics course focuses on learning general-purpose sentence representations through natural language inference (NLI). The objectives are:

- Implementing four neural models to classify sentence pairs based on their relation.
- Training these models utilizing the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015).
- Evaluating the trained models using the SentEval framework (Conneau and Kiela, 2018).

NLI is the task of deciding whether a hypothesis is entailed by a premise, contradicts it, or is neutral with respect to it, a fundamental aspect of understanding language. This assignment emphasizes pretraining a sentence encoder on NLI and then evaluating it on a range of other natural language tasks.

#### Deliverables
Model weights:
https://drive.google.com/drive/folders/14oEy87KHCX-2mIeCVgAIfHWtjlFk50KL

Tensorboards: 
https://drive.google.com/drive/folders/10FPdXgQOaOPjXgB0QnNdk9U3m195q9IE

SNLI analysis: SNLI training was not finished by the deadline; the analysis will be uploaded the next day.

SentEval analysis: this part of the assignment was not completed.

In the report, you should include:

- An overview of your results (both SNLI and the per-task SentEval scores), with conclusions drawn from the error analysis done in the notebook.
- You can look at questions like:
    - Why is model A performing better than model B? Where do the models fail?
    - What information does the sentence embedding represent, and what information might be lost?
    - Additional points will be awarded for further research questions that you identify and answer yourself.

In particular, we are looking for a clear motivation of your research questions, novelty and
clarity of presentation. You can also include screenshots from tensorboard if suitable. Try to
find good ways to visualize and present your findings. In the report, you should also present
an answer to the following question: Given two examples,

Premise - “Two men sitting in the sun”
Hypothesis - “Nobody is sitting in the shade”
Label - Neutral (likely predicts contradiction)

Premise - “A man is walking a dog”
Hypothesis - “No cat is outside”
Label - Neutral (likely predicts contradiction)

Can you think of a possible reason why the model would fail in such cases?

You should do this practical individually and submit a compressed zip of the deliverables with
the title ATCS-Practical1-FullName to Canvas.

Below is an example demonstration of the models at inference time. Feel free to swap in other model checkpoints.

### LSTM

In [63]:
import torch
from arch import LSTM, Classyfer 
from data_loader import load_embeddings  

# Parameters for the model (set these according to your model's training configuration)
vocab_size = 37179  # Total number of distinct tokens in your vocabulary
embedding_dim = 300  # Dimension of the word embeddings
hidden_size = 256  # Number of features in the hidden state of the LSTM
output_dim = 3  # Number of output classes (e.g., entailment, contradiction, neutral)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pretrained embeddings (make sure this matches your training setup)
pretrained_embeddings = load_embeddings("data/embedding_matrix.pickle")  # Adjust the path as needed

# Initialize the encoder
encoder = LSTM(vocab_size, embedding_dim, hidden_size, pretrained_embeddings, device)
encoder.output_size = 256  # Make sure this matches the LSTM output size used during training

# Initialize the Classyfer model with the correct input size for the classifier
model = Classyfer(encoder, mlp_hidden_size=512, num_classes=output_dim, device=device)
model.to(device)

# Load the model checkpoint
checkpoint_path = "checkpoints/first_epoch_9.pth"  # Adjust the path as needed
checkpoint = torch.load(checkpoint_path, map_location=device)
model.load_state_dict(checkpoint)
model.eval()

# Now the model is ready for inference or further evaluation


Classyfer(
  (encoder): LSTM(
    (embedding): Embedding(37179, 300, padding_idx=1)
    (lstm): LSTM(300, 256, batch_first=True)
  )
  (classifier): Sequential(
    (0): Linear(in_features=1024, out_features=512, bias=True)
    (1): Tanh()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): Tanh()
    (4): Linear(in_features=512, out_features=3, bias=True)
  )
)
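
The classifier's input width of 1024 is four times the encoder's output size of 256. That is consistent with the common InferSent-style combination of the premise embedding `u` and the hypothesis embedding `v` into `(u, v, |u - v|, u * v)`. The exact combination used inside `Classyfer` is not shown here, so this is an assumption. A minimal sketch in plain Python, using lists instead of tensors and the hypothetical helper name `combine_features`:

```python
def combine_features(u, v):
    """Concatenate u, v, |u - v| and u * v (element-wise) into one vector.

    Assumption: this mirrors the InferSent-style combination; the real
    Classyfer.forward may combine the embeddings differently.
    """
    if len(u) != len(v):
        raise ValueError("premise and hypothesis embeddings must have equal length")
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    product = [a * b for a, b in zip(u, v)]
    return list(u) + list(v) + abs_diff + product


# Two toy 256-dimensional "sentence embeddings".
u = [0.5] * 256
v = [-1.0] * 256
features = combine_features(u, v)
print(len(features))  # 1024
```

With 256-dimensional inputs the result has 4 × 256 = 1024 entries, matching `in_features=1024` in the printed architecture.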

In [60]:
premise = "Harvey Specter is solving a case and he thinks about bribing the police."
hypothesis = "Harvey corrupts the police."

import nltk
import json

# First we tokenize the sentences
premise = nltk.tokenize.word_tokenize(premise)
hypothesis = nltk.tokenize.word_tokenize(hypothesis)

# Then we lowercase all the tokens
premise = [token.lower() for token in premise]
hypothesis = [token.lower() for token in hypothesis]

# We load the vocabulary
with open("data/data.json", 'r') as file:
    vocab = json.load(file)

from data import prepare_example

# This function maps each token to its index so that it is recognized by the embedding table that was used to train the model.
premise = prepare_example(premise, vocab)
hypothesis = prepare_example(hypothesis, vocab)

print(premise)
print(hypothesis)


tensor([[35960,     0,    12,  8140,     2,  2828,    26,    73,  7664,   761,
             0,    35,  2271,    11]])
tensor([[35960,     0,    35,  2271,    11]])
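
`prepare_example` lives in the repository's `data` module and is not shown here. The sketch below is a hypothetical stand-in that illustrates the idea: each token is looked up in a token-to-index dictionary, and tokens missing from the vocabulary fall back to an unknown-token index. Using index 0 for unknown tokens is an assumption, suggested by the zeros in the printed tensors. The toy vocabulary is invented for illustration, and the function returns a nested list rather than a tensor.

```python
UNK_INDEX = 0  # assumed id for out-of-vocabulary tokens


def tokens_to_indices(tokens, vocab, unk_index=UNK_INDEX):
    """Map lowercase tokens to integer ids, as a batch of size one."""
    return [[vocab.get(token, unk_index) for token in tokens]]


# Toy vocabulary, invented for this example (ids copied from the output above).
toy_vocab = {"harvey": 35960, "the": 35, "police": 2271, ".": 11}

print(tokens_to_indices(["harvey", "corrupts", "the", "police", "."], toy_vocab))
# [[35960, 0, 35, 2271, 11]]
```

If that reading of index 0 is right, "specter", "bribing", and "corrupts" were out of vocabulary in the demo above, so the model would not have seen the verb of the hypothesis.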


In [61]:
import torch.nn as nn

softmax = nn.Softmax(dim=1)
with torch.no_grad():  
    output = model(premise, hypothesis)
    probabilities = softmax(output)
    print(probabilities)


Combined features shape: torch.Size([1, 1024])
tensor([[0.8275, 0.1419, 0.0305]])


The model assigns the highest probability (0.8275) to the first class, which corresponds to entailment.
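
To make such outputs easier to read, the probability vector can be turned into a label. The sketch below assumes the class order (entailment, neutral, contradiction). Only the first position is confirmed by the text above, so the `LABELS` tuple and the helper name `decode_prediction` are assumptions to adjust to the training setup.

```python
LABELS = ("entailment", "neutral", "contradiction")  # assumed class order


def decode_prediction(probabilities, labels=LABELS):
    """Return (label, probability) for the most likely class."""
    if len(probabilities) != len(labels):
        raise ValueError("expected one probability per label")
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best], probabilities[best]


# Probabilities printed by the cell above, e.g. via probabilities.squeeze(0).tolist()
print(decode_prediction([0.8275, 0.1419, 0.0305]))
# ('entailment', 0.8275)
```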

## Part 2: Error Analysis 

We conduct error analysis to identify where the models succeed and fail in predicting the entailment relationships between premises and hypotheses.

In [40]:
import nltk
import json
from data import prepare_example

# Define initial sentences for premise and hypothesis
premise_1 = "Two men sitting in the sun"
hypothesis_1 = "Nobody is sitting in the shade"

premise_2 = "A man is walking a dog"
hypothesis_2 = "No cat is outside"

# Tokenization of the sentences
tokens_premise_1 = nltk.tokenize.word_tokenize(premise_1)
tokens_hypothesis_1 = nltk.tokenize.word_tokenize(hypothesis_1)

tokens_premise_2 = nltk.tokenize.word_tokenize(premise_2)
tokens_hypothesis_2 = nltk.tokenize.word_tokenize(hypothesis_2)

# Convert tokens to lowercase
tokens_premise_1 = [token.lower() for token in tokens_premise_1]
tokens_hypothesis_1 = [token.lower() for token in tokens_hypothesis_1]

tokens_premise_2 = [token.lower() for token in tokens_premise_2]
tokens_hypothesis_2 = [token.lower() for token in tokens_hypothesis_2]

# Load vocabulary from a JSON file
with open("data/data.json", 'r') as file:
    vocabulary = json.load(file)

# Prepare examples by converting tokens to indices using the vocabulary
indexed_premise_1 = prepare_example(tokens_premise_1, vocabulary)
indexed_hypothesis_1 = prepare_example(tokens_hypothesis_1, vocabulary)

indexed_premise_2 = prepare_example(tokens_premise_2, vocabulary)
indexed_hypothesis_2 = prepare_example(tokens_hypothesis_2, vocabulary)

# Output the processed and indexed premises and hypotheses
print(indexed_premise_1)
print(indexed_hypothesis_1)
print(indexed_premise_2)
print(indexed_hypothesis_2)


tensor([[  83,  452,  102,   41,   35, 1370]])
tensor([[ 317,   12,  102,   41,   35, 3608]])
tensor([[  2,  55,  12, 252,   2, 377]])
tensor([[309, 383,  12, 204]])


In [46]:
import torch.nn as nn

softmax = nn.Softmax(dim=1)
with torch.no_grad():  # Ensure no gradients are computed during inference
    output = model(indexed_premise_1, indexed_hypothesis_1)
    probabilities = softmax(output)
    print(probabilities)


Combined features shape: torch.Size([1, 1024])
tensor([[2.7491e-04, 5.1067e-03, 9.9462e-01]])


In [47]:
import torch.nn as nn

softmax = nn.Softmax(dim=1)
with torch.no_grad():  # Ensure no gradients are computed during inference
    output = model(indexed_premise_2, indexed_hypothesis_2)
    probabilities = softmax(output)
    print(probabilities)


Combined features shape: torch.Size([1, 1024])
tensor([[9.4919e-07, 7.0206e-05, 9.9993e-01]])


The model appears to struggle with interpreting scenarios involving negation and the absence of an action or entity. For instance, in the first example, the premise "Two men sitting in the sun" doesn't necessarily imply that "Nobody is sitting in the shade". However, the model seems to interpret the presence of negation words like "Nobody" as indicating a contradiction. This might suggest that the model lacks a nuanced understanding of how negation interacts with different contexts to produce a neutral outcome rather than a contradiction.

Similarly, the second example "A man is walking a dog" being related to "No cat is outside" presents a case where the absence mentioned in the hypothesis doesn't logically contradict the premise. The model's decision to predict a contradiction instead of a neutral response might indicate a lack of understanding of scenarios where the presence of one entity doesn't necessarily exclude the presence of another. This could be a limitation in the model’s training where it was not exposed to enough diverse examples that specifically teach this kind of logical separation.

These failures can often stem from a model’s training data not adequately representing complex linguistic structures like negation or from an embedding layer that doesn't capture the necessary contextual cues to distinguish between unrelated statements effectively. The model may also be influenced by biases in the dataset, where the presence of certain keywords biases predictions towards contradiction.
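
One way to probe the keyword-bias explanation is to look at the label distribution of training hypotheses that contain a negation word. The sketch below does this on a handful of made-up examples. The `NEGATION_WORDS` set, the helper name, and the toy data are all assumptions; on the real corpus the (hypothesis, label) pairs would come from the SNLI training split.

```python
from collections import Counter

NEGATION_WORDS = {"no", "not", "nobody", "never", "none", "n't"}  # illustrative list


def label_counts_with_negation(examples, negation_words=NEGATION_WORDS):
    """Count labels over examples whose hypothesis contains a negation word.

    `examples` is an iterable of (hypothesis_tokens, label) pairs with
    lowercase tokens.
    """
    counts = Counter()
    for tokens, label in examples:
        if negation_words.intersection(tokens):
            counts[label] += 1
    return counts


# Made-up examples, only to show the mechanics.
toy_examples = [
    (["nobody", "is", "outside"], "contradiction"),
    (["the", "man", "is", "not", "sleeping"], "contradiction"),
    (["no", "cat", "is", "outside"], "neutral"),
    (["a", "dog", "runs"], "entailment"),
]

counts = label_counts_with_negation(toy_examples)
total = sum(counts.values())
for label, count in counts.most_common():
    print(f"{label}: {count}/{total}")
```

A distribution heavily skewed toward contradiction on the real data would support the bias explanation; a roughly uniform one would point elsewhere.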

Improving the model's performance on such tasks could involve enriching the training set with more examples that challenge its understanding of context and negation, possibly incorporating synthetic data crafted to specifically address these weaknesses. Moreover, enhancing the model's architecture to better integrate broader contextual and world knowledge could also help in better predicting neutral labels in cases where the hypothesis is neither clearly entailed nor directly contradicted by the premise.

To further explore the hypothesis that the model may struggle with understanding negation and context-specific nuances, especially in distinguishing between neutral and contradiction labels, we can design a small set of experiments. 

Experiment Uno

- Objective: To assess how well the model understands negation in different contexts.
- Procedure: Create a set of test pairs with clear negation but varying contexts to see if the model's predictions change based on context. Include pairs where the negation leads to contradiction, neutral, and entailment outcomes based on logical reasoning.

Experiment Dos

- Objective: To investigate if changing the context around a negation affects model predictions.
- Procedure: Use the same negation in different contexts to see whether the model interprets it consistently or whether context shifts its interpretation. Evaluate how changes in the surrounding context influence the prediction.
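
For Experiment Dos, one way to keep the negation fixed while varying the context is to pair every negated hypothesis with every premise. The following sketch builds such a grid. The sentences are illustrative, the helper name `build_negation_grid` is hypothetical, and expected labels still have to be assigned by hand before scoring.

```python
from itertools import product


def build_negation_grid(premises, negated_hypotheses):
    """Pair each premise with each negated hypothesis."""
    return list(product(premises, negated_hypotheses))


premises = ["The room was crowded.", "She was alone at home."]
negated_hypotheses = ["No one was in the room.", "Nobody was at the park."]

grid = build_negation_grid(premises, negated_hypotheses)
for premise, hypothesis in grid:
    print(f"{premise!r} -> {hypothesis!r}")
```

Each pair can then be passed to the notebook's prediction function, so that the same hypothesis is scored against several premises.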


In [68]:
import nltk
import json
import torch
import torch.nn as nn
from data import prepare_example

# Define premises and hypotheses for the experiments
test_cases = [
    ("The room was crowded.", "No one was in the room."),
    ("The room was crowded.", "No one was outside."),
    ("She was alone at home.", "Nobody was with her."),
    ("She was alone at home.", "Nobody was at the park.")
]

# Function to prepare data and make predictions
def prepare_and_predict(premise, hypothesis):
    # Tokenize and preprocess
    tokens_premise = nltk.tokenize.word_tokenize(premise)
    tokens_hypothesis = nltk.tokenize.word_tokenize(hypothesis)
    
    # Convert tokens to lowercase
    tokens_premise = [token.lower() for token in tokens_premise]
    tokens_hypothesis = [token.lower() for token in tokens_hypothesis]

    # Load vocabulary from a JSON file
    with open("data/data.json", 'r') as file:
        vocabulary = json.load(file)

    # Prepare examples by converting tokens to indices using the vocabulary
    indexed_premise = prepare_example(tokens_premise, vocabulary)
    indexed_hypothesis = prepare_example(tokens_hypothesis, vocabulary)

    # Use the model to predict
    softmax = nn.Softmax(dim=1)
    with torch.no_grad():  # Ensure no gradients are computed during inference
        output = model(indexed_premise, indexed_hypothesis)
        probabilities = softmax(output)
        return probabilities

# Test each case
for premise, hypothesis in test_cases:
    result = prepare_and_predict(premise, hypothesis)
    print(f"Premise: {premise}")
    print(f"Hypothesis: {hypothesis}")
    print("Probabilities:", result)
    print()


Combined features shape: torch.Size([1, 1024])
Premise: The room was crowded.
Hypothesis: No one was in the room.
Probabilities: tensor([[0.0030, 0.0106, 0.9864]])

Combined features shape: torch.Size([1, 1024])
Premise: The room was crowded.
Hypothesis: No one was outside.
Probabilities: tensor([[3.0224e-04, 1.6048e-03, 9.9809e-01]])

Combined features shape: torch.Size([1, 1024])
Premise: She was alone at home.
Hypothesis: Nobody was with her.
Probabilities: tensor([[0.3350, 0.1502, 0.5148]])

Combined features shape: torch.Size([1, 1024])
Premise: She was alone at home.
Hypothesis: Nobody was at the park.
Probabilities: tensor([[0.0014, 0.0402, 0.9584]])



A) Experiment Uno (Understanding Negation in Different Contexts):

The first and second tests, involving the premise "The room was crowded," examine how the model handles negation when it either directly contradicts the premise or is contextually unrelated. Both hypotheses include negations ("No one was in the room" and "No one was outside"), but their relations to the premise differ: the first is a contradiction, the second is neutral. In both cases the model puts nearly all probability mass on the third class (0.9864 and 0.9981). The first class corresponds to entailment, and the same third class dominated the two assignment examples, where a contradiction prediction was expected, so the third class is most plausibly contradiction. Under that reading, the model is right on the first pair and wrong on the second: it predicts contradiction for a hypothesis that is merely unrelated to the premise. This suggests that the model reacts to the negation itself rather than assessing its logical impact in context, possibly because of training biases.

B) Experiment Dos (Impact of Context on Negation Interpretation):

The third and fourth tests use "She was alone at home" with negations of different contextual relevance ("Nobody was with her" directly supports the premise, while "Nobody was at the park" is contextually irrelevant). For the entailed hypothesis the model is uncertain: the third class still receives the highest probability (0.5148), with 0.3350 on entailment. For the irrelevant hypothesis it assigns 0.9584 to the third class, where neutrality was expected. The context does shift the probabilities, but not enough to change the predicted class.

From these four pairs, the model appears to exhibit the hypothesized weaknesses:

- There appears to be a consistent bias toward the third class (contradiction, if the outputs are ordered entailment, neutral, contradiction) when the hypothesis contains a negation, regardless of whether it contradicts, is unrelated to, or is entailed by the premise.
- The model struggles with properly interpreting negation in different contexts, especially distinguishing between contradicting and unrelated scenarios.
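
The reported probabilities can be checked against the expected labels directly. The sketch below copies the four probability vectors from the output above and assumes the class order (entailment, neutral, contradiction). The expected labels follow the descriptions of the test pairs (contradicting, unrelated, supporting, irrelevant). Both the label order and the helper name `predicted_label` are assumptions.

```python
from collections import Counter

LABELS = ("entailment", "neutral", "contradiction")  # assumed class order

# (premise, hypothesis, expected label, probabilities reported above)
results = [
    ("The room was crowded.", "No one was in the room.", "contradiction",
     [0.0030, 0.0106, 0.9864]),
    ("The room was crowded.", "No one was outside.", "neutral",
     [3.0224e-04, 1.6048e-03, 9.9809e-01]),
    ("She was alone at home.", "Nobody was with her.", "entailment",
     [0.3350, 0.1502, 0.5148]),
    ("She was alone at home.", "Nobody was at the park.", "neutral",
     [0.0014, 0.0402, 0.9584]),
]


def predicted_label(probabilities, labels=LABELS):
    """Return the label with the highest probability."""
    return labels[max(range(len(labels)), key=lambda i: probabilities[i])]


predictions = [predicted_label(probs) for _, _, _, probs in results]
correct = sum(
    pred == expected for pred, (_, _, expected, _) in zip(predictions, results)
)

print(Counter(predictions))                 # Counter({'contradiction': 4})
print(f"{correct}/{len(results)} correct")  # 1/4 correct
```

Under the assumed order, all four predictions fall on the third class, and only the first pair matches its expected label.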