# Results, analysis and demonstration

## Quantitative

In this section we present and discuss quantitative results obtained by our models.

### Reading the SentEval results
As a prerequisite we show how to read and summarize results from SentEval evaluations. These will be discussed in more detail in the next section.

In [1]:
import os
from pathlib import Path
import pickle
import numpy as np


SENTEVAL_RESULTS_PATH = Path("./senteval_results")

def summarize_senteval_results(results_path):
    metrics = ['MR', 'CR', 'SUBJ', 'MPQA', 'SST2', 'TREC', 'MRPC', 'SICKEntailment']
    # Summarize all results available in SENTEVAL_RESULTS_PATH
    result_fnames = [fname for fname in os.listdir(results_path) if os.path.isfile(results_path / fname)]

    summaries = []
    for fname in result_fnames:
        with open(results_path / fname, 'rb') as handle:
            results = pickle.load(handle)

            dev_accs = [results[metric]["devacc"] for metric in metrics]

            # 'macro' score: plain (unweighted) average of dev accuracies
            macro_avg = np.average(dev_accs)

            # 'micro' score: average of dev accuracies weighted by dev sample size
            sample_sizes = np.array([results[metric]["ndev"] for metric in metrics])
            weights = sample_sizes / np.sum(sample_sizes)
            micro_avg = np.average(dev_accs, weights=weights)

            summaries.append((fname, macro_avg, micro_avg))

    return summaries

summaries = summarize_senteval_results(SENTEVAL_RESULTS_PATH)

# Print one table row per result file: name, macro average, micro average
for fname, macro_avg, micro_avg in summaries:
    print(f"| {Path(fname).stem} | {macro_avg:.2f} | {micro_avg:.2f} |")

| baseline-avg | 78.13 | 79.71 |
| baseline-zero | 78.27 | 79.64 |
| bilstm-avg | 77.92 | 78.46 |
| bilstm-zero | 77.73 | 78.47 |
| bimaxlstm-avg | 79.52 | 79.99 |
| bimaxlstm-zero | 79.63 | 80.41 |
| unilstm-avg | 74.01 | 74.76 |
| unilstm-zero | 74.48 | 75.11 |


### Quantitative Results

We evaluate four types of encoders (baseline, unilstm, bilstm, bimaxlstm). For more information on the encoders and their default parameters we refer to `/models/sentence_encoders.py`.

As GloVe does not come with a pre-trained embedding for unknown tokens, we initially carried out experiments using the (300-dimensional) zero-vector. However, GloVe embeddings are not centered around zero, so it is not obvious that the zero-vector is the best choice. An arguably more neutral embedding is the vector obtained by averaging over all embeddings in the GloVe dataset.

To evaluate the impact of this choice, we tested all models with both of these embeddings for the unknown-token.
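
The two candidates for the unknown-token's embedding can be illustrated with a minimal sketch. The helper `unknown_embedding` and the toy matrix are hypothetical and stand in for the real GloVe vectors; the project's actual vocabulary code may be organised differently.

```python
import numpy as np


def unknown_embedding(embeddings: np.ndarray, strategy: str = "zero") -> np.ndarray:
    """Return a vector to use for out-of-vocabulary tokens.

    embeddings: array of shape (vocab_size, dim) holding pre-trained vectors.
    strategy:   "zero" for the zero-vector, "average" for the mean of all rows.
    """
    if strategy == "zero":
        return np.zeros(embeddings.shape[1], dtype=embeddings.dtype)
    if strategy == "average":
        return embeddings.mean(axis=0)
    raise ValueError(f"Unknown strategy: {strategy!r}")


# Toy stand-in for the GloVe matrix (assumption: three tokens, two dimensions)
toy_embeddings = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])

zero_unk = unknown_embedding(toy_embeddings, "zero")
avg_unk = unknown_embedding(toy_embeddings, "average")
print(zero_unk, avg_unk)
```

Because the toy vectors are not centered around zero, the averaged vector differs clearly from the zero-vector, which is the situation described above for GloVe.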

The table below summarizes the results our best models have obtained on SNLI and SentEval datasets.

| Encoder       | unknown-emb   | Parameters    | Classifier Parameters | SNLI val acc  | SentEval macro    | SentEval micro|
| -----------   | -----------   |-----------    | -----------           | -----------   | -----------       | -----------   |
| baseline      | zero          | 0             | 0.6 M                 | 73.82         | 78.27             | 79.64         |
| unilstm       | zero          | 19.3 M        | 4.2 M                 | 83.65         | 74.48             | 75.11         |
| bilstm        | zero          | 38.5 M        | 8.4 M                 | 83.46         | 77.73             | 78.47         |
| bimaxlstm     | zero          | 38.5 M        | 8.4 M                 | 86.89         | **79.63**         | **80.41**     |
| baseline      | average       | 0             | 0.6 M                 | 74.17         | 78.13             | 79.71         |
| unilstm       | average       | 19.3 M        | 4.2 M                 | 83.63         | 74.01             | 74.76         |
| bilstm        | average       | 38.5 M        | 8.4 M                 | 83.40         | 77.92             | 78.46         |
| bimaxlstm     | average       | 38.5 M        | 8.4 M                 | **86.93**     | 79.52             | 79.99         |

The "SentEval micro" and "SentEval macro" scores follow the definitions used for Table 3 in Conneau et al. (2017):
> In this section, we refer to ”micro” and ”macro” averages of development set (dev) results on transfer tasks whose metrics is accuracy: we compute a ”macro” aggregated score that corresponds to the classical average of dev accuracies, and the ”micro” score that is a sum of the dev accuracies, weighted by the number of dev samples.
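
Written out, with $a_t$ the dev accuracy on task $t$, $n_t$ its number of dev samples and $T$ the number of tasks, the two aggregates can be expressed as follows. The symbols are introduced here only for illustration; the normalisation of the weights mirrors the `weights = sample_sizes / np.sum(sample_sizes)` step in the first notebook cell.

$$
\text{macro} = \frac{1}{T}\sum_{t=1}^{T} a_t,
\qquad
\text{micro} = \frac{\sum_{t=1}^{T} n_t \, a_t}{\sum_{t=1}^{T} n_t}
$$

For example, two tasks with accuracies 80 and 90 and dev sizes 100 and 300 give a macro score of 85.0 and a micro score of (100 · 80 + 300 · 90) / 400 = 87.5, so the micro score leans towards the larger task.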

Below is a screenshot of the training metrics, which are publicly available at https://wandb.ai/kieron-kretschmar/representations-from-nli

<img src="images/training_curves.jpg" width="1200" alt="Training curves">

### Analysis of quantitative results

#### Performances on SNLI 
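On the SNLI dataset, the ranking of the models from highest to lowest validation accuracy is the same for both unknown-token embeddings: bimaxlstm, unilstm, bilstm, baseline. The unilstm and the bilstm are within roughly 0.2 percentage points of each other. This order only roughly corresponds to the parameter counts of the models: the bilstm has twice as many encoder parameters as the unilstm but scores marginally lower.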
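
As a rough sanity check on the encoder parameter counts in the table, the size of a single-layer LSTM can be computed by hand. The sketch assumes 300-dimensional inputs (the GloVe dimension mentioned earlier) and a hidden size of 2048 per direction. The hidden size is an assumption borrowed from Conneau et al. (2017), not something stated in this notebook; the authoritative values are the defaults in `/models/sentence_encoders.py`. It also assumes PyTorch's LSTM layout with two bias vectors per layer. `lstm_parameter_count` is a hypothetical helper.

```python
def lstm_parameter_count(input_size: int, hidden_size: int, bidirectional: bool = False) -> int:
    """Number of parameters of a single-layer LSTM in PyTorch's layout.

    Each direction has four gates, each with an input-to-hidden matrix,
    a hidden-to-hidden matrix and two bias vectors (bias_ih and bias_hh).
    """
    per_direction = 4 * (
        input_size * hidden_size      # input-to-hidden weights
        + hidden_size * hidden_size   # hidden-to-hidden weights
        + 2 * hidden_size             # two bias vectors
    )
    return per_direction * (2 if bidirectional else 1)


# Assumed sizes: 300-dimensional inputs, 2048 hidden units per direction
uni = lstm_parameter_count(300, 2048)
bi = lstm_parameter_count(300, 2048, bidirectional=True)
print(f"{uni / 1e6:.1f} M, {bi / 1e6:.1f} M")
```

Under these assumptions the counts come out at about 19.3 M and 38.5 M, in line with the encoder sizes listed in the table.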

What is surprising, however, is that our validation accuracies are higher than those reported by Conneau et al. (2017) for the comparable models.

#### Performances on SentEval
On the SentEval tasks the results are unexpected throughout. The baseline scores surprisingly high, outperforming both the unilstm and the bilstm on both aggregate scores. The reason for this is not clear to us and could be investigated in future work.

The best performance is obtained by the bimaxlstm with the zero unknown-embedding.

#### Comparison with results from Table 3 from Conneau et al. (2017)
Conneau et al. (2017) performed similar experiments. Our unilstm corresponds to their LSTM, and our bimaxlstm corresponds to their BiLSTM-Max. Their other models have not been reproduced.

When comparing our results to those of the original authors, there are clear differences. Our validation accuracies on the SNLI dataset are roughly 2 percentage points higher for all comparable models, whereas our scores on the SentEval tasks are usually 4-6 percentage points lower.

We suspect that one way to obtain higher scores on the SentEval tasks lies in changing how the vocabulary is aligned. We chose to align the vocabulary only with the words from the SNLI corpus for which GloVe embeddings are available. We deliberately did not include words outside SNLI that appear in downstream tasks, because this project is about evaluating *general* sentence representations, for which the downstream tasks (e.g. SentEval) are unknown at training time. We would expect the performance on SentEval to go up if the test vocabulary were known during training. We suspect this because even though the encoder would never see those exact tokens during training, it might learn useful information from having seen *similar* tokens' embeddings.
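
The alignment described here can be sketched in a few lines. This is an illustration under assumptions, not the project's vocabulary code: `align_vocabulary`, `coverage` and the toy data are hypothetical, and the real pipeline works on the full SNLI corpus and GloVe file.

```python
def align_vocabulary(corpus_tokens, pretrained_vectors):
    """Keep only tokens that occur in the corpus AND have a pre-trained vector."""
    corpus_vocab = set(corpus_tokens)
    return {
        token: pretrained_vectors[token]
        for token in sorted(corpus_vocab)
        if token in pretrained_vectors
    }


def coverage(tokens, vocab):
    """Fraction of tokens that would not fall back to the unknown-token."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    return sum(token in vocab for token in tokens) / len(tokens)


# Toy stand-ins (assumptions): a tiny "GloVe" table and two tiny corpora
toy_glove = {"a": [0.1], "dog": [0.2], "movie": [0.3], "great": [0.4]}
snli_tokens = ["a", "dog", "runs", "a"]
downstream_tokens = ["a", "great", "movie"]

aligned_vocab = align_vocabulary(snli_tokens, toy_glove)
print(sorted(aligned_vocab))
print(coverage(downstream_tokens, aligned_vocab))
```

In the toy data, "great" and "movie" have pre-trained vectors but are mapped to the unknown-token because they never occur in the training corpus, which is the effect suspected above for SentEval.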

#### The choice for the unknown token
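We did not observe a notable difference in performance between using the average GloVe embedding and the zero-vector as the unknown-token's embedding: on SNLI and on both SentEval aggregates, the two variants of each encoder differ by less than 0.5 percentage points.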

One additional version of the experiment might use a *weighted* average of all GloVe embeddings instead, with weights being determined by the frequency of the corresponding token in e.g. the training corpus. This is left for future work.
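
A minimal sketch of such a frequency-weighted average is given here. It was not part of our experiments; `frequency_weighted_average` and the toy data are hypothetical, and tokens that never occur in the corpus simply receive weight zero.

```python
from collections import Counter

import numpy as np


def frequency_weighted_average(embeddings, corpus_tokens):
    """Average the embeddings, weighting each token by its corpus frequency.

    embeddings:    dict mapping token -> vector (list or array).
    corpus_tokens: iterable of tokens, e.g. from the training corpus.
    """
    counts = Counter(token for token in corpus_tokens if token in embeddings)
    if not counts:
        raise ValueError("No corpus token has an embedding.")
    tokens = list(counts)
    vectors = np.stack([np.asarray(embeddings[token], dtype=float) for token in tokens])
    weights = np.array([counts[token] for token in tokens], dtype=float)
    return np.average(vectors, axis=0, weights=weights)


# Toy stand-ins (assumptions): three embedded tokens and a five-token corpus
toy_embeddings = {"the": [1.0, 0.0], "dog": [0.0, 1.0], "cat": [4.0, 4.0]}
toy_corpus = ["the", "dog", "the", "the", "bird"]

weighted_unk = frequency_weighted_average(toy_embeddings, toy_corpus)
print(weighted_unk)
```

Compared with the plain average, frequent tokens such as "the" dominate the result, so the unknown-token would be placed closer to common words.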

## Qualitative results and error analysis

In this section we present a qualitative discussion of the unilstm (zero) model and its limitations. The model was trained on the NLI task with `train.py`. We load the pre-trained model and evaluate its predictions on the NLI task, where it is given a premise and a hypothesis and has to predict whether the hypothesis is entailed by, contradicts, or is neutral towards the premise.

#### Setup

In [2]:
import os
from pathlib import Path
import pickle
import numpy as np
import torch
from models import NLIModel
from data import SNLIDataModule
import nltk

# Download nltk prerequisite for tokenization if not available already
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

# The checkpoints and vocabulary have to be downloaded separately, see README.md for further instructions
ENCODER_CHECKPOINT_PATH = Path("./checkpoints/best/unilstm-zero.ckpt")
CLASSIFIER_CHECKPOINT_PATH = Path("./checkpoints/best/unilstm-zero-classifier.ckpt")
VOCABULARY_PATH = Path("./cache/vocab-zero.pkl")

In [3]:
# Load encoder
print(f"Loading encoder from {ENCODER_CHECKPOINT_PATH}")
encoder = torch.load(ENCODER_CHECKPOINT_PATH)
encoder.eval()
print("Encoder loaded.")

# Load classifier
print(f"Loading classifier from {CLASSIFIER_CHECKPOINT_PATH}")
classifier = torch.load(CLASSIFIER_CHECKPOINT_PATH)
classifier.eval()
print("Classifier loaded.")

# Load vocabulary
print(f"Loading vocabulary from {VOCABULARY_PATH}")
with open(VOCABULARY_PATH, 'rb') as handle:
    vocab = pickle.load(handle)
print("Vocabulary loaded.")

# Create model with full pipeline for NLI inference
model = NLIModel(encoder=encoder, classifier=classifier)

Loading encoder from checkpoints\best\unilstm-zero.ckpt
Encoder loaded.
Loading classifier from checkpoints\best\unilstm-zero-classifier.ckpt
Classifier loaded.
Loading vocabulary from cache\vocab-zero.pkl
Vocabulary loaded.




In [4]:
# Define a function for doing inference and printing results in a pretty way
def predict_example(nli_model, premise, hypothesis, label = "unspecified"):
    label_meanings = {
        0: "Entailment",
        1: "Neutral",
        2: "Contradiction",
    }

    # Create batch containing the example in the form expected by the model
    premise_tensor = torch.stack([vocab.get_embedding(token) for token in nltk.word_tokenize(premise.lower())], dim=0)
    hypothesis_tensor = torch.stack([vocab.get_embedding(token) for token in nltk.word_tokenize(hypothesis.lower())], dim=0)

    premise_length = premise_tensor.shape[0]
    hypothesis_length = hypothesis_tensor.shape[0]

    example_batch = (
        (torch.nn.utils.rnn.pad_sequence([premise_tensor], batch_first=True), [premise_length]),
        (torch.nn.utils.rnn.pad_sequence([hypothesis_tensor], batch_first=True), [hypothesis_length]),
        None
    )

    # Inference only, so gradient tracking is not needed
    with torch.no_grad():
        logits = nli_model.forward(example_batch)[0]
    predicted_class = logits.argmax().item()

    print(f"Premise: \"{premise}\"")
    print(f"Hypothesis: \"{hypothesis}\"")
    print(f"Correct label: {label}")
    print(f"The model predicts: {label_meanings[predicted_class]}")

#### A simple example
In this section we demonstrate the usage of our models on a simple example. The premise and hypothesis can be modified.

In [5]:
premise = "It is snowing outside my house"
hypothesis = "Today is a great day to go swimming at the beach"
label = "Contradiction"
predict_example(model, premise, hypothesis, label=label)

Premise: "It is snowing outside my house"
Hypothesis: "Today is a great day to go swimming at the beach"
Correct label: Contradiction
The model predicts: Contradiction


#### Opposites and negations
To establish limitations of the model, we choose more difficult examples:

In [6]:
premise = "Two men sitting in the sun"
hypothesis = "Nobody is sitting in the shade"
label = "Neutral"
predict_example(model, premise, hypothesis, label)

Premise: "Two men sitting in the sun"
Hypothesis: "Nobody is sitting in the shade"
Correct label: Neutral
The model predicts: Contradiction


In [7]:
premise = "A man is walking a dog"
hypothesis = "No cat is outside"
label = "Neutral"
predict_example(model, premise, hypothesis, label)

Premise: "A man is walking a dog"
Hypothesis: "No cat is outside"
Correct label: Neutral
The model predicts: Contradiction


In these examples, the hypothesis contains a negated statement about something that sounds like the opposite of a term in the premise (i.e. "sun"<->"Nobody ... shade" and "dog"<->"No cat"). In both cases the model incorrectly predicts a contradiction.

Our intuition is that the model is not picking up the negation, but is predicting a contradiction because of the opposing terms.

(As a side note, "What You See Is All There Is" (WYSIATI), a cognitive bias in humans described by Daniel Kahneman in his book "Thinking, Fast and Slow", could also be applied to explain this behaviour: if the premise were _all_ there is and nothing else, then the hypothesis would be true in both examples.)

To get a better understanding of whether our intuition is right, we make 3 types of changes to the examples and observe the results.

1. When we remove the negations in the hypotheses, we still observe the same mistake:

In [8]:
premise = "Two men sitting in the sun"
hypothesis = "Somebody is sitting in the shade"
label = "Neutral"
predict_example(model, premise, hypothesis, label)

Premise: "Two men sitting in the sun"
Hypothesis: "Somebody is sitting in the shade"
Correct label: Neutral
The model predicts: Contradiction


In [9]:
premise = "A man is walking a dog"
hypothesis = "A cat is outside"
label = "Neutral"
predict_example(model, premise, hypothesis, label)

Premise: "A man is walking a dog"
Hypothesis: "A cat is outside"
Correct label: Neutral
The model predicts: Contradiction


2. When we keep the negations but replace the terms in the hypotheses that sound opposite to a term in the premises ("shade", opposing "sun", becomes "a chair"; "cat", opposing "dog", becomes "tree"), we still observe the same mistake:

In [10]:
premise = "Two men sitting in the sun"
hypothesis = "Nobody is sitting in a chair"
label = "Neutral"
predict_example(model, premise, hypothesis, label)

Premise: "Two men sitting in the sun"
Hypothesis: "Nobody is sitting in a chair"
Correct label: Neutral
The model predicts: Contradiction


In [11]:
premise = "A man is walking a dog"
hypothesis = "No tree is outside"
label = "Neutral"
predict_example(model, premise, hypothesis, label)

Premise: "A man is walking a dog"
Hypothesis: "No tree is outside"
Correct label: Neutral
The model predicts: Contradiction


3. When we apply both adaptations at the same time, the model gets it right:

In [12]:
premise = "Two men sitting in the sun"
hypothesis = "Somebody is sitting in a chair"
label = "Neutral"
predict_example(model, premise, hypothesis, label)

Premise: "Two men sitting in the sun"
Hypothesis: "Somebody is sitting in a chair"
Correct label: Neutral
The model predicts: Neutral


In [13]:
premise = "A man is walking a dog"
hypothesis = "A cloud is outside"
label = "Neutral"
predict_example(model, premise, hypothesis, label)

Premise: "A man is walking a dog"
Hypothesis: "A cloud is outside"
Correct label: Neutral
The model predicts: Neutral
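
The three types of changes above amount to a small two-by-two grid: negation kept or removed, crossed with the opposing term kept or replaced. A minimal sketch of how such a grid could be generated and scored is given here. `build_variants`, `run_grid` and `dummy_predict` are hypothetical names, and the sketch assumes a prediction function that returns the label as a string, which `predict_example` as defined above does not do (it only prints). The dummy predictor is a stand-in and says nothing about the trained model.

```python
from itertools import product


def build_variants(subjects, objects, template="{subject} is sitting in {obj}"):
    """Cross every subject phrase with every object phrase into a hypothesis."""
    return {
        (subject_name, object_name): template.format(subject=subject_text, obj=object_text)
        for (subject_name, subject_text), (object_name, object_text)
        in product(subjects.items(), objects.items())
    }


def run_grid(predict_fn, premise, variants, gold_label):
    """For each variant, record whether predict_fn returns the gold label."""
    return {
        key: predict_fn(premise, hypothesis) == gold_label
        for key, hypothesis in variants.items()
    }


subjects = {"negated": "Nobody", "affirmative": "Somebody"}
objects = {"opposite": "the shade", "unrelated": "a chair"}
variants = build_variants(subjects, objects)


def dummy_predict(premise, hypothesis):
    # Stand-in, NOT the trained model: always answers "Neutral"
    return "Neutral"


outcome = run_grid(dummy_predict, "Two men sitting in the sun", variants, "Neutral")
for key, hypothesis in variants.items():
    print(key, hypothesis, outcome[key])
```

Replacing `dummy_predict` with a wrapper around the trained model would make it straightforward to extend the grid with more premises and term pairs than the handful tested by hand here.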


### Conclusion of qualitative analysis
Our conclusion from these examples is that while the model performs well on simple examples, as further supported by the quantitative results, it can easily be fooled. We have gathered anecdotal evidence that the model is particularly susceptible to giving wrong predictions when negations or opposing words are involved.