# Quantitative results and analysis

In [1]:
# First we show how to read and summarize results from SentEval evaluations
# These will be summarized in more detail in the next section
import os
from pathlib import Path
import pickle
import numpy as np


SENTEVAL_RESULTS_PATH = Path("./senteval_results")

def summarize_senteval_results(results_path):
    """Return a list of (fname, macro_avg, micro_avg) tuples, one per pickled result file."""
    results_path = Path(results_path)
    metrics = ['MR', 'CR', 'SUBJ', 'MPQA', 'SST2', 'TREC', 'MRPC', 'SICKEntailment']
    # Summarize all result files available in results_path (sorted for a stable output order)
    result_fnames = sorted(fname for fname in os.listdir(results_path) if os.path.isfile(results_path / fname))

    summaries = []
    for fname in result_fnames:
        with open(results_path / fname, 'rb') as handle:
            results = pickle.load(handle)

        dev_accs = [results[metric]["devacc"] for metric in metrics]

        # 'macro' score: plain (unweighted) average of the dev accuracies
        macro_avg = np.average(dev_accs)

        # 'micro' score: average of the dev accuracies weighted by the number of dev samples
        sample_sizes = np.array([results[metric]["ndev"] for metric in metrics])
        weights = sample_sizes / np.sum(sample_sizes)
        micro_avg = np.average(dev_accs, weights=weights)

        summaries.append((fname, macro_avg, micro_avg))

    return summaries

summaries = summarize_senteval_results(SENTEVAL_RESULTS_PATH)

# Print one table row per result file: name, macro average, micro average
for fname, macro_avg, micro_avg in summaries:
    print(f"| {Path(fname).stem} | {macro_avg:.2f} | {micro_avg:.2f} |")

| baseline-avg | 78.13 | 79.71 |
| baseline-zero | 78.27 | 79.64 |
| bilstm-avg | 77.92 | 78.46 |
| bilstm-zero | 77.73 | 78.47 |
| bimaxlstm-avg | 79.52 | 79.99 |
| bimaxlstm-zero | 79.63 | 80.41 |
| unilstm-avg | 74.01 | 74.76 |
| unilstm-zero | 74.48 | 75.11 |


## Results

We compare four types of encoders (baseline, unilstm, bilstm, bimaxlstm). For more information on the encoders and their default parameters, see `/models/sentence_encoders.py`.

As GloVe does not come with a pre-trained embedding for unknown tokens, we initially carried out experiments using the (300-dimensional) zero-vector. However, the GloVe embeddings are not centered around zero, so it is not obvious that the zero-vector is the best choice. An arguably more neutral embedding is the vector obtained by averaging over all embeddings in the GloVe dataset.

To evaluate the impact of this choice, we tested all models with both of these embeddings for the unknown token.
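
The two candidate embeddings for the unknown token can be sketched as follows. This is a minimal illustration, not the code used in our experiments: the toy `glove` matrix and the helper `unknown_embedding` are hypothetical stand-ins for the real 300-dimensional GloVe vectors and the actual loading code.

```python
import numpy as np


def unknown_embedding(embeddings, strategy="zero"):
    """Return a vector to use for unknown tokens (hypothetical helper).

    embeddings: array of shape (vocab_size, dim) with the pre-trained vectors.
    strategy:   "zero" for the zero-vector, "average" for the mean of all vectors.
    """
    if strategy == "zero":
        return np.zeros(embeddings.shape[1], dtype=embeddings.dtype)
    if strategy == "average":
        return embeddings.mean(axis=0)
    raise ValueError(f"unknown strategy: {strategy!r}")


# Toy stand-in for the GloVe matrix: 3 words, 4 dimensions instead of 300.
glove = np.array([
    [0.5, -0.25, 1.0, 0.0],
    [1.5, 0.25, 0.0, 1.0],
    [1.0, 0.0, 0.5, -1.0],
])

unk_zero = unknown_embedding(glove, "zero")
unk_avg = unknown_embedding(glove, "average")
print(unk_zero)
print(unk_avg)
```

In this toy matrix the mean vector is not the zero-vector, which mirrors the observation above that the embeddings are not centered around zero.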

The table below summarizes the results our best models obtained on the SNLI and SentEval datasets.

| Encoder       | unknown-emb   | Encoder parameters | Classifier parameters | SNLI val acc  | SentEval macro    | SentEval micro|
| -----------   | -----------   |-----------         | -----------           | -----------   | -----------       | -----------   |
| baseline      | zero          | 0                  | 0.6 M                 | 73.82         | 78.27             | 79.64         |
| unilstm       | zero          | 19.3 M             | 4.2 M                 | 83.65         | 74.48             | 75.11         |
| bilstm        | zero          | 38.5 M             | 8.4 M                 | 83.46         | 77.73             | 78.47         |
| bimaxlstm     | zero          | 38.5 M             | 8.4 M                 | 86.89         | **79.63**         | **80.41**     |
| baseline      | average       | 0                  | 0.6 M                 | 74.17         | 78.13             | 79.71         |
| unilstm       | average       | 19.3 M             | 4.2 M                 | 83.63         | 74.01             | 74.76         |
| bilstm        | average       | 38.5 M             | 8.4 M                 | 83.40         | 77.92             | 78.46         |
| bimaxlstm     | average       | 38.5 M             | 8.4 M                 | **86.93**     | 79.52             | 79.99         |

"SentEval micro" and "SentEval macro" are defined in line with Table 3 in Conneau et al. (2017):
> In this section, we refer to ”micro” and ”macro” averages of development set (dev) results on transfer tasks whose metrics is accuracy: we compute a ”macro” aggregated score that corresponds to the classical average of dev accuracies, and the ”micro” score that is a sum of the dev accuracies, weighted by the number of dev samples.
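
To make the difference between the two aggregates concrete, here is a small worked example. It is only a sketch: the helper `macro_micro`, the two tasks, and their accuracies and dev-set sizes are made up for illustration and are not SentEval results.

```python
import numpy as np


def macro_micro(dev_accs, n_dev):
    """Return (macro, micro) averages of dev accuracies (hypothetical helper).

    macro: plain average over tasks.
    micro: average over tasks weighted by the number of dev samples per task.
    """
    dev_accs = np.asarray(dev_accs, dtype=float)
    n_dev = np.asarray(n_dev, dtype=float)
    macro = dev_accs.mean()
    micro = np.average(dev_accs, weights=n_dev)
    return macro, micro


# Two made-up tasks: a small one with high accuracy and a large one with lower accuracy.
macro, micro = macro_micro(dev_accs=[90.0, 70.0], n_dev=[100, 300])
print(f"macro = {macro:.2f}, micro = {micro:.2f}")
# prints: macro = 80.00, micro = 75.00
```

The micro average is pulled towards the larger task, (90 · 100 + 70 · 300) / 400 = 75, while the macro average treats both tasks equally.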

## Analysis

On the SNLI dataset, the ranking of the models is the same for both unknown-token embeddings: the bimaxlstm scores highest and the baseline lowest, with the unilstm and bilstm in between. The latter two are within 0.25 points of each other, with the unilstm marginally ahead. This is broadly in line with our expectations from the results obtained by Conneau et al. (2017). What is surprising, however, is that our validation accuracies are higher than theirs.

On the SentEval tasks, the results are unexpected throughout. The baseline scores surprisingly high: it beats the unilstm and, with the zero unknown-embedding, even the bilstm. The best scores, however, are obtained by the bimaxlstm with the zero unknown-embedding. Among the LSTM encoders, the SentEval summaries printed at the top of this section rank the bimaxlstm first, followed by the bilstm and then the unilstm, regardless of the unknown-embedding being used.

When comparing our results to those in Table 3 from Conneau et al. (2017), there are clear differences. Our validation accuracies for the SNLI dataset are, on average, higher, whereas our scores for the SentEval datasets are lower for every model. 

Below is a screenshot of the training curves, which are publicly available at https://wandb.ai/kieron-kretschmar/representations-from-nli

![Training curves](./training_curves.jpg "Training curves")

## Qualitative results and error analysis

A qualitative discussion of the models and their limitations is included in `demo.ipynb`.