In [2]:
import pandas as pd
from utils import InferSentPrediction

## Performance of Sentence Encoder

We evaluate the sentence encoders trained on the SNLI dataset (33% random-baseline accuracy) on the SentEval benchmark.

### SNLI and SentEval accuracies

We report the SentEval micro and macro accuracies. Macro is the unweighted average of the SentEval task accuracies, and micro is the average of the SentEval task accuracies weighted by the number of samples in each task.
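
As a minimal sketch of the two averages, following the usual convention that the macro average is unweighted across tasks and the micro average is weighted by sample counts: the accuracies below are the `awe` row of the detailed SentEval table further down, while `toy_counts` holds hypothetical placeholder sizes (not the real SentEval task sizes) and the function names are illustrative only.

```python
def macro_average(accuracies):
    """Unweighted mean of the per-task accuracies."""
    return sum(accuracies.values()) / len(accuracies)


def micro_average(accuracies, n_samples):
    """Mean of the per-task accuracies weighted by the number of samples per task."""
    total = sum(n_samples[task] for task in accuracies)
    weighted = sum(accuracies[task] * n_samples[task] for task in accuracies)
    return weighted / total


# Accuracy-based tasks for the awe model (SICK-R and STS14 report correlations).
awe_acc = {
    "MR": 74.47, "CR": 77.69, "SUBJ": 89.45, "MPQA": 84.53,
    "SST": 78.58, "TREC": 71.0, "MRPC": 71.48, "SICK-E": 78.16,
}

# Hypothetical sample counts, for illustration only.
toy_counts = {task: 1000 for task in awe_acc}
toy_counts["SUBJ"] = 5000

print(round(macro_average(awe_acc), 1))  # 78.2, matching the transfer-macro column
print(micro_average(awe_acc, toy_counts) > macro_average(awe_acc))  # True: the large task dominates
```

With equal sample counts the two averages coincide; they diverge only when large tasks score differently from small ones.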

In [3]:
pd.read_csv("results.csv")

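| Unnamed: 0 | model | dim | SNLI-dev | SNLI-test | transfer-micro | transfer-macro |
|---|---|---|---|---|---|---|
| 0 | awe | 300 | 65.6 | 65.9 | 81.0 | 78.2 |
| 1 | lstm | 2048 | 80.4 | 79.8 | 79.2 | 76.7 |
| 2 | bilstm | 4096 | 78.2 | 78.6 | 80.4 | 78.3 |
| 3 | bilstm-max | 4096 | 84.1 | 84.0 | 81.9 | 80.8 |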


### Comments
- Increasing the complexity of the model, from Average Word Embedding (AWE) to biLSTM with max pooling, improves the performance on SNLI (from 65.9 to 84.0 test accuracy).


- While mediocre on SNLI, AWE performs very well on the transfer tasks of SentEval (see the detailed table below). We believe this is because half of the tasks are sentiment-related, and positive/negative sentiment information can be captured well by bag-of-words approaches.


- The LSTM and biLSTM perform worse than AWE on the SentEval micro accuracy (79.2 and 80.4 versus 81.0). This might be because the sentences in SNLI are relatively short (a mean of 14 words for the premise and 8 for the hypothesis, with a long tail, so the median should be even lower). Short sentences are a favorable setup for recurrent models, which tend to forget earlier tokens, so the advantage they show on SNLI may not carry over to the transfer tasks.


- The LSTM performing better than the biLSTM on SNLI seems to be an outlier. The biLSTM run may simply have been a bad one, since the performance on the SentEval benchmark suggests that the embeddings generated by the biLSTM contain more information.


- Using the hidden representations of all the tokens is beneficial for the performance of the recurrent models on both the SNLI dataset and the SentEval benchmark.


- It is not clear whether max pooling itself is the reason for the good performance of the bilstm-max model, because, unlike the LSTM and the biLSTM, it also makes use of all the hidden representations. Trying other pooling layers (average, weighted average, etc.) would help separate the two effects.
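
The pooling alternatives mentioned in the comments above can be sketched on a toy sequence of hidden states. This is an illustrative pure-Python sketch, not the code used for the experiments: the function names and the toy values are hypothetical, and `last_state` assumes that the models without pooling keep only the final hidden state.

```python
def max_pool(hidden_states):
    """Element-wise maximum over time steps: one value per hidden dimension."""
    return [max(column) for column in zip(*hidden_states)]


def mean_pool(hidden_states):
    """Element-wise average over time steps: one value per hidden dimension."""
    n_steps = len(hidden_states)
    return [sum(column) / n_steps for column in zip(*hidden_states)]


def last_state(hidden_states):
    """Keep only the final time step (assumed behavior of the models without pooling)."""
    return hidden_states[-1]


# Toy sequence: 3 time steps, hidden size 3 (hypothetical values).
toy_states = [
    [0.1, -0.5, 0.3],
    [0.7, 0.2, -0.1],
    [-0.2, 0.4, 0.0],
]

print(max_pool(toy_states))    # [0.7, 0.4, 0.3]
print(last_state(toy_states))  # [-0.2, 0.4, 0.0]
print(mean_pool(toy_states))
```

Both pooling variants see every time step, so comparing them would isolate the effect of the max operation from the effect of using all the hidden representations. In PyTorch, the equivalents on a `(seq_len, hidden)` tensor `h` would be `torch.max(h, dim=0).values` and `h.mean(dim=0)`.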


### Detailed SentEval performances
We present the detailed performance on the SentEval tasks. All tasks report the test accuracy, except SICK-R and STS14, which report the mean Pearson correlation coefficient.

#### Sentiment-related tasks
- MR: movie review
- CR: product review
- SUBJ: subjectivity status
- MPQA: opinion polarity
- SST: sentiment analysis

#### Semantic-related tasks
- TREC: question-type classification
- MRPC: paraphrase detection
- SICK-R: semantic textual similarity
- SICK-E: natural language inference
- STS14: semantic textual similarity
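
For the two similarity tasks, the Pearson coefficient measures how linearly the predicted similarity scores follow the gold scores. A minimal sketch, assuming that "mean" refers to averaging the coefficient over several subsets of sentence pairs; the helper names and the toy scores are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def mean_pearson(subsets):
    """Average the coefficient over (predicted, gold) score lists, one pair per subset."""
    return sum(pearson(pred, gold) for pred, gold in subsets) / len(subsets)


# Toy predicted and gold similarity scores (hypothetical values).
toy_subsets = [
    ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),            # perfectly correlated
    ([1.0, 2.0, 3.0, 4.0], [1.5, 1.0, 3.5, 3.0]),  # partially correlated
]
print(mean_pearson(toy_subsets))
```

Unlike accuracy, the coefficient lies in [-1, 1], which is why the SICK-R and STS14 columns are on a different scale from the other tasks.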

In [4]:
pd.read_csv("results_senteval.csv")

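| Unnamed: 0 | model | MR | CR | SUBJ | MPQA | SST | TREC | MRPC | SICK-R | SICK-E | STS14 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | awe | 74.47 | 77.69 | 89.45 | 84.53 | 78.58 | 71.0 | 71.48 | 0.797 | 78.16 | 0.501 |
| 1 | lstm | 70.56 | 75.71 | 85.41 | 84.4 | 74.79 | 67.8 | 72.87 | 0.855 | 81.69 | 0.575 |
| 2 | bilstm | 71.24 | 77.3 | 87.55 | 84.71 | 76.77 | 73.8 | 72.06 | 0.859 | 83.3 | 0.549 |
| 3 | bilstm-max | 74.67 | 80.42 | 90.82 | 85.38 | 80.56 | 76.8 | 73.97 | 0.877 | 83.99 | 0.633 |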


### Comments
- Compared to the LSTM and the biLSTM, AWE performs better on most sentiment-related tasks and generally worse on tasks for which the semantics of the sentences are more important. The bilstm-max model outperforms AWE on every task.


- The biLSTM outperforms the simple LSTM on most tasks (8 out of 10, the exceptions being MRPC and STS14), which suggests that the bidirectionality of the representation provides more detailed information.


- Using the hidden representations of all the tokens is beneficial for the performance of the recurrent models on all the tasks in the SentEval benchmark.

In [5]:
prediction = InferSentPrediction("bilstm-max")

Global seed set to 42


InferSent(
  (embeddings): Embedding(36698, 300)
  (encoder): MaxBiLSTMModel(
    (lstm): LSTM(300, 2048, bidirectional=True)
  )
  (classifier): Sequential(
    (0): Linear(in_features=16384, out_features=512, bias=True)
    (1): Linear(in_features=512, out_features=3, bias=True)
    (2): Softmax(dim=1)
  )
)


In [6]:
premise = "two dogs play."
hypothesis = "two animals plays together."
prediction.predict(premise, hypothesis)

'entailment'