# InferSent

InferSent provides semantic representations for English sentences to build sentence embeddings. InferSent is trained on the Stanford Natural Language Inference (SNLI) dataset 
and generalizes well to various downstream NLP tasks. The source code can be found in [InferSent github repository](https://github.com/facebookresearch/InferSent)

In [1]:
!git clone https://github.com/facebookresearch/InferSent

Cloning into 'InferSent'...


In [2]:
%cd InferSent

C:\Users\d072726\Downloads\Master-Thesis\Source code\InferSent


In [3]:
# imports

import torch
from models import InferSent
import numpy as np
import pandas as pd
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()
from rouge.rouge import rouge_n_sentence_level # pip install easy-rouge
from scipy.stats import pearsonr

In [4]:
# imports for preprocessing

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\d072726\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\d072726\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

###  Pre-trained models

InferSent1 is pre-trained with Glove pre-trained word embeddings and InferSent2 is pre-trained with FastText pre-trained word embeddings. The pre-trained InferSent models can be downloaded directly [infersent1 here](https://dl.fbaipublicfiles.com/infersent/infersent1.pkl) and [infersent2 here](https://dl.fbaipublicfiles.com/infersent/infersent2.pkl) or can be downloaded using curl command as below.

The models are saved inside the cloned folder 'InferSent'

In [None]:
!curl -Lo infersent1.pkl https://dl.fbaipublicfiles.com/infersent/infersent1.pkl
!curl -Lo infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl

In [5]:
# Load model

model_version = 2 # 1 for glove based model and 2 for fasttext based model
MODEL_PATH = "infersent%s.pkl" % model_version
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': model_version}
model = InferSent(params_model)
model.load_state_dict(torch.load(MODEL_PATH))

<All keys matched successfully>

In [11]:
# Keep it on CPU or put it on GPU

use_cuda = False # True for GPU devices
model = model.cuda() if use_cuda else model

In [14]:
# If infersent1 -> use GloVe embeddings. If infersent2 -> use FastText embeddings

W2V_PATH = '../../Pretrained_embeddings/glove.840B.300d.txt' if model_version == 1 else '../../Pretrained_embeddings/wiki.en.vec'
model.set_w2v_path(W2V_PATH)

In [15]:
# Load embeddings of K most frequent words

model.build_vocab_k_words(K=1000000)

Vocab size : 1000000


### Load testsets for evaluation

The Automatically generated candidate texts (predictions) from machine translation or text summarization are evaluated against their reference texts. <br> Below are the testsets to be used for evaluation. 

- For **DE-EN** translation, <br> **Candidate-**   '../Testsets/DE-EN/multi30k.test.pred.en.atok'  **Reference-**      '../Testsets/DE-EN/test2016.en.atok'    <br>


- For **RO-EN** translation, <br> **Candidate-**-   '../Testsets/RO-EN/newstest2016_output_1000.en'  **Reference-**    '../Testsets/RO-EN/newstest2016_ref_1000.en'  <br>


- For **CNN-DM** summariation, <br> **Candidate-**   '../Testsets/CNN-DM/preprocessed_1000.pred'  **Reference-** '../Testsets/CNN-DM/preprocessed_1000.ref'  


- For **DUC2003** summarization, <br> **Candidate-**  '../Testsets/DUC2003/duc2003.10_300000-500.txt'  **Reference-** '../Testsets/DUC2003/task1_ref0_duc2003-500.txt'  


- For **Gigaword** summarization (titles), <br>  **Candidate-**  '../Testsets/Gigaword/giga.10_300000_500.txt'  **Reference-** '../Testsets/Gigaword/task1_ref0_giga_500.txt' 

In [22]:
candidate_doc =  '../../Testsets/DE-EN/multi30k.test.pred.en.atok'  
reference_doc = '../../Testsets/DE-EN/test2016.en.atok' 

with  open( candidate_doc ,'r') as cand, open( reference_doc ,'r') as ref:
    candidate_en = cand.readlines()
    reference_en = ref.readlines()   

In [18]:
candidate_en[:5]

['A man in an orange hat presenting something .\n',
 'A Boston traveler runs across lush , green fence in front of a white fence .\n',
 'A girl in a karate uniform is blocking a board with a kick .\n',
 'Five people in winter jackets and helmets are standing in the snow with vials in the background .\n',
 'People moving off the roof of a house .\n']

In [19]:
reference_en[:5]

['A man in an orange hat starring at something .\n',
 'A Boston Terrier is running on lush green grass in front of a white fence .\n',
 'A girl in karate uniform breaking a stick with a front kick .\n',
 'Five people wearing winter jackets and helmets stand in the snow , with snowmobiles in the background .\n',
 'People are fixing the roof of a house .\n']

###  Optional preprocessing

In [20]:
def preprocessing(doc, stop_words_remove=False):
    remove_punctuation = []
    preprocessed_doc = []
    # keep only alphanumeric characters(remove punctuations)
    remove_punctuation = [re.sub(r"[^\w]", " ", sent).lower().strip() for sent in doc] 
    
    if stop_words_remove == True:
        # remove stop words requires lower cased tokens
        stop_words = set(stopwords.words("english"))
        for sent in doc:
            filtered_sentence = [word for word in word_tokenize(sent.lower()) if not word in stop_words]
            preprocessed_doc.append(' '.join(filtered_sentence))
        return preprocessed_doc
    else:
        return remove_punctuation  

In [21]:
# use only if you want to preprocess the sentences

candidate_en = preprocessing(candidate_en, False) # True to remove stopwords, default only removes punctuation
reference_en = preprocessing(reference_en, False) 

### Semantic similarity scores

In [8]:
# gpu mode : >> 1000 sentences/s
# cpu mode : ~100 sentences/s

In [23]:
cand_embedding = model.encode(candidate_en, bsize=128, tokenize=False, verbose=True)
print('nb sentences encoded : {0}'.format(len(cand_embedding)))

Nb words kept : 11577/14737 (78.6%)
Speed : 81.9 sentences/s (cpu mode, bsize=128)
nb sentences encoded : 1000


In [24]:
ref_embedding = model.encode(reference_en, bsize=128, tokenize=False, verbose=True)
print('nb sentences encoded : {0}'.format(len(ref_embedding)))

Nb words kept : 11885/15058 (78.9%)
Speed : 78.2 sentences/s (cpu mode, bsize=128)
nb sentences encoded : 1000


In [25]:
# Cosine similarity function

semantic_scores =[]
for i in range(len(cand_embedding)):
    semantic_scores.append(np.dot(cand_embedding[i],ref_embedding[i]) / (np.linalg.norm(cand_embedding[i])*(np.linalg.norm(ref_embedding[i]))))

### BLEU or ROUGE scores

Use BLEU scores for machine translation evaluation and ROUGE for text summarization evaluation.

In [26]:
# for machine translation evaluation

bleu_scores =[]
for i in range(len(reference_en)):
    bleu_scores.append(sentence_bleu(candidate_en[i],reference_en[i], smoothing_function=smoother.method4))

In [None]:
# for text summarization evaluation

rouge_scores = []
for i in range(len(reference_en)):
    *pr, f = rouge_n_sentence_level(candidate_en[i], reference_en[i], 1) # 2 for ROUGE-2. ROUGE-N, ROUGE-L and ROUGE-W scores can also be obtained.
    rouge_scores.append(f)

### Human annotation scores

Load the human annotation scores from the respective excel files as below,

- For **DE-EN** translation, '../Human annotations/DE-EN.xlsx'


- For **RO-EN** translation, '../Human annotations/RO-EN.xlsx'


- For **CNN-DM** summariation, '../Human annotations/CNN_1000.xlsx'


- For **DUC2003** summarization,  '../Human annotations/DUC2003.xlsx'


- For **Gigaword** summarization (titles),  '../Human annotations/Gigaword.xlsx'


In [28]:
human_annotation = pd.read_excel('../../Human annotations/DE-EN.xlsx')

In [29]:
human_scores = human_annotation.iloc[:, 2].tolist()

### Pearson correlation coefficient

In [30]:
# correlation between human annotated scores and Bleu or ROUGE scores

#pearson correlation value, p-value

pearsonr(human_scores, bleu_scores) # bleu_scores or rouge_scores

(0.3901802069640419, 1.0390848166845472e-37)

In [31]:
# correlation between human annotated scores and semantic similarity scores

pearsonr(human_scores, semantic_scores) # expected to be higher(more correlated) than with Bleu or ROUGE scores

(0.5252636568334494, 5.243967790655788e-72)