# InferSent

InferSent provides semantic representations for English sentences to build sentence embeddings. InferSent is trained on the Stanford Natural Language Inference (SNLI) dataset 
and generalizes well to various downstream NLP tasks. The source code can be found in [InferSent github repository](https://github.com/facebookresearch/InferSent)

In [19]:
%cd ../InferSent

C:\Users\d072726\Documents\Thesis\InferSent


In [20]:
# imports

from random import randint
import numpy as np
import torch
from models import InferSent
import numpy as np
import pandas as pd
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()
from rouge.rouge import rouge_n_sentence_level # pip install easy-rouge
from scipy.stats import pearsonr

In [21]:
# imports for preprocessing
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\d072726\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\d072726\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

###  Pre-trained models

InferSent1 is pre-trained with Glove pre-trained word embeddings and InferSent2 is pre-trained with FastText pre-trained word embeddings. The pre-trained InferSent models can be downloaded directly [infersent1 here](https://dl.fbaipublicfiles.com/infersent/infersent1.pkl) and [infersent2 here](https://dl.fbaipublicfiles.com/infersent/infersent2.pkl) or can be downloaded using curl command as below.

In [None]:
!curl -Lo ../InferSent/encoder/infersent1.pkl https://dl.fbaipublicfiles.com/infersent/infersent1.pkl
!curl -Lo ../InferSent/encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl

In [33]:
# Load model

model_version = 2 # 1 for glove based model and 2 for fasttext based model
MODEL_PATH = "../InferSent/encoder/infersent%s.pkl" % model_version
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': model_version}
model = InferSent(params_model)
model.load_state_dict(torch.load(MODEL_PATH))

<All keys matched successfully>

In [34]:
# Keep it on CPU or put it on GPU

use_cuda = False # True for GPU devices
model = model.cuda() if use_cuda else model

In [35]:
# If infersent1 -> use GloVe embeddings. If infersent2 -> use FastText embeddings

W2V_PATH = '../pretrained_embeddings/glove.840B.300d.txt' if model_version == 1 else '../pretrained_embeddings/wiki.en.vec'
model.set_w2v_path(W2V_PATH)

In [36]:
W2V_PATH

'../pretrained_embeddings/wiki.en.vec'

In [37]:
# Load embeddings of K most frequent words

model.build_vocab_k_words(K=100000)

Vocab size : 100000


### Load testsets for evaluation

The Automatically generated texts (predictions) from machine translation or text summarization are evaluated against their reference texts. <br> Below are the testsets to be used for evaluation. 

- For **DE-EN** translation, <br>  **reference-** '../testsets/de-en/test2016.en.atok'   **prediction-** '../testsets/de-en/multi30k.test.pred.en.atok' <br>


- For **RO-EN** translation, <br>  **reference-** '../testsets/ro-en/newstest2016_ref_1000.en'  **prediction-**- '../testsets/ro-en/newstest2016_output_1000.en'<br>


- For **giga word** summarization(titles), <br>  **reference-** '../testsets/giga/task1_ref0_giga_500.txt'  **prediction-**'../testsets/giga/giga.10_300000_500.txt'


- For **CNN-DM** summariation, <br>  **reference-** '../testsets/cnn/preprocessed_1000.ref'  **prediction-** '../testsets/cnn/preprocessed_1000.pred'


- For **Duc 2003** summarization, <br>  **reference-** '../testsets/duc/task1_ref0_duc2003-500.txt'  **prediction-** '../testsets/duc/duc2003.10_300000-500.txt'

In [52]:
reference_doc = '../testsets/ro-en/newstest2016_ref_1000.en'
prediction_doc =  '../testsets/ro-en/newstest2016_output_1000.en'  

with open( reference_doc ,'r') as ref, open( prediction_doc ,'r') as pred:
    reference_en = ref.readlines()
    prediction_en = pred.readlines()

In [53]:
reference_en

['UN Chief Says There Is No Military Solution in Syria\n',
 'Secretary-General Ban Ki-moon says his response to Russia\'s stepped up military support for Syria is that "there is no military solution" to the nearly five-year conflict and more weapons will only worsen the violence and misery for millions of people .\n',
 'The U .N . chief again urged all parties, including the divided U .N . Security Council, to unite and support inclusive negotiations to find a political solution .\n',
 "Ban told a news conference Wednesday that he plans to meet with foreign ministers of the five permanent council nations - the U .S ., Russia, China, Britain and France - on the sidelines of the General Assembly's ministerial session later this month to discuss Syria .\n",
 'He expressed regret that divisions in the council and among the Syrian people and regional powers "made this situation unsolvable ."\n',
 'Ban urged the five permanent members to show the solidarity and unity they did in achieving an

In [40]:
prediction_en[:5]

['UN chief says no military solutions to Syria\n',
 "Secretary General Ban Ki moon says Russia's response to Russia's military support for Syria is that there is no military solution to the conflict lasting nearly five years and more weapons would only exacerbate violence and suffering millions of people .\n",
 'The UN chief urged all parties again , including the divided UN Security Council to unify and support negotiations in order to find a political solution .\n',
 "Ban said at a conference Wednesday that he planned to meet with foreign ministers from five permanent countries permanently present on the council this month - the US , Russia , China , England and France - on the edge of the General Assembly's ministerial session to discuss Syria .\n",
 'Ban voiced regret that divisions within the council and between Syrian people and regional powers have made this intractable situation .\n']

###  Optional preprocessing

In [21]:
def preprocessing(doc, stop_words_remove=False):
    remove_punctuation = []
    preprocessed_doc = []
    # keep only alphanumeric characters(remove punctuations)
    remove_punctuation = [re.sub(r"[^\w]", " ", sent).lower().strip() for sent in doc] 
    
    if stop_words_remove == True:
        # remove stop words requires lower cased tokens
        stop_words = set(stopwords.words("english"))
        for sent in remove_punctuation:
            filtered_sentence = [word for word in word_tokenize(sent.lower()) if not word in stop_words]
            preprocessed_doc.append(' '.join(filtered_sentence))
        return preprocessed_doc
    else:
        return remove_punctuation  

In [50]:
# use only if you want to preprocess the sentences

reference_en = preprocessing(reference_en, True) # True to remove stopwords, default only removes punctuation
prediction_en = preprocessing(prediction_en, True)

### Semantic similarity scores

In [8]:
# gpu mode : >> 1000 sentences/s
# cpu mode : ~100 sentences/s

In [50]:
len(reference_en)

999

In [54]:
ref_embedding = model.encode(reference_en, bsize=128, tokenize=False, verbose=True)
print('nb sentences encoded : {0}'.format(len(ref_embedding)))

Nb words kept : 15771/22274 (70.8%)


KeyError: '</p>'

In [70]:
pred_embedding = model.encode(prediction_en, bsize=128, tokenize=False, verbose=True)
print('nb sentences encoded : {0}'.format(len(pred_embedding)))

Nb words kept : 17025/22735 (74.9%)
Speed : 86.7 sentences/s (cpu mode, bsize=128)
nb sentences encoded : 1000


In [53]:
semantic_scores =[]
for i in range(len(ref_embedding)):
    semantic_scores.append(np.dot(ref_embedding[i],pred_embedding[i]) / (np.linalg.norm(ref_embedding[i])*(np.linalg.norm(pred_embedding[i]))))

### BLEU or ROUGE scores

Use BLEU scores for machine translation evaluation and ROUGE for text summarization evaluation.

In [16]:
# for machine translation evaluation
bleu_scores =[]
for i in range(len(reference_en)):
    bleu_scores.append(sentence_bleu(reference_en[i],prediction_en[i], smoothing_function=smoother.method4))

In [None]:
# for text summarization evaluation
rouge_scores = []
for i in range(len(reference_en)):
    *pr, f = rouge_n_sentence_level(prediction_en[i], reference_en[i], 2) # 2 for ROUGE-2. ROUGE-N, ROUGE-L and ROUGE-W scores can also be obtained.
    rouge_scores.append(f)

### Human annotation scores

Load the human annotation scores from the respective excel files as below,

- For **DE-EN** translation, '../human annotated/DE-EN.xlsx'


- For **RO-EN** translation, '../human annotated/RO-EN.xlsx'


- For **giga word** summarization(titles),'../human annotated/giga.xlsx'


- For **CNN-DM** summariation, '../human annotated/CNN_1000.xlsx'


- For **Duc 2003** summarization,  '../human annotated/duc2003.xlsx'


In [17]:
human_annotation = pd.read_excel('../human annotated/DE-EN.xlsx')

In [18]:
human_scores = human_annotation.iloc[:, 2].tolist()

### Pearson correlation coefficient

In [19]:
# correlation between human annotated scores and Bleu or ROUGE scores

#pearson correlation value, p-value
pearsonr(human_scores, bleu_scores) #bleu_scores or rouge_scores

(0.28439322985388266, 4.638694382037051e-20)

In [54]:
# correlation between human annotated scores and semantic similarity scores

pearsonr(human_scores, semantic_scores) # expected to be higher(more correlated) than with Bleu or ROUGE scores

(0.6053289356796738, 5.216603897862332e-101)