# InferSent

InferSent provides semantic representations for English sentences to build sentence embeddings. InferSent is trained on the Stanford Natural Language Inference (SNLI) dataset 
and generalizes well to various downstream NLP tasks. The source code can be found in [InferSent github repository](https://github.com/facebookresearch/InferSent)

In [1]:
%cd ../InferSent

C:\Users\d072726\Documents\Thesis\InferSent


In [120]:
# imports

from random import randint
import numpy as np
import torch
from models import InferSent
import numpy as np
import pandas as pd
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()
from rouge.rouge import rouge_n_sentence_level # pip install easy-rouge
from scipy.stats import pearsonr

In [121]:
# imports for preprocessing
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\d072726\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\d072726\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

###  Pre-trained models

InferSent1 is pre-trained with Glove pre-trained word embeddings and InferSent2 is pre-trained with FastText pre-trained word embeddings. The pre-trained InferSent models can be downloaded directly [infersent1 here](https://dl.fbaipublicfiles.com/infersent/infersent1.pkl) and [infersent2 here](https://dl.fbaipublicfiles.com/infersent/infersent2.pkl) or can be downloaded using curl command as below.

In [None]:
!curl -Lo ../InferSent/encoder/infersent1.pkl https://dl.fbaipublicfiles.com/infersent/infersent1.pkl
!curl -Lo ../InferSent/encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl

In [197]:
# Load model

model_version = 2 # 1 for glove based model and 2 for fasttext based model
MODEL_PATH = "../InferSent/encoder/infersent%s.pkl" % model_version
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': model_version}
model = InferSent(params_model)
model.load_state_dict(torch.load(MODEL_PATH))

<All keys matched successfully>

In [198]:
# Keep it on CPU or put it on GPU

use_cuda = False # True for GPU devices
model = model.cuda() if use_cuda else model

In [199]:
# If infersent1 -> use GloVe embeddings. If infersent2 -> use FastText embeddings

W2V_PATH = '../pretrained_embeddings/glove.840B.300d.txt' if model_version == 1 else '../pretrained_embeddings/wiki.en.vec'
model.set_w2v_path(W2V_PATH)

In [200]:
W2V_PATH

'../pretrained_embeddings/wiki.en.vec'

In [201]:
# Load embeddings of K most frequent words

model.build_vocab_k_words(K=1000000)

Vocab size : 1000000


### Load testsets for evaluation

The Automatically generated candidate texts (predictions) from machine translation or text summarization are evaluated against their reference texts. <br> Below are the testsets to be used for evaluation. 

- For **DE-EN** translation, <br> **Candidate-**   '../Testsets/DE-EN/multi30k.test.pred.en.atok'  **Reference-**      '../Testsets/DE-EN/test2016.en.atok'    <br>


- For **RO-EN** translation, <br> **Candidate-**-   '../Testsets/RO-EN/newstest2016_output_1000.en'  **Reference-**    '../Testsets/RO-EN/newstest2016_ref_1000.en'  <br>


- For **CNN-DM** summariation, <br> **Candidate-**   '../Testsets/CNN-DM/preprocessed_1000.pred'  **Reference-** '../Testsets/CNN-DM/preprocessed_1000.ref'  


- For **DUC2003** summarization, <br> **Candidate-**  '../Testsets/DUC2003/duc2003.10_300000-500.txt'  **Reference-** '../Testsets/DUC2003/task1_ref0_duc2003-500.txt'  


- For **Gigaword** summarization (titles), <br>  **Candidate-**  '../Testsets/Gigaword/giga.10_300000_500.txt'  **Reference-** '../Testsets/Gigaword/task1_ref0_giga_500.txt' 

In [206]:
reference_doc = '../testsets/ro-en/newstest2016_ref_1000.en'
prediction_doc =  '../testsets/ro-en/newstest2016_output_1000.en'  

with open( reference_doc ,'r') as ref, open( prediction_doc ,'r') as pred:
    reference_en = ref.readlines()
    prediction_en = pred.readlines()

In [67]:
reference_en[:5]

['Magnetic treatment may ease or lessen occurrence of schizophrenic voices.\n',
 'Evidence shows schizophrenia caused by gene abnormalities of Chromosome 1.\n',
 'Researchers examining evidence of link between schizophrenia and nicotine addiction.\n',
 'Scientists focusing on chemical environment of brain to understand schizophrenia.\n',
 "Schizophrenia study shows disparity between what's known and what's provided to patients.\n"]

In [68]:
prediction_en[:5]

['schizophrenia patients in the brains of their own area of their brains\n',
 'scientists say scientists say scientists say they are linked to <UNK> schizophrenia\n',
 'yale studies link between schizophrenia and nicotine addiction in study of schizophrenia\n',
 'study of <UNK> mental disease in new study of <UNK> mental disease\n',
 'schizophrenia may be most of mental diseases study finds study finds study\n']

###  Optional preprocessing

In [186]:
def preprocessing(doc, stop_words_remove=False):
    remove_punctuation = []
    preprocessed_doc = []
    # keep only alphanumeric characters(remove punctuations)
    remove_punctuation = [re.sub(r"[^\w]", " ", sent).lower().strip() for sent in doc] 
    
    if stop_words_remove == True:
        # remove stop words requires lower cased tokens
        stop_words = set(stopwords.words("english"))
        for sent in doc:
            filtered_sentence = [word for word in word_tokenize(sent.lower()) if not word in stop_words]
            preprocessed_doc.append(' '.join(filtered_sentence))
        return preprocessed_doc
    else:
        return remove_punctuation  

In [192]:
# use only if you want to preprocess the sentences

reference_en = preprocessing(reference_en, True) # True to remove stopwords, default only removes punctuation
prediction_en = preprocessing(prediction_en, True)

### Semantic similarity scores

In [8]:
# gpu mode : >> 1000 sentences/s
# cpu mode : ~100 sentences/s

In [207]:
ref_embedding = model.encode(reference_en, bsize=128, tokenize=False, verbose=True)
print('nb sentences encoded : {0}'.format(len(ref_embedding)))

Nb words kept : 16145/22279 (72.5%)


KeyError: '</p>'

In [None]:
pred_embedding = model.encode(prediction_en, bsize=128, tokenize=False, verbose=True)
print('nb sentences encoded : {0}'.format(len(pred_embedding)))

In [None]:
semantic_scores =[]
for i in range(len(ref_embedding)):
    semantic_scores.append(np.dot(ref_embedding[i],pred_embedding[i]) / (np.linalg.norm(ref_embedding[i])*(np.linalg.norm(pred_embedding[i]))))

### BLEU or ROUGE scores

Use BLEU scores for machine translation evaluation and ROUGE for text summarization evaluation.

In [16]:
# for machine translation evaluation
bleu_scores =[]
for i in range(len(reference_en)):
    bleu_scores.append(sentence_bleu(reference_en[i],prediction_en[i], smoothing_function=smoother.method4))

In [None]:
# for text summarization evaluation
rouge_scores = []
for i in range(len(reference_en)):
    *pr, f = rouge_n_sentence_level(prediction_en[i], reference_en[i], 2) # 2 for ROUGE-2. ROUGE-N, ROUGE-L and ROUGE-W scores can also be obtained.
    rouge_scores.append(f)

### Human annotation scores

Load the human annotation scores from the respective excel files as below,

- For **DE-EN** translation, '../Human annotations/DE-EN.xlsx'


- For **RO-EN** translation, '../Human annotations/RO-EN.xlsx'


- For **CNN-DM** summariation, '../Human annotations/CNN_1000.xlsx'


- For **DUC2003** summarization,  '../Human annotations/DUC2003.xlsx'


- For **Gigaword** summarization (titles),  '../Human annotations/Gigaword.xlsx'


In [183]:
human_annotation = pd.read_excel('../human annotated/duc2003.xlsx')

In [184]:
human_scores = human_annotation.iloc[:, 2].tolist()

### Pearson correlation coefficient

In [19]:
# correlation between human annotated scores and Bleu or ROUGE scores

#pearson correlation value, p-value
pearsonr(human_scores, bleu_scores) #bleu_scores or rouge_scores

(0.28439322985388266, 4.638694382037051e-20)

In [205]:
# correlation between human annotated scores and semantic similarity scores

pearsonr(human_scores, semantic_scores) # expected to be higher(more correlated) than with Bleu or ROUGE scores

(0.3039586616678434, 3.435720500550048e-12)