# Smooth Inverse Frequency (SIF)

SIF shows a simple unsupervised method for sentence embedding can get results better than sophisticated supervised models like RNN's and LSTM's with a modification of weights for supervised models. This weighting improves performance by about 10% to 30% in textual similarity tasks.  SIF weighting scheme is shown below.

It has new "smoothing" terms that allow for words occurring out of context, as well as high probabilities for words like and, not in all contexts. 

In [None]:
!git clone https://github.com/PrincetonML/SIF.git

In [36]:
%cd ../SIF/src

[WinError 3] The system cannot find the path specified: '../SIF/src'
C:\Users\d072726\Documents\Thesis\SIF\src


In [112]:
# imports
import data_io, params, SIF_embedding
import numpy as np
from gensim.models import KeyedVectors
import pandas as pd
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()
from rouge.rouge import rouge_n_sentence_level # pip install easy-rouge
from scipy.stats import pearsonr

In [4]:
# imports for preprocessing
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>
[nltk_data] Error loading punkt: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


False

### Pretrained word embeddings

- Download fasttext pretrained word embeddings [here](https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec)
- Download glove pretrained word embeddings [here](http://nlp.stanford.edu/data/glove.840B.300d.zip)

Unzip the glove embeddings and save the embeddings in a folder pretrained_embeddings.

In [5]:
#For convience convert the fasttext and glove embeddings to word2vec format

#for using fasttext embeddings
fasttext_model = KeyedVectors.load_word2vec_format('../../pretrained_embeddings/wiki.en.vec')

INFO:gensim.models.utils_any2vec:loading projection weights from ../../pretrained_embeddings/wiki.en.vec
INFO:gensim.models.utils_any2vec:loaded (2519370, 300) matrix from ../../pretrained_embeddings/wiki.en.vec


In [35]:
#for using glove embeddings

glove_file = '../../pretrained_embeddings/glove.840B.300d.txt'
tmp_file = '../../pretrained_embeddings/glove_word2vec.txt'
#_ = glove2word2vec(glove_file, tmp_file)

glove_model = KeyedVectors.load_word2vec_format(tmp_file)

INFO:gensim.models.utils_any2vec:loading projection weights from ../../pretrained_embeddings/glove_word2vec.txt
INFO:gensim.models.utils_any2vec:duplicate words detected, shrinking matrix size from 2196017 to 2196016
INFO:gensim.models.utils_any2vec:loaded (2196016, 300) matrix from ../../pretrained_embeddings/glove_word2vec.txt


In [86]:
weightfile = '../auxiliary_data/enwiki_vocab_min200.txt' # each line is a word and its frequency
weightpara = 1e-3 # the parameter in the SIF weighting scheme, usually in the range [3e-5, 3e-3]
rmpc = 1 # number of principal components to remove in SIF weighting scheme

In [87]:
def get_embedding_matrix(model):
    Vocab = dict()
    embedding = []
    for i, word in enumerate(model.vocab):
        embedding.append(model[word])       
        Vocab[word] = i
    embedding_matrix = np.array(embedding)
    return embedding_matrix,Vocab

In [113]:
# Separate the embeddings and words 
embedding_matrix, vocab = get_embedding_matrix(glove_model) #glove_model for using glove embeddings

In [114]:
embedding_matrix.shape

(2196016, 300)

### Load testsets for evaluation

The Automatically generated candidate texts (predictions) from machine translation or text summarization are evaluated against their reference texts. <br> Below are the testsets to be used for evaluation. 

- For **DE-EN** translation, <br> **Candidate-**   '../Testsets/DE-EN/multi30k.test.pred.en.atok'  **Reference-**      '../Testsets/DE-EN/test2016.en.atok'    <br>


- For **RO-EN** translation, <br> **Candidate-**-   '../Testsets/RO-EN/newstest2016_output_1000.en'  **Reference-**    '../Testsets/RO-EN/newstest2016_ref_1000.en'  <br>


- For **CNN-DM** summariation, <br> **Candidate-**   '../Testsets/CNN-DM/preprocessed_1000.pred'  **Reference-** '../Testsets/CNN-DM/preprocessed_1000.ref'  


- For **DUC2003** summarization, <br> **Candidate-**  '../Testsets/DUC2003/duc2003.10_300000-500.txt'  **Reference-** '../Testsets/DUC2003/task1_ref0_duc2003-500.txt'  


- For **Gigaword** summarization (titles), <br>  **Candidate-**  '../Testsets/Gigaword/giga.10_300000_500.txt'  **Reference-** '../Testsets/Gigaword/task1_ref0_giga_500.txt' 

In [134]:
reference_doc = '../../testsets/duc/task1_ref0_duc2003-500.txt'
prediction_doc = '../../testsets/duc/duc2003.10_300000-500.txt' 

with open( reference_doc ,'r') as ref, open( prediction_doc ,'r') as pred:
    reference_en = ref.readlines()
    prediction_en = pred.readlines()

###  Optional preprocessing

In [123]:
def preprocessing(doc, stop_words_remove=False):
    remove_punctuation = []
    preprocessed_doc = []
    # keep only alphanumeric characters(remove punctuations)
    remove_punctuation = [re.sub(r"[^\w]", " ", sent).lower().strip() for sent in doc] 
    
    if stop_words_remove == True:
        # remove stop words requires lower cased tokens
        stop_words = set(stopwords.words("english"))
        for sent in doc:
            filtered_sentence = [word for word in word_tokenize(sent.lower()) if not word in stop_words]
            preprocessed_doc.append(' '.join(filtered_sentence))
        return preprocessed_doc
    else:
        return remove_punctuation  

In [147]:
# use only if you want to preprocess the sentences

reference_en = preprocessing(reference_en, True) # True to remove stopwords, default only removes punctuation
prediction_en = preprocessing(prediction_en, True)

### Semantic similarity scores

In [116]:
# load word weights
word2weight = data_io.getWordWeight(weightfile, weightpara) # word2weight['str'] is the weight for the word 'str'
weight4ind = data_io.getWeight(vocab, word2weight) # weight4ind[i] is the weight for the i-th word

In [117]:
def seq2weight(seq, mask, weight4ind):
    weight = np.zeros(seq.shape).astype('float32')
    for i in range(seq.shape[0]):
        for j in range(seq.shape[1]):
            if mask[i,j] > 0 and seq[i,j] >= 0:
                weight[i,j] = weight4ind[seq[i,j]]
    weight = np.asarray(weight, dtype='float32')
    return weight

In [118]:
# set parameters
params = params.params()
params.rmpc = rmpc

In [148]:
# load reference sentences
x, m = data_io.sentences2idx(reference_en, vocab) # x is the array of word indices, m is the binary mask indicating whether there is a word in that location
w = seq2weight(x, m, weight4ind) # get word weights

# get SIF embedding
embedding_ref = SIF_embedding.SIF_embedding(embedding_matrix, x, w, params) 

In [149]:
# load prediction sentences
x, m = data_io.sentences2idx(prediction_en, vocab) # x is the array of word indices, m is the binary mask indicating whether there is a word in that location
w = seq2weight(x, m, weight4ind) # get word weights

# get SIF embedding
embedding_pred = SIF_embedding.SIF_embedding(embedding_matrix, x, w, params) 

In [150]:
semantic_scores =[]
for i in range(len(embedding_ref)):
    semantic_scores.append(np.dot(embedding_ref[i],embedding_pred[i]) / (np.linalg.norm(embedding_ref[i])*(np.linalg.norm(embedding_pred[i]))))

### BLEU or ROUGE scores

Use BLEU scores for machine translation evaluation and ROUGE for text summarization evaluation.

In [22]:
# for machine translation evaluation
bleu_scores =[]
for i in range(len(reference_en)):
    bleu_scores.append(sentence_bleu(reference_en[i],prediction_en[i], smoothing_function=smoother.method4))

In [96]:
# for text summarization evaluation
rouge_scores = []
for i in range(len(reference_en)):
    *pr, f = rouge_n_sentence_level(prediction_en[i], reference_en[i], 2) # 2 for ROUGE-2. ROUGE-N, ROUGE-L and ROUGE-W scores can also be obtained.
    rouge_scores.append(f)

### Human annotation scores

Load the human annotation scores from the respective excel files as below,

- For **DE-EN** translation, '../Human annotations/DE-EN.xlsx'


- For **RO-EN** translation, '../Human annotations/RO-EN.xlsx'


- For **CNN-DM** summariation, '../Human annotations/CNN_1000.xlsx'


- For **DUC2003** summarization,  '../Human annotations/DUC2003.xlsx'


- For **Gigaword** summarization (titles),  '../Human annotations/Gigaword.xlsx'


In [139]:
human_annotation = pd.read_excel('../../human annotated/duc2003.xlsx')

In [140]:
human_scores = human_annotation.iloc[:, 2].tolist()

### Pearson correlation coefficient

In [99]:
# correlation between human annotated scores and Bleu or ROUGE scores

#pearson correlation value, p-value
pearsonr(human_scores, rouge_scores) #bleu_scores or rouge_scores

(0.2618920908180919, 2.7559701344461234e-09)

In [151]:
# correlation between human annotated scores and semantic similarity scores

pearsonr(human_scores, semantic_scores) # expected to be higher(more correlated) than with Bleu or ROUGE scores

(0.14251804601732798, 0.0013668344947902107)