# Smooth Inverse Frequency (SIF)

SIF shows a simple unsupervised method for sentence embedding can get results better than sophisticated supervised models like RNN's and LSTM's with a modification of weights for supervised models. This weighting improves performance by about 10% to 30% in textual similarity tasks.  SIF weighting scheme is shown below.

It has new "smoothing" terms that allow for words occurring out of context, as well as high probabilities for words like and, not in all contexts. 

In [1]:
!git clone https://github.com/PrincetonML/SIF.git

Cloning into 'SIF'...


In [1]:
%cd SIF/src

C:\Users\d072726\Downloads\Master-Thesis\Source code\SIF\src


In [2]:
# imports

import data_io, params, SIF_embedding
from gensim.models import KeyedVectors
import numpy as np
import pandas as pd
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()
from rouge.rouge import rouge_n_sentence_level # pip install easy-rouge
from scipy.stats import pearsonr

In [3]:
# imports for preprocessing

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\d072726\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\d072726\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Pretrained word embeddings

- Download fasttext pretrained word embeddings [here](https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec)
- Download glove pretrained word embeddings [here](http://nlp.stanford.edu/data/glove.840B.300d.zip)

Unzip the glove embeddings and save the embeddings in a folder Pretrained_embeddings.

In [4]:
#For convience convert the fasttext and glove embeddings to word2vec format

#for using fasttext embeddings

fasttext_model = KeyedVectors.load_word2vec_format('../../../Pretrained_embeddings/wiki.en.vec')

INFO:gensim.models.utils_any2vec:loading projection weights from ../../../Pretrained_embeddings/wiki.en.vec
INFO:gensim.models.utils_any2vec:loaded (2519370, 300) matrix from ../../../Pretrained_embeddings/wiki.en.vec


In [35]:
#for using glove embeddings

glove_file = '../../../Pretrained_embeddings/glove.840B.300d.txt'
tmp_file = '../../../Pretrained_embeddings/glove_word2vec.txt'
#_ = glove2word2vec(glove_file, tmp_file)

glove_model = KeyedVectors.load_word2vec_format(tmp_file)

INFO:gensim.models.utils_any2vec:loading projection weights from ../../pretrained_embeddings/glove_word2vec.txt
INFO:gensim.models.utils_any2vec:duplicate words detected, shrinking matrix size from 2196017 to 2196016
INFO:gensim.models.utils_any2vec:loaded (2196016, 300) matrix from ../../pretrained_embeddings/glove_word2vec.txt


In [5]:
weightfile = '../auxiliary_data/enwiki_vocab_min200.txt' # each line is a word and its frequency
weightpara = 1e-3 # the parameter in the SIF weighting scheme, usually in the range [3e-5, 3e-3]
rmpc = 1 # number of principal components to remove in SIF weighting scheme

In [6]:
def get_embedding_matrix(model):
    Vocab = dict()
    embedding = []
    for i, word in enumerate(model.vocab):
        embedding.append(model[word])       
        Vocab[word] = i
    embedding_matrix = np.array(embedding)
    return embedding_matrix,Vocab

In [7]:
# Separate the embeddings and words 
embedding_matrix, vocab = get_embedding_matrix(fasttext_model) #glove_model for using glove embeddings

In [8]:
embedding_matrix.shape

(2519370, 300)

### Load testsets for evaluation

The Automatically generated candidate texts (predictions) from machine translation or text summarization are evaluated against their reference texts. <br> Below are the testsets to be used for evaluation. 

- For **DE-EN** translation, <br> **Candidate-**   '../Testsets/DE-EN/multi30k.test.pred.en.atok'  **Reference-**      '../Testsets/DE-EN/test2016.en.atok'    <br>


- For **RO-EN** translation, <br> **Candidate-**-   '../Testsets/RO-EN/newstest2016_output_1000.en'  **Reference-**    '../Testsets/RO-EN/newstest2016_ref_1000.en'  <br>


- For **CNN-DM** summariation, <br> **Candidate-**   '../Testsets/CNN-DM/preprocessed_1000.pred'  **Reference-** '../Testsets/CNN-DM/preprocessed_1000.ref'  


- For **DUC2003** summarization, <br> **Candidate-**  '../Testsets/DUC2003/duc2003.10_300000-500.txt'  **Reference-** '../Testsets/DUC2003/task1_ref0_duc2003-500.txt'  


- For **Gigaword** summarization (titles), <br>  **Candidate-**  '../Testsets/Gigaword/giga.10_300000_500.txt'  **Reference-** '../Testsets/Gigaword/task1_ref0_giga_500.txt' 

In [9]:
candidate_doc =  '../../../Testsets/DE-EN/multi30k.test.pred.en.atok'  
reference_doc = '../../../Testsets/DE-EN/test2016.en.atok' 

with  open( candidate_doc ,'r') as cand, open( reference_doc ,'r') as ref:
    candidate_en = cand.readlines()
    reference_en = ref.readlines()   

In [10]:
candidate_en[:5]

['A man in an orange hat presenting something .\n',
 'A Boston traveler runs across lush , green fence in front of a white fence .\n',
 'A girl in a karate uniform is blocking a board with a kick .\n',
 'Five people in winter jackets and helmets are standing in the snow with vials in the background .\n',
 'People moving off the roof of a house .\n']

In [11]:
reference_en[:5]

['A man in an orange hat starring at something .\n',
 'A Boston Terrier is running on lush green grass in front of a white fence .\n',
 'A girl in karate uniform breaking a stick with a front kick .\n',
 'Five people wearing winter jackets and helmets stand in the snow , with snowmobiles in the background .\n',
 'People are fixing the roof of a house .\n']

###  Optional preprocessing

In [12]:
def preprocessing(doc, stop_words_remove=False):
    remove_punctuation = []
    preprocessed_doc = []
    # keep only alphanumeric characters(remove punctuations)
    remove_punctuation = [re.sub(r"[^\w]", " ", sent).lower().strip() for sent in doc] 
    
    if stop_words_remove == True:
        # remove stop words requires lower cased tokens
        stop_words = set(stopwords.words("english"))
        for sent in doc:
            filtered_sentence = [word for word in word_tokenize(sent.lower()) if not word in stop_words]
            preprocessed_doc.append(' '.join(filtered_sentence))
        return preprocessed_doc
    else:
        return remove_punctuation  

In [13]:
# use only if you want to preprocess the sentences

candidate_en = preprocessing(candidate_en, False) # True to remove stopwords, default only removes punctuation
reference_en = preprocessing(reference_en, False) 

### Semantic similarity scores

In [12]:
# load word weights

word2weight = data_io.getWordWeight(weightfile, weightpara) # word2weight['str'] is the weight for the word 'str'
weight4ind = data_io.getWeight(vocab, word2weight) # weight4ind[i] is the weight for the i-th word

In [13]:
def seq2weight(seq, mask, weight4ind):
    weight = np.zeros(seq.shape).astype('float32')
    for i in range(seq.shape[0]):
        for j in range(seq.shape[1]):
            if mask[i,j] > 0 and seq[i,j] >= 0:
                weight[i,j] = weight4ind[seq[i,j]]
    weight = np.asarray(weight, dtype='float32')
    return weight

In [14]:
# set parameters
params = params.params()
params.rmpc = rmpc

In [15]:
# load candidate sentences
# change iteritems to items and xrange to range in the source code of SIF for python3

x, m = data_io.sentences2idx(candidate_en, vocab) # x is the array of word indices, m is the binary mask indicating whether there is a word in that location
w = seq2weight(x, m, weight4ind) # get word weights

# get SIF embedding

cand_embedding = SIF_embedding.SIF_embedding(embedding_matrix, x, w, params) 

In [16]:
# load reference sentences

x, m = data_io.sentences2idx(reference_en, vocab) # x is the array of word indices, m is the binary mask indicating whether there is a word in that location
w = seq2weight(x, m, weight4ind) # get word weights

# get SIF embedding

ref_embedding = SIF_embedding.SIF_embedding(embedding_matrix, x, w, params) 

In [19]:
# Cosine similarity function

semantic_scores =[]
for i in range(len(cand_embedding)):
    semantic_scores.append(np.dot(cand_embedding[i],ref_embedding[i]) / (np.linalg.norm(cand_embedding[i])*(np.linalg.norm(ref_embedding[i]))))

### BLEU or ROUGE scores

Use BLEU scores for machine translation evaluation and ROUGE for text summarization evaluation.

In [20]:
# for machine translation evaluation

bleu_scores =[]
for i in range(len(reference_en)):
    bleu_scores.append(sentence_bleu(candidate_en[i],reference_en[i], smoothing_function=smoother.method4))

In [96]:
# for text summarization evaluation

rouge_scores = []
for i in range(len(reference_en)):
    *pr, f = rouge_n_sentence_level(candidate_en[i], reference_en[i], 1) # 2 for ROUGE-2. ROUGE-N, ROUGE-L and ROUGE-W scores can also be obtained.
    rouge_scores.append(f)

### Human annotation scores

Load the human annotation scores from the respective excel files as below,

- For **DE-EN** translation, '../Human annotations/DE-EN.xlsx'


- For **RO-EN** translation, '../Human annotations/RO-EN.xlsx'


- For **CNN-DM** summariation, '../Human annotations/CNN_1000.xlsx'


- For **DUC2003** summarization,  '../Human annotations/DUC2003.xlsx'


- For **Gigaword** summarization (titles),  '../Human annotations/Gigaword.xlsx'


In [23]:
human_annotation = pd.read_excel('../../../Human annotations/DE-EN.xlsx')

In [24]:
human_scores = human_annotation.iloc[:, 2].tolist()

### Pearson correlation coefficient

In [25]:
# correlation between human annotated scores and Bleu or ROUGE scores

#pearson correlation value, p-value

pearsonr(human_scores, bleu_scores) # bleu_scores or rouge_scores

(0.3901802069640419, 1.0390848166845472e-37)

In [26]:
# correlation between human annotated scores and semantic similarity scores

pearsonr(human_scores, semantic_scores) # expected to be higher(more correlated) than with Bleu or ROUGE scores

(0.6511392184339738, 1.110425667360541e-121)