# Word Mover's Distance (WMD)

Word Mover's Distance  measures the distance between two documents or their word embeddings in a meaningful way, even if they have no words in common


Usually, one measures the distance between two word or sentence vectors using the cosine distance , which measures the angle between vectors. 
WMD, on the other hand, uses the Euclidean distance.  The Euclidean distance between two vectors might be large because their lengths differ, but the cosine distance is small because the angle between them is small, we can mitigate some of this by normalizing the vectors.

In [1]:
#Imports
import warnings
warnings.filterwarnings('ignore')

from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.similarities import WmdSimilarity
from pyemd import emd
import pandas as pd
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()
from rouge.rouge import rouge_n_sentence_level # pip install easy-rouge
from scipy.stats import pearsonr

In [2]:
# imports for preprocessing
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\d072726\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\d072726\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Pretrained word embeddings

- Download fasttext pretrained word embeddings [here](https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec)
- Download glove pretrained word embeddings [here](http://nlp.stanford.edu/data/glove.840B.300d.zip)

Unzip the glove embeddings and save the embeddings in a folder pretrained_embeddings.

In [3]:
#For convience convert the fasttext and glove embeddings to word2vec format

#for using fasttext embeddings
fasttext_model = KeyedVectors.load_word2vec_format('../pretrained_embeddings/wiki.en.vec')

INFO:gensim.models.utils_any2vec:loading projection weights from ../pretrained_embeddings/wiki.en.vec
INFO:gensim.models.utils_any2vec:loaded (2519370, 300) matrix from ../pretrained_embeddings/wiki.en.vec


In [30]:
#for using glove embeddings

glove_file = '../pretrained_embeddings/glove.840B.300d.txt'
tmp_file = '../pretrained_embeddings/glove_word2vec.txt'
#_ = glove2word2vec(glove_file, tmp_file)

glove_model = KeyedVectors.load_word2vec_format(tmp_file)

INFO:gensim.models.utils_any2vec:loading projection weights from ../pretrained_embeddings/glove_word2vec.txt
INFO:gensim.models.utils_any2vec:duplicate words detected, shrinking matrix size from 2196017 to 2196016
INFO:gensim.models.utils_any2vec:loaded (2196016, 300) matrix from ../pretrained_embeddings/glove_word2vec.txt


In [31]:
#normalizing the vectors
glove_model.init_sims(replace=True) #glove_model for using glove embeddings

INFO:gensim.models.keyedvectors:precomputing L2-norms of word weight vectors


### Load testsets for evaluation

The Automatically generated candidate texts (predictions) from machine translation or text summarization are evaluated against their reference texts. <br> Below are the testsets to be used for evaluation. 

- For **DE-EN** translation, <br> **Candidate-**   '../Testsets/DE-EN/multi30k.test.pred.en.atok'  **Reference-**      '../Testsets/DE-EN/test2016.en.atok'    <br>


- For **RO-EN** translation, <br> **Candidate-**-   '../Testsets/RO-EN/newstest2016_output_1000.en'  **Reference-**    '../Testsets/RO-EN/newstest2016_ref_1000.en'  <br>


- For **CNN-DM** summariation, <br> **Candidate-**   '../Testsets/CNN-DM/preprocessed_1000.pred'  **Reference-** '../Testsets/CNN-DM/preprocessed_1000.ref'  


- For **DUC2003** summarization, <br> **Candidate-**  '../Testsets/DUC2003/duc2003.10_300000-500.txt'  **Reference-** '../Testsets/DUC2003/task1_ref0_duc2003-500.txt'  


- For **Gigaword** summarization (titles), <br>  **Candidate-**  '../Testsets/Gigaword/giga.10_300000_500.txt'  **Reference-** '../Testsets/Gigaword/task1_ref0_giga_500.txt' 

In [66]:
reference_doc = '../testsets/giga/task1_ref0_giga_500.txt'
prediction_doc =  '../testsets/giga/giga.10_300000_500.txt'  

with open( reference_doc ,'r') as ref, open( prediction_doc ,'r') as pred:
    reference_en = ref.readlines()
    prediction_en = pred.readlines()

###  Optional preprocessing

In [72]:
def preprocessing(doc, stop_words_remove=False):
    remove_punctuation = []
    preprocessed_doc = []
    # keep only alphanumeric characters(remove punctuations)
    remove_punctuation = [re.sub(r"[^\w]", " ", sent).lower().strip() for sent in doc] 
    
    if stop_words_remove == True:
        # remove stop words requires lower cased tokens
        stop_words = set(stopwords.words("english"))
        for sent in remove_punctuation:
            filtered_sentence = [word for word in word_tokenize(sent) if not word in stop_words]
            preprocessed_doc.append(' '.join(filtered_sentence))
        return preprocessed_doc
    else:
        return remove_punctuation  

In [79]:
# use only if you want to preprocess the sentences

reference_en = preprocessing(reference_en, True) # True to remove stopwords, default only removes punctuation
prediction_en = preprocessing(prediction_en, True)

### Semantic similarity scores

In [80]:
distance = []
for i in range(len(reference_en)):
    distance.append(fasttext_model.wmdistance(reference_en[i],prediction_en[i]))

INFO:gensim.models.keyedvectors:Removed 4 and 3 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(15 unique tokens: ['a', 'c', 'e', 'i', 'k']...) from 2 documents (total 45 corpus positions)
INFO:gensim.models.keyedvectors:Removed 5 and 3 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'c', 'e', 'h', 'i']...) from 2 documents (total 58 corpus positions)
INFO:gensim.models.keyedvectors:Removed 4 and 2 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(14 unique tokens: ['a', 'c', 'e', 'f', 'g']...) from 2 documents (total 56 corpus positions)
INFO:gensim.models.keyedvectors:Re

INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'd', 'e', 'f', 'h']...) from 2 documents (total 70 corpus positions)
INFO:gensim.models.keyedvectors:Removed 7 and 4 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'c', 'd', 'e', 'h']...) from 2 documents (total 82 corpus positions)
INFO:gensim.models.keyedvectors:Removed 7 and 3 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'c', 'd', 'e', 'g']...) from 2 documents (total 67 corpus positions)
INFO:gensim.models.keyedvectors:Removed 7 and 3 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:bui

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 70 corpus positions)
INFO:gensim.models.keyedvectors:Removed 6 and 6 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'b', 'c', 'e', 'f']...) from 2 documents (total 68 corpus positions)
INFO:gensim.models.keyedvectors:Removed 6 and 3 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 74 corpus positions)
INFO:gensim.models.keyedvectors:Removed 6 and 2 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:add

INFO:gensim.models.keyedvectors:Removed 8 and 1 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 63 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 2 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 92 corpus positions)
INFO:gensim.models.keyedvectors:Removed 1 and 4 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 44 corpus positions)
INFO:gensim.models.keyedvectors:R

INFO:gensim.corpora.dictionary:built Dictionary(23 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 79 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 5 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'c', 'e', 'g', 'h']...) from 2 documents (total 98 corpus positions)
INFO:gensim.models.keyedvectors:Removed 4 and 3 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 83 corpus positions)
INFO:gensim.models.keyedvectors:Removed 4 and 3 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:bu

INFO:gensim.models.keyedvectors:Removed 5 and 5 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'd', 'e', 'g', 'h']...) from 2 documents (total 56 corpus positions)
INFO:gensim.models.keyedvectors:Removed 6 and 2 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 60 corpus positions)
INFO:gensim.models.keyedvectors:Removed 4 and 2 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 53 corpus positions)
INFO:gensim.models.keyedvectors:Re

INFO:gensim.corpora.dictionary:built Dictionary(22 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 73 corpus positions)
INFO:gensim.models.keyedvectors:Removed 5 and 2 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(15 unique tokens: ['a', 'd', 'e', 'g', 'h']...) from 2 documents (total 50 corpus positions)
INFO:gensim.models.keyedvectors:Removed 4 and 4 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'c', 'e', 'g', 'i']...) from 2 documents (total 56 corpus positions)
INFO:gensim.models.keyedvectors:Removed 5 and 4 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:bui

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(15 unique tokens: ['a', 'd', 'e', 'g', 'h']...) from 2 documents (total 74 corpus positions)
INFO:gensim.models.keyedvectors:Removed 6 and 1 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(14 unique tokens: ['a', 'b', 'c', 'e', 'h']...) from 2 documents (total 47 corpus positions)
INFO:gensim.models.keyedvectors:Removed 8 and 4 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'b', 'd', 'e', 'i']...) from 2 documents (total 82 corpus positions)
INFO:gensim.models.keyedvectors:Removed 5 and 6 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:add

INFO:gensim.models.keyedvectors:Removed 6 and 2 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'e', 'f', 'g', 'i']...) from 2 documents (total 72 corpus positions)
INFO:gensim.models.keyedvectors:Removed 4 and 2 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'd', 'e', 'f']...) from 2 documents (total 44 corpus positions)
INFO:gensim.models.keyedvectors:Removed 8 and 1 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 68 corpus positions)
INFO:gensim.models.keyedvectors:Re

INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'b', 'e', 'f', 'g']...) from 2 documents (total 72 corpus positions)
INFO:gensim.models.keyedvectors:Removed 5 and 2 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'c', 'd', 'e', 'i']...) from 2 documents (total 65 corpus positions)
INFO:gensim.models.keyedvectors:Removed 7 and 3 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'd', 'e', 'g']...) from 2 documents (total 54 corpus positions)
INFO:gensim.models.keyedvectors:Removed 6 and 1 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:bui

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'c', 'e', 'f', 'h']...) from 2 documents (total 65 corpus positions)
INFO:gensim.models.keyedvectors:Removed 5 and 3 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 55 corpus positions)
INFO:gensim.models.keyedvectors:Removed 4 and 5 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'c', 'e', 'h', 'i']...) from 2 documents (total 66 corpus positions)
INFO:gensim.models.keyedvectors:Removed 7 and 4 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:add

INFO:gensim.models.keyedvectors:Removed 6 and 3 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'c', 'e', 'f', 'g']...) from 2 documents (total 53 corpus positions)
INFO:gensim.models.keyedvectors:Removed 6 and 2 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'c', 'e', 'g']...) from 2 documents (total 59 corpus positions)
INFO:gensim.models.keyedvectors:Removed 5 and 4 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 81 corpus positions)
INFO:gensim.models.keyedvectors:Re

INFO:gensim.corpora.dictionary:built Dictionary(14 unique tokens: ['a', 'c', 'd', 'e', 'i']...) from 2 documents (total 55 corpus positions)
INFO:gensim.models.keyedvectors:Removed 7 and 5 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(15 unique tokens: ['a', 'b', 'c', 'e', 'i']...) from 2 documents (total 77 corpus positions)
INFO:gensim.models.keyedvectors:Removed 6 and 4 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 82 corpus positions)
INFO:gensim.models.keyedvectors:Removed 5 and 1 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:bui

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(12 unique tokens: ['a', 'd', 'e', 'g', 'i']...) from 2 documents (total 52 corpus positions)
INFO:gensim.models.keyedvectors:Removed 6 and 5 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(15 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 78 corpus positions)
INFO:gensim.models.keyedvectors:Removed 8 and 3 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'd', 'e', 'f']...) from 2 documents (total 70 corpus positions)
INFO:gensim.models.keyedvectors:Removed 6 and 2 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:add

INFO:gensim.models.keyedvectors:Removed 3 and 3 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 51 corpus positions)
INFO:gensim.models.keyedvectors:Removed 8 and 3 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 90 corpus positions)
INFO:gensim.models.keyedvectors:Removed 4 and 3 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'c', 'e', 'g', 'h']...) from 2 documents (total 54 corpus positions)
INFO:gensim.models.keyedvectors:Re

INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 51 corpus positions)
INFO:gensim.models.keyedvectors:Removed 5 and 1 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(15 unique tokens: ['a', 'd', 'e', 'f', 'g']...) from 2 documents (total 48 corpus positions)
INFO:gensim.models.keyedvectors:Removed 5 and 4 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(15 unique tokens: ['a', 'c', 'd', 'e', 'g']...) from 2 documents (total 55 corpus positions)
INFO:gensim.models.keyedvectors:Removed 5 and 3 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:bui

INFO:gensim.models.keyedvectors:Removed 7 and 1 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'b', 'c', 'e', 'f']...) from 2 documents (total 60 corpus positions)
INFO:gensim.models.keyedvectors:Removed 6 and 3 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 73 corpus positions)
INFO:gensim.models.keyedvectors:Removed 8 and 1 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 73 corpus positions)
INFO:gensim.models.keyedvectors:Re

INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 65 corpus positions)
INFO:gensim.models.keyedvectors:Removed 2 and 1 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(15 unique tokens: ['a', 'c', 'd', 'e', 'g']...) from 2 documents (total 48 corpus positions)
INFO:gensim.models.keyedvectors:Removed 8 and 2 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 73 corpus positions)
INFO:gensim.models.keyedvectors:Removed 5 and 4 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:bui

INFO:gensim.models.keyedvectors:Removed 4 and 3 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(14 unique tokens: ['a', 'c', 'd', 'e', 'g']...) from 2 documents (total 64 corpus positions)
INFO:gensim.models.keyedvectors:Removed 7 and 5 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 68 corpus positions)
INFO:gensim.models.keyedvectors:Removed 5 and 3 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(15 unique tokens: ['a', 'b', 'c', 'e', 'f']...) from 2 documents (total 55 corpus positions)
INFO:gensim.models.keyedvectors:Re

INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'c', 'e', 'g', 'i']...) from 2 documents (total 48 corpus positions)
INFO:gensim.models.keyedvectors:Removed 14 and 2 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(22 unique tokens: ['a', 'b', 'c', 'e', 'f']...) from 2 documents (total 99 corpus positions)
INFO:gensim.models.keyedvectors:Removed 7 and 3 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'c', 'e', 'h', 'k']...) from 2 documents (total 67 corpus positions)
INFO:gensim.models.keyedvectors:Removed 6 and 2 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:bu

In [81]:
semantic_scores = [1-score for score in distance]

### BLEU or ROUGE scores

Use BLEU scores for machine translation evaluation and ROUGE for text summarization evaluation.

In [134]:
# for machine translation evaluation
bleu_scores =[]
for i in range(len(reference_en)):
    bleu_scores.append(sentence_bleu(reference_en[i],prediction_en[i], smoothing_function=smoother.method4))

In [76]:
# for text summarization evaluation
rouge_scores = []
for i in range(len(reference_en)):
    *pr, f = rouge_n_sentence_level(prediction_en[i], reference_en[i], 2) # 2 for ROUGE-2. ROUGE-N, ROUGE-L and ROUGE-W scores can also be obtained.
    rouge_scores.append(f)

### Human annotation scores

Load the human annotation scores from the respective excel files as below,

- For **DE-EN** translation, '../Human annotations/DE-EN.xlsx'


- For **RO-EN** translation, '../Human annotations/RO-EN.xlsx'


- For **CNN-DM** summariation, '../Human annotations/CNN_1000.xlsx'


- For **DUC2003** summarization,  '../Human annotations/DUC2003.xlsx'


- For **Gigaword** summarization (titles),  '../Human annotations/Gigaword.xlsx'


In [49]:
human_annotation = pd.read_excel('../human annotated/giga.xlsx')

In [50]:
human_scores = human_annotation.iloc[:, 2].tolist()

### Pearson correlation coefficient

In [77]:
# correlation between human annotated scores and Bleu or ROUGE scores

#pearson correlation value, p-value
pearsonr(human_scores, rouge_scores) #bleu_scores or rouge_scores

(0.2593791685234605, 3.94658023133077e-09)

In [82]:
# correlation between human annotated scores and semantic similarity scores

pearsonr(human_scores, semantic_scores) # expected to be higher(more correlated) than with Bleu or ROUGE scores

(0.29766535801691774, 1.0910763315667692e-11)