# Word Mover's Distance (WMD)

Word Mover's Distance  measures the distance between two documents or their word embeddings in a meaningful way, even if they have no words in common


Usually, one measures the distance between two word or sentence vectors using the cosine distance , which measures the angle between vectors. 
WMD, on the other hand, uses the Euclidean distance.  The Euclidean distance between two vectors might be large because their lengths differ, but the cosine distance is small because the angle between them is small, we can mitigate some of this by normalizing the vectors.

In [142]:
#Imports
import warnings
warnings.filterwarnings('ignore')

from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.similarities import WmdSimilarity
from pyemd import emd
import pandas as pd
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()
from rouge.rouge import rouge_n_sentence_level # pip install easy-rouge
from scipy.stats import pearsonr

In [6]:
# imports for preprocessing
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\d072726\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\d072726\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Pretrained word embeddings

- Download fasttext pretrained word embeddings [here](https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec)
- Download glove pretrained word embeddings [here](http://nlp.stanford.edu/data/glove.840B.300d.zip)

Unzip the glove embeddings and save the embeddings in a folder pretrained_embeddings.

In [19]:
#For convience convert the fasttext and glove embeddings to word2vec format

#for using fasttext embeddings
fasttext_model = KeyedVectors.load_word2vec_format('pretrained_embeddings/wiki.en.vec')

INFO:gensim.models.utils_any2vec:loading projection weights from pretrained_embeddings/wiki.en.vec
INFO:gensim.models.utils_any2vec:loaded (2519370, 300) matrix from pretrained_embeddings/wiki.en.vec


In [4]:
#for using glove embeddings

glove_file = 'pretrained_embeddings/glove.840B.300d.txt'
tmp_file = 'pretrained_embeddings/glove_word2vec.txt'
_ = glove2word2vec(glove_file, tmp_file)

glove_model = KeyedVectors.load_word2vec_format(tmp_file)

In [90]:
#normalizing the vectors
fasttext_model.init_sims(replace=True) #glove_model for using glove embeddings

INFO:gensim.models.keyedvectors:precomputing L2-norms of word weight vectors


### Load testsets for evaluation

The Automatically generated texts (predictions) from machine translation or text summarization are evaluated against their reference texts. <br> Below are the testsets to be used for evaluation. 

- For **DE-EN** translation, <br>  **reference-** 'testsets/de-en/test2016.en.atok'   **prediction-** 'testsets/de-en/multi30k.test.pred.en.atok' <br>


- For **RO-EN** translation, <br>  **reference-** 'testsets/ro-en/newstest2016_ref_1000.en'  **prediction-**- 'testsets/ro-en/newstest2016_output_1000.en'<br>


- For **giga word** summarization(titles), <br>  **reference-** 'testsets/giga/task1_ref0_giga_450.txt'  **prediction-**'testsets/giga/giga.10_300000_450.txt'


- For **CNN-DM** summariation, <br>  **reference-** 'testsets/cnn/preprocessed.ref'  **prediction-** 'testsets/cnn/preprocessed.pred'


- For **Duc 2003** summarization, <br>  **reference-** 'testsets/duc/task1_ref0_duc2003.txt'  **prediction-** 'testsets/duc/duc2003.10_300000.txt'

In [161]:
reference_doc = 'testsets/de-en/test2016.en.atok'
prediction_doc = 'testsets/de-en/multi30k.test.pred.en.atok'  

with open( reference_doc ,'r') as ref, open( prediction_doc ,'r') as pred:
    reference_en = ref.readlines()
    prediction_en = pred.readlines()

###  Optional preprocessing

In [147]:
def preprocessing(doc, stop_words_remove=False):
    remove_punctuation = []
    preprocessed_doc = []
    # keep only alphanumeric characters(remove punctuations)
    remove_punctuation = [re.sub(r"[^\w]", " ", sent).lower().strip() for sent in doc] 
    
    if stop_words_remove == True:
        # remove stop words requires lower cased tokens
        stop_words = set(stopwords.words("english"))
        for sent in remove_punctuation:
            filtered_sentence = [word for word in word_tokenize(sent) if not word in stop_words]
            preprocessed_doc.append(' '.join(filtered_sentence))
        return preprocessed_doc
    else:
        return remove_punctuation  

In [162]:
# use only if you want to preprocess the sentences

reference_en = preprocessing(reference_en, False) # True to remove stopwords, default only removes punctuation
prediction_en = preprocessing(prediction_en, False)

### Semantic similarity scores

In [163]:
distance = []
for i in range(len(reference_en)):
    distance.append(glove_model.wmdistance(reference_en[i],prediction_en[i]))

INFO:gensim.models.keyedvectors:Removed 8 and 7 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(12 unique tokens: ['a', 'e', 'g', 'h', 'i']...) from 2 documents (total 72 corpus positions)
INFO:gensim.models.keyedvectors:Removed 14 and 15 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'b', 'c', 'e', 'f']...) from 2 documents (total 118 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'c', 'e', 'f']...) from 2 documents (total 94 corpus positions)
INFO:gensim.models.keyedvecto

INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 115 corpus positions)
INFO:gensim.models.keyedvectors:Removed 5 and 6 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(15 unique tokens: ['a', 'b', 'c', 'e', 'f']...) from 2 documents (total 55 corpus positions)
INFO:gensim.models.keyedvectors:Removed 10 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'b', 'c', 'e', 'f']...) from 2 documents (total 90 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 8 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary

INFO:gensim.models.keyedvectors:Removed 10 and 13 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'b', 'd', 'e', 'f']...) from 2 documents (total 109 corpus positions)
INFO:gensim.models.keyedvectors:Removed 7 and 8 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'e', 'f', 'g', 'i']...) from 2 documents (total 64 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 8 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'e', 'g', 'h', 'i']...) from 2 documents (total 61 corpus positions)
INFO:gensim.models.keyedvectors

INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'b', 'd', 'e', 'g']...) from 2 documents (total 76 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'b', 'd', 'e', 'g']...) from 2 documents (total 94 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'b', 'd', 'e', 'g']...) from 2 documents (total 106 corpus positions)
INFO:gensim.models.keyedvectors:Removed 18 and 20 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.diction

INFO:gensim.models.keyedvectors:Removed 19 and 13 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'b', 'c', 'e', 'f']...) from 2 documents (total 122 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'e', 'f', 'g', 'h']...) from 2 documents (total 87 corpus positions)
INFO:gensim.models.keyedvectors:Removed 8 and 9 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 77 corpus positions)
INFO:gensim.models.keyedvecto

INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 118 corpus positions)
INFO:gensim.models.keyedvectors:Removed 8 and 8 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'c', 'd', 'g', 'h']...) from 2 documents (total 71 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 111 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictiona

INFO:gensim.models.keyedvectors:Removed 10 and 9 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'c', 'd', 'e', 'g']...) from 2 documents (total 120 corpus positions)
INFO:gensim.models.keyedvectors:Removed 7 and 6 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'd', 'e', 'g']...) from 2 documents (total 81 corpus positions)
INFO:gensim.models.keyedvectors:Removed 10 and 9 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'd', 'e', 'g', 'i']...) from 2 documents (total 83 corpus positions)
INFO:gensim.models.keyedvectors

INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 96 corpus positions)
INFO:gensim.models.keyedvectors:Removed 10 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 89 corpus positions)
INFO:gensim.models.keyedvectors:Removed 7 and 6 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'b', 'c', 'e', 'g']...) from 2 documents (total 54 corpus positions)
INFO:gensim.models.keyedvectors:Removed 17 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary

INFO:gensim.models.keyedvectors:Removed 9 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'b', 'd', 'e', 'h']...) from 2 documents (total 96 corpus positions)
INFO:gensim.models.keyedvectors:Removed 7 and 6 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'b', 'c', 'd', 'g']...) from 2 documents (total 50 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 8 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 84 corpus positions)
INFO:gensim.models.keyedvectors:

INFO:gensim.corpora.dictionary:built Dictionary(24 unique tokens: ['0', '2', '7', 'a', 'b']...) from 2 documents (total 123 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 13 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 104 corpus positions)
INFO:gensim.models.keyedvectors:Removed 18 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 127 corpus positions)
INFO:gensim.models.keyedvectors:Removed 6 and 6 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.diction

INFO:gensim.models.keyedvectors:Removed 14 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 96 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 8 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 71 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 13 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 111 corpus positions)
INFO:gensim.models.keyedvecto

INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 95 corpus positions)
INFO:gensim.models.keyedvectors:Removed 10 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 86 corpus positions)
INFO:gensim.models.keyedvectors:Removed 6 and 7 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(12 unique tokens: ['a', 'b', 'c', 'd', 'g']...) from 2 documents (total 56 corpus positions)
INFO:gensim.models.keyedvectors:Removed 10 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary

INFO:gensim.models.keyedvectors:Removed 8 and 9 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'b', 'd', 'e', 'f']...) from 2 documents (total 93 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'b', 'd', 'e', 'f']...) from 2 documents (total 126 corpus positions)
INFO:gensim.models.keyedvectors:Removed 8 and 8 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 92 corpus positions)
INFO:gensim.models.keyedvectors

INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'c', 'd', 'e', 'g']...) from 2 documents (total 127 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 99 corpus positions)
INFO:gensim.models.keyedvectors:Removed 15 and 15 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 148 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.diction

INFO:gensim.models.keyedvectors:Removed 30 and 19 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 187 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 8 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'b', 'd', 'g', 'h']...) from 2 documents (total 62 corpus positions)
INFO:gensim.models.keyedvectors:Removed 10 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['a', 'c', 'd', 'e', 'g']...) from 2 documents (total 94 corpus positions)
INFO:gensim.models.keyedvecto

INFO:gensim.corpora.dictionary:built Dictionary(23 unique tokens: ['3', '9', 'a', 'b', 'c']...) from 2 documents (total 73 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'b', 'd', 'e', 'g']...) from 2 documents (total 105 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'b', 'c', 'e', 'f']...) from 2 documents (total 86 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictiona

INFO:gensim.models.keyedvectors:Removed 9 and 9 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'e', 'f', 'g', 'h']...) from 2 documents (total 81 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 9 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 100 corpus positions)
INFO:gensim.models.keyedvectors:Removed 10 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 88 corpus positions)
INFO:gensim.models.keyedvectors

INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 107 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(22 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 131 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 6 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'd', 'e', 'g', 'h']...) from 2 documents (total 64 corpus positions)
INFO:gensim.models.keyedvectors:Removed 7 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionar

INFO:gensim.models.keyedvectors:Removed 9 and 9 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'b', 'd', 'e', 'f']...) from 2 documents (total 74 corpus positions)
INFO:gensim.models.keyedvectors:Removed 8 and 9 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'e', 'f', 'g', 'l']...) from 2 documents (total 77 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 110 corpus positions)
INFO:gensim.models.keyedvectors

INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'b', 'd', 'e', 'g']...) from 2 documents (total 92 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 13 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 99 corpus positions)
INFO:gensim.models.keyedvectors:Removed 7 and 7 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 62 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary

INFO:gensim.models.keyedvectors:Removed 8 and 9 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'c', 'e', 'f', 'h']...) from 2 documents (total 68 corpus positions)
INFO:gensim.models.keyedvectors:Removed 7 and 6 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 66 corpus positions)
INFO:gensim.models.keyedvectors:Removed 7 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 89 corpus positions)
INFO:gensim.models.keyedvectors:R

INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 94 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 93 corpus positions)
INFO:gensim.models.keyedvectors:Removed 8 and 7 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 84 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 5 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:b

INFO:gensim.models.keyedvectors:Removed 13 and 15 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'd', 'e', 'g']...) from 2 documents (total 125 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 79 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 9 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'c', 'd', 'e', 'g']...) from 2 documents (total 84 corpus positions)
INFO:gensim.models.keyedvecto

INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 161 corpus positions)
INFO:gensim.models.keyedvectors:Removed 14 and 13 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 95 corpus positions)
INFO:gensim.models.keyedvectors:Removed 4 and 6 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'b', 'c', 'e', 'f']...) from 2 documents (total 55 corpus positions)
INFO:gensim.models.keyedvectors:Removed 14 and 17 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionar

INFO:gensim.models.keyedvectors:Removed 8 and 8 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'c', 'e', 'g', 'h']...) from 2 documents (total 94 corpus positions)
INFO:gensim.models.keyedvectors:Removed 14 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 114 corpus positions)
INFO:gensim.models.keyedvectors:Removed 14 and 15 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 92 corpus positions)
INFO:gensim.models.keyedvecto

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 90 corpus positions)
INFO:gensim.models.keyedvectors:Removed 7 and 7 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 66 corpus positions)
INFO:gensim.models.keyedvectors:Removed 16 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 110 corpus positions)
INFO:gensim.models.keyedvectors:Removed 10 and 9 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary

INFO:gensim.models.keyedvectors:Removed 12 and 13 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'b', 'd', 'e', 'f']...) from 2 documents (total 100 corpus positions)
INFO:gensim.models.keyedvectors:Removed 7 and 7 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(15 unique tokens: ['a', 'b', 'd', 'e', 'g']...) from 2 documents (total 51 corpus positions)
INFO:gensim.models.keyedvectors:Removed 17 and 21 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 159 corpus positions)
INFO:gensim.models.keyedvect

INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'b', 'd', 'e', 'g']...) from 2 documents (total 73 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 9 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(15 unique tokens: ['a', 'c', 'd', 'e', 'g']...) from 2 documents (total 80 corpus positions)
INFO:gensim.models.keyedvectors:Removed 5 and 6 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'b', 'e', 'f', 'g']...) from 2 documents (total 64 corpus positions)
INFO:gensim.models.keyedvectors:Removed 22 and 18 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:

INFO:gensim.models.keyedvectors:Removed 11 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 133 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'b', 'd', 'e', 'f']...) from 2 documents (total 103 corpus positions)
INFO:gensim.models.keyedvectors:Removed 6 and 5 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'c', 'e', 'g', 'i']...) from 2 documents (total 97 corpus positions)
INFO:gensim.models.keyedvect

INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 87 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'b', 'c', 'e', 'g']...) from 2 documents (total 87 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['a', 'b', 'd', 'e', 'g']...) from 2 documents (total 105 corpus positions)
INFO:gensim.models.keyedvectors:Removed 14 and 15 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.diction

INFO:gensim.models.keyedvectors:Removed 11 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'c', 'e', 'f']...) from 2 documents (total 112 corpus positions)
INFO:gensim.models.keyedvectors:Removed 8 and 8 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'b', 'd', 'e', 'f']...) from 2 documents (total 66 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 133 corpus positions)
INFO:gensim.models.keyedvect

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'd', 'e', 'f', 'g']...) from 2 documents (total 70 corpus positions)
INFO:gensim.models.keyedvectors:Removed 15 and 15 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 146 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 9 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'c', 'd', 'e', 'g']...) from 2 documents (total 84 corpus positions)
INFO:gensim.models.keyedvectors:Removed 21 and 21 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionar

INFO:gensim.models.keyedvectors:Removed 11 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 104 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 9 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 92 corpus positions)
INFO:gensim.models.keyedvectors:Removed 7 and 4 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['a', 'b', 'd', 'e', 'f']...) from 2 documents (total 43 corpus positions)
INFO:gensim.models.keyedvectors

INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 96 corpus positions)
INFO:gensim.models.keyedvectors:Removed 10 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 71 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 9 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'd', 'e', 'g', 'h']...) from 2 documents (total 94 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 13 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionar

INFO:gensim.models.keyedvectors:Removed 17 and 19 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'd', 'e', 'g', 'h']...) from 2 documents (total 138 corpus positions)
INFO:gensim.models.keyedvectors:Removed 16 and 17 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 137 corpus positions)
INFO:gensim.models.keyedvectors:Removed 14 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(22 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 141 corpus positions)
INFO:gensim.models.keyedv

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'b', 'd', 'e', 'g']...) from 2 documents (total 114 corpus positions)
INFO:gensim.models.keyedvectors:Removed 16 and 6 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'c', 'd', 'e', 'f']...) from 2 documents (total 97 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 107 corpus positions)
INFO:gensim.models.keyedvectors:Removed 14 and 16 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.diction

INFO:gensim.models.keyedvectors:Removed 13 and 13 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 100 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 9 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 102 corpus positions)
INFO:gensim.models.keyedvectors:Removed 10 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'c', 'e', 'g', 'h']...) from 2 documents (total 91 corpus positions)
INFO:gensim.models.keyedvec

INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 105 corpus positions)
INFO:gensim.models.keyedvectors:Removed 6 and 6 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'b', 'c', 'e', 'g']...) from 2 documents (total 63 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 113 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 9 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary

INFO:gensim.models.keyedvectors:Removed 16 and 19 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(22 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 114 corpus positions)
INFO:gensim.models.keyedvectors:Removed 8 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['a', 'b', 'd', 'e', 'g']...) from 2 documents (total 84 corpus positions)
INFO:gensim.models.keyedvectors:Removed 18 and 13 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['a', 'd', 'e', 'f', 'g']...) from 2 documents (total 103 corpus positions)
INFO:gensim.models.keyedvec

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'c', 'd', 'g', 'h']...) from 2 documents (total 74 corpus positions)
INFO:gensim.models.keyedvectors:Removed 14 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['a', 'c', 'd', 'e', 'g']...) from 2 documents (total 121 corpus positions)
INFO:gensim.models.keyedvectors:Removed 8 and 7 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'd', 'e', 'g']...) from 2 documents (total 79 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionar

In [164]:
semantic_scores = [1-score for score in distance]

### BLEU or ROUGE scores

Use BLEU scores for machine translation evaluation and ROUGE for text summarization evaluation.

In [134]:
# for machine translation evaluation
bleu_scores =[]
for i in range(len(reference_en)):
    bleu_scores.append(sentence_bleu(reference_en[i],prediction_en[i], smoothing_function=smoother.method4))

In [30]:
# for text summarization evaluation
rouge_scores = []
for i in range(len(reference_en)):
    *pr, f = rouge_n_sentence_level(prediction_en[i], reference_en[i], 2) # 2 for ROUGE-2. ROUGE-N, ROUGE-L and ROUGE-W scores can also be obtained.
    rouge_scores.append(f)

### Human annotation scores

Load the human annotation scores from the respective excel files as below,

- For **DE-EN** translation, 'human annotated/DE-EN.xlsx'


- For **RO-EN** translation, 'human annotated/RO-EN.xlsx'


- For **giga word** summarization(titles),'human annotated/giga.xlsx'


- For **CNN-DM** summariation, 'human annotated/CNN_900.xlsx'


- For **Duc 2003** summarization,  'human annotated/duc2003.xlsx'


In [135]:
human_annotation = pd.read_excel('human annotated/DE-EN.xlsx')

In [136]:
human_scores = human_annotation.iloc[:, 3].tolist()

### Pearson correlation coefficient

In [137]:
# correlation between human annotated scores and Bleu or ROUGE scores

#pearson correlation value, p-value
pearsonr(human_scores, bleu_scores) #bleu_scores or rouge_scores

(0.28439322985388266, 4.638694382037051e-20)

In [165]:
# correlation between human annotated scores and semantic similarity scores

pearsonr(human_scores, semantic_scores) # expected to be higher(more correlated) than with Bleu or ROUGE scores

(0.5173782719035751, 1.4891641455804286e-69)