# Word Mover's Distance (WMD)

Word Mover's Distance  measures the distance between two documents or their word embeddings in a meaningful way, even if they have no words in common


Usually, one measures the distance between two word or sentence vectors using the cosine distance , which measures the angle between vectors. 
WMD, on the other hand, uses the Euclidean distance.  The Euclidean distance between two vectors might be large because their lengths differ, but the cosine distance is small because the angle between them is small, we can mitigate some of this by normalizing the vectors.

In [1]:
#Imports

from gensim.similarities import WmdSimilarity
from pyemd import emd
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
import pandas as pd
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()
from rouge.rouge import rouge_n_sentence_level # pip install easy-rouge
from scipy.stats import pearsonr

In [2]:
# imports for preprocessing

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\d072726\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\d072726\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Pretrained word embeddings

- Download fasttext pretrained word embeddings [here](https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec)
- Download glove pretrained word embeddings [here](http://nlp.stanford.edu/data/glove.840B.300d.zip)

Unzip the glove embeddings and save the embeddings in a folder Pretrained_embeddings.

In [3]:
#For convience convert the fasttext and glove embeddings to word2vec format

#for using fasttext embeddings

fasttext_model = KeyedVectors.load_word2vec_format('../Pretrained_embeddings/wiki.en.vec')

INFO:gensim.models.utils_any2vec:loading projection weights from ../Pretrained_embeddings/wiki.en.vec
INFO:gensim.models.utils_any2vec:loaded (2519370, 300) matrix from ../Pretrained_embeddings/wiki.en.vec


In [30]:
#for using glove embeddings

glove_file = '../Pretrained_embeddings/glove.840B.300d.txt'
tmp_file = '../Pretrained_embeddings/glove_word2vec.txt'
#_ = glove2word2vec(glove_file, tmp_file)

glove_model = KeyedVectors.load_word2vec_format(tmp_file)

INFO:gensim.models.utils_any2vec:loading projection weights from ../pretrained_embeddings/glove_word2vec.txt
INFO:gensim.models.utils_any2vec:duplicate words detected, shrinking matrix size from 2196017 to 2196016
INFO:gensim.models.utils_any2vec:loaded (2196016, 300) matrix from ../pretrained_embeddings/glove_word2vec.txt


In [4]:
#normalizing the vectors

fasttext_model.init_sims(replace=True) #glove_model for using glove embeddings

INFO:gensim.models.keyedvectors:precomputing L2-norms of word weight vectors


### Load testsets for evaluation

The Automatically generated candidate texts (predictions) from machine translation or text summarization are evaluated against their reference texts. <br> Below are the testsets to be used for evaluation. 

- For **DE-EN** translation, <br> **Candidate-**   '../Testsets/DE-EN/multi30k.test.pred.en.atok'  **Reference-**      '../Testsets/DE-EN/test2016.en.atok'    <br>


- For **RO-EN** translation, <br> **Candidate-**-   '../Testsets/RO-EN/newstest2016_output_1000.en'  **Reference-**    '../Testsets/RO-EN/newstest2016_ref_1000.en'  <br>


- For **CNN-DM** summariation, <br> **Candidate-**   '../Testsets/CNN-DM/preprocessed_1000.pred'  **Reference-** '../Testsets/CNN-DM/preprocessed_1000.ref'  


- For **DUC2003** summarization, <br> **Candidate-**  '../Testsets/DUC2003/duc2003.10_300000-500.txt'  **Reference-** '../Testsets/DUC2003/task1_ref0_duc2003-500.txt'  


- For **Gigaword** summarization (titles), <br>  **Candidate-**  '../Testsets/Gigaword/giga.10_300000_500.txt'  **Reference-** '../Testsets/Gigaword/task1_ref0_giga_500.txt' 

In [10]:
candidate_doc =  '../Testsets/DE-EN/multi30k.test.pred.en.atok'  
reference_doc = '../Testsets/DE-EN/test2016.en.atok' 

with  open( candidate_doc ,'r') as cand, open( reference_doc ,'r') as ref:
    candidate_en = cand.readlines()
    reference_en = ref.readlines()   

In [6]:
candidate_en[:5]

['A man in an orange hat presenting something .\n',
 'A Boston traveler runs across lush , green fence in front of a white fence .\n',
 'A girl in a karate uniform is blocking a board with a kick .\n',
 'Five people in winter jackets and helmets are standing in the snow with vials in the background .\n',
 'People moving off the roof of a house .\n']

In [7]:
reference_en[:5]

['A man in an orange hat starring at something .\n',
 'A Boston Terrier is running on lush green grass in front of a white fence .\n',
 'A girl in karate uniform breaking a stick with a front kick .\n',
 'Five people wearing winter jackets and helmets stand in the snow , with snowmobiles in the background .\n',
 'People are fixing the roof of a house .\n']

###  Optional preprocessing

In [8]:
def preprocessing(doc, stop_words_remove=False):
    remove_punctuation = []
    preprocessed_doc = []
    # keep only alphanumeric characters(remove punctuations)
    remove_punctuation = [re.sub(r"[^\w]", " ", sent).lower().strip() for sent in doc] 
    
    if stop_words_remove == True:
        # remove stop words requires lower cased tokens
        stop_words = set(stopwords.words("english"))
        for sent in doc:
            filtered_sentence = [word for word in word_tokenize(sent) if not word in stop_words]
            preprocessed_doc.append(' '.join(filtered_sentence))
        return preprocessed_doc
    else:
        return remove_punctuation  

In [9]:
# use only if you want to preprocess the sentences

candidate_en = preprocessing(candidate_en, False) # True to remove stopwords, default only removes punctuation
reference_en = preprocessing(reference_en, False) 

### Semantic similarity scores

In [11]:
distance = []
for i in range(len(reference_en)):
    distance.append(fasttext_model.wmdistance(candidate_en[i], reference_en[i]))

INFO:gensim.models.keyedvectors:Removed 10 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(13 unique tokens: ['.', 'a', 'e', 'g', 'h']...) from 2 documents (total 72 corpus positions)
INFO:gensim.models.keyedvectors:Removed 18 and 19 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: [',', '.', 'a', 'c', 'e']...) from 2 documents (total 116 corpus positions)
INFO:gensim.models.keyedvectors:Removed 15 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 94 corpus positions)
INFO:gensim.models.keyedvec

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 116 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 8 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['.', 'a', 'b', 'c', 'e']...) from 2 documents (total 55 corpus positions)
INFO:gensim.models.keyedvectors:Removed 15 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['.', 'a', 'b', 'e', 'f']...) from 2 documents (total 88 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionar

INFO:gensim.models.keyedvectors:Removed 16 and 13 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 109 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 64 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['.', 'a', 'e', 'h', 'i']...) from 2 documents (total 61 corpus positions)
INFO:gensim.models.keyedvec

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['.', 'a', 'b', 'd', 'e']...) from 2 documents (total 76 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['.', 'a', 'b', 'd', 'e']...) from 2 documents (total 94 corpus positions)
INFO:gensim.models.keyedvectors:Removed 17 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['.', 'a', 'b', 'd', 'e']...) from 2 documents (total 107 corpus positions)
INFO:gensim.models.keyedvectors:Removed 23 and 21 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.diction

INFO:gensim.models.keyedvectors:Removed 16 and 20 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(23 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 124 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 87 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['a', 'b', 'c', 'd', 'e']...) from 2 documents (total 76 corpus positions)
INFO:gensim.models.keyedvec

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 118 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 71 corpus positions)
INFO:gensim.models.keyedvectors:Removed 15 and 16 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(22 unique tokens: ['.', 'a', 'b', 'd', 'e']...) from 2 documents (total 111 corpus positions)
INFO:gensim.models.keyedvectors:Removed 14 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictio

INFO:gensim.models.keyedvectors:Removed 12 and 13 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 120 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 81 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 13 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['.', 'a', 'd', 'e', 'g']...) from 2 documents (total 83 corpus positions)
INFO:gensim.models.keyedvect

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(22 unique tokens: ['.', 'a', 'c', 'e', 'g']...) from 2 documents (total 97 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 88 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['.', 'a', 'c', 'd', 'h']...) from 2 documents (total 54 corpus positions)
INFO:gensim.models.keyedvectors:Removed 17 and 20 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionar

INFO:gensim.models.keyedvectors:Removed 14 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['.', 'a', 'b', 'd', 'e']...) from 2 documents (total 96 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 50 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 84 corpus positions)
INFO:gensim.models.keyedvecto

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(25 unique tokens: ['"', '-', '.', '@', 'a']...) from 2 documents (total 125 corpus positions)
INFO:gensim.models.keyedvectors:Removed 16 and 16 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['.', 'a', 'd', 'e', 'f']...) from 2 documents (total 103 corpus positions)
INFO:gensim.models.keyedvectors:Removed 15 and 21 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['.', 'a', 'b', 'e', 'g']...) from 2 documents (total 127 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 9 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.diction

INFO:gensim.models.keyedvectors:Removed 13 and 17 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 96 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 71 corpus positions)
INFO:gensim.models.keyedvectors:Removed 16 and 15 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(22 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 110 corpus positions)
INFO:gensim.models.keyedvec

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 95 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 13 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(23 unique tokens: [',', '.', 'a', 'b', 'c']...) from 2 documents (total 87 corpus positions)
INFO:gensim.models.keyedvectors:Removed 10 and 9 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(13 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 56 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionar

INFO:gensim.models.keyedvectors:Removed 12 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 93 corpus positions)
INFO:gensim.models.keyedvectors:Removed 14 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['.', 'a', 'b', 'd', 'e']...) from 2 documents (total 126 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 92 corpus positions)
INFO:gensim.models.keyedvec

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(22 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 130 corpus positions)
INFO:gensim.models.keyedvectors:Removed 15 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 99 corpus positions)
INFO:gensim.models.keyedvectors:Removed 20 and 20 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 144 corpus positions)
INFO:gensim.models.keyedvectors:Removed 14 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictio

INFO:gensim.models.keyedvectors:Removed 22 and 30 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(24 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 190 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['.', 'a', 'b', 'd', 'g']...) from 2 documents (total 62 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 13 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(22 unique tokens: ['.', 'a', 'c', 'e', 'f']...) from 2 documents (total 94 corpus positions)
INFO:gensim.models.keyedvec

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['"', '.', 'b', 'c', 'd']...) from 2 documents (total 73 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 16 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 105 corpus positions)
INFO:gensim.models.keyedvectors:Removed 14 and 15 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['.', 'a', 'b', 'c', 'e']...) from 2 documents (total 84 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.diction

INFO:gensim.models.keyedvectors:Removed 12 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['.', 'a', 'd', 'e', 'f']...) from 2 documents (total 81 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 100 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 13 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 88 corpus positions)
INFO:gensim.models.keyedvec

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 107 corpus positions)
INFO:gensim.models.keyedvectors:Removed 17 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(23 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 131 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['.', 'a', 'd', 'e', 'l']...) from 2 documents (total 64 corpus positions)
INFO:gensim.models.keyedvectors:Removed 15 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.diction

INFO:gensim.models.keyedvectors:Removed 12 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['.', 'a', 'b', 'd', 'e']...) from 2 documents (total 74 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['.', 'a', 'c', 'e', 'f']...) from 2 documents (total 77 corpus positions)
INFO:gensim.models.keyedvectors:Removed 14 and 16 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 110 corpus positions)
INFO:gensim.models.keyedvec

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['.', 'a', 'd', 'e', 'g']...) from 2 documents (total 92 corpus positions)
INFO:gensim.models.keyedvectors:Removed 16 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 99 corpus positions)
INFO:gensim.models.keyedvectors:Removed 10 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 62 corpus positions)
INFO:gensim.models.keyedvectors:Removed 14 and 15 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictiona

INFO:gensim.models.keyedvectors:Removed 12 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['.', 'a', 'e', 'f', 'g']...) from 2 documents (total 68 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['.', 'c', 'd', 'e', 'f']...) from 2 documents (total 66 corpus positions)
INFO:gensim.models.keyedvectors:Removed 14 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 88 corpus positions)
INFO:gensim.models.keyedvecto

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 94 corpus positions)
INFO:gensim.models.keyedvectors:Removed 14 and 15 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 93 corpus positions)
INFO:gensim.models.keyedvectors:Removed 10 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 84 corpus positions)
INFO:gensim.models.keyedvectors:Removed 8 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionar

INFO:gensim.models.keyedvectors:Removed 18 and 16 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 125 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 15 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(22 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 79 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['.', 'a', 'd', 'e', 'g']...) from 2 documents (total 84 corpus positions)
INFO:gensim.models.keyedvec

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(22 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 160 corpus positions)
INFO:gensim.models.keyedvectors:Removed 16 and 17 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 95 corpus positions)
INFO:gensim.models.keyedvectors:Removed 8 and 6 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(15 unique tokens: ['a', 'b', 'c', 'e', 'f']...) from 2 documents (total 53 corpus positions)
INFO:gensim.models.keyedvectors:Removed 20 and 17 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionar

INFO:gensim.models.keyedvectors:Removed 11 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['.', 'a', 'c', 'e', 'g']...) from 2 documents (total 94 corpus positions)
INFO:gensim.models.keyedvectors:Removed 15 and 17 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 114 corpus positions)
INFO:gensim.models.keyedvectors:Removed 15 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(22 unique tokens: ['-', '.', '@', 'a', 'b']...) from 2 documents (total 98 corpus positions)
INFO:gensim.models.keyedvec

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 90 corpus positions)
INFO:gensim.models.keyedvectors:Removed 10 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 66 corpus positions)
INFO:gensim.models.keyedvectors:Removed 15 and 19 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['.', 'a', 'b', 'c', 'e']...) from 2 documents (total 110 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 13 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.diction

INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 82 corpus positions)
INFO:gensim.models.keyedvectors:Removed 15 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: [',', '.', 'a', 'b', 'd']...) from 2 documents (total 102 corpus positions)
INFO:gensim.models.keyedvectors:Removed 10 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['.', 'a', 'b', 'd', 'e']...) from 2 documents (total 51 corpus positions)
INFO:gensim.models.keyedvectors:Removed 24 and 20 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.diction

INFO:gensim.models.keyedvectors:Removed 11 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['.', 'a', 'b', 'd', 'e']...) from 2 documents (total 73 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 15 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 80 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 8 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['.', 'a', 'b', 'c', 'e']...) from 2 documents (total 64 corpus positions)
INFO:gensim.models.keyedvector

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['.', 'a', 'b', 'c', 'e']...) from 2 documents (total 92 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 133 corpus positions)
INFO:gensim.models.keyedvectors:Removed 15 and 16 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['.', 'a', 'b', 'd', 'e']...) from 2 documents (total 103 corpus positions)
INFO:gensim.models.keyedvectors:Removed 8 and 13 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.diction

INFO:gensim.models.keyedvectors:Removed 16 and 15 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(22 unique tokens: [',', '.', 'a', 'b', 'd']...) from 2 documents (total 89 corpus positions)
INFO:gensim.models.keyedvectors:Removed 15 and 15 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 87 corpus positions)
INFO:gensim.models.keyedvectors:Removed 15 and 16 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(22 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 105 corpus positions)
INFO:gensim.models.keyedvec

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['.', 'a', 'c', 'e', 'f']...) from 2 documents (total 90 corpus positions)
INFO:gensim.models.keyedvectors:Removed 17 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['.', 'a', 'c', 'e', 'f']...) from 2 documents (total 112 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['.', 'a', 'b', 'd', 'e']...) from 2 documents (total 66 corpus positions)
INFO:gensim.models.keyedvectors:Removed 15 and 15 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.diction

INFO:gensim.corpora.dictionary:built Dictionary(24 unique tokens: [',', '.', 'a', 'b', 'c']...) from 2 documents (total 131 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 69 corpus positions)
INFO:gensim.models.keyedvectors:Removed 18 and 20 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(23 unique tokens: ["'", ',', '.', 'a', 'b']...) from 2 documents (total 144 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 11 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictio

INFO:gensim.models.keyedvectors:Removed 17 and 17 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 115 corpus positions)
INFO:gensim.models.keyedvectors:Removed 15 and 13 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 103 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['.', 'a', 'b', 'c', 'e']...) from 2 documents (total 92 corpus positions)
INFO:gensim.models.keyedve

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(15 unique tokens: ['.', 'a', 'e', 'g', 'h']...) from 2 documents (total 79 corpus positions)
INFO:gensim.models.keyedvectors:Removed 14 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['.', 'a', 'd', 'e', 'f']...) from 2 documents (total 96 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 13 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 71 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 16 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictiona

INFO:gensim.models.keyedvectors:Removed 11 and 9 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(23 unique tokens: ['.', 'a', 'b', 'd', 'e']...) from 2 documents (total 75 corpus positions)
INFO:gensim.models.keyedvectors:Removed 22 and 20 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 138 corpus positions)
INFO:gensim.models.keyedvectors:Removed 20 and 19 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(22 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 137 corpus positions)
INFO:gensim.models.keyedvec

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 132 corpus positions)
INFO:gensim.models.keyedvectors:Removed 16 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['.', 'a', 'b', 'd', 'e']...) from 2 documents (total 114 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 17 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(22 unique tokens: ['.', 'a', 'e', 'f', 'h']...) from 2 documents (total 99 corpus positions)
INFO:gensim.models.keyedvectors:Removed 15 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.diction

INFO:gensim.corpora.dictionary:built Dictionary(16 unique tokens: ['.', 'a', 'd', 'e', 'g']...) from 2 documents (total 81 corpus positions)
INFO:gensim.models.keyedvectors:Removed 25 and 24 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(23 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 169 corpus positions)
INFO:gensim.models.keyedvectors:Removed 16 and 16 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 100 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 15 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictio

INFO:gensim.models.keyedvectors:Removed 10 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(17 unique tokens: ['.', 'a', 'c', 'd', 'e']...) from 2 documents (total 72 corpus positions)
INFO:gensim.models.keyedvectors:Removed 15 and 16 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(21 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 105 corpus positions)
INFO:gensim.models.keyedvectors:Removed 9 and 9 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['.', 'a', 'b', 'c', 'e']...) from 2 documents (total 63 corpus positions)
INFO:gensim.models.keyedvecto

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: [',', '.', 'a', 'c', 'd']...) from 2 documents (total 104 corpus positions)
INFO:gensim.models.keyedvectors:Removed 14 and 16 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: [',', '.', 'a', 'b', 'd']...) from 2 documents (total 96 corpus positions)
INFO:gensim.models.keyedvectors:Removed 20 and 19 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(24 unique tokens: [',', '.', 'a', 'b', 'c']...) from 2 documents (total 116 corpus positions)
INFO:gensim.models.keyedvectors:Removed 12 and 10 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictio

INFO:gensim.corpora.dictionary:built Dictionary(19 unique tokens: [',', '.', 'a', 'b', 'd']...) from 2 documents (total 106 corpus positions)
INFO:gensim.models.keyedvectors:Removed 13 and 14 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(18 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 73 corpus positions)
INFO:gensim.models.keyedvectors:Removed 11 and 12 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(20 unique tokens: ['.', 'a', 'b', 'c', 'd']...) from 2 documents (total 74 corpus positions)
INFO:gensim.models.keyedvectors:Removed 14 and 16 OOV words from document 1 and 2 (respectively).
INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.diction

In [12]:
semantic_scores = [1-score for score in distance]

### BLEU or ROUGE scores

Use BLEU scores for machine translation evaluation and ROUGE for text summarization evaluation.

In [13]:
# for machine translation evaluation

bleu_scores =[]
for i in range(len(reference_en)):
    bleu_scores.append(sentence_bleu(candidate_en[i],reference_en[i], smoothing_function=smoother.method4))

In [76]:
# for text summarization evaluation

rouge_scores = []
for i in range(len(reference_en)):
    *pr, f = rouge_n_sentence_level(candidate_en[i], reference_en[i], 1) # 2 for ROUGE-2. ROUGE-N, ROUGE-L and ROUGE-W scores can also be obtained.
    rouge_scores.append(f)

### Human annotation scores

Load the human annotation scores from the respective excel files as below,

- For **DE-EN** translation, '../Human annotations/DE-EN.xlsx'


- For **RO-EN** translation, '../Human annotations/RO-EN.xlsx'


- For **CNN-DM** summariation, '../Human annotations/CNN_1000.xlsx'


- For **DUC2003** summarization,  '../Human annotations/DUC2003.xlsx'


- For **Gigaword** summarization (titles),  '../Human annotations/Gigaword.xlsx'


In [14]:
human_annotation = pd.read_excel('../Human annotations/DE-EN.xlsx')

In [15]:
human_scores = human_annotation.iloc[:, 2].tolist()

### Pearson correlation coefficient

In [16]:
# correlation between human annotated scores and Bleu or ROUGE scores

#pearson correlation value, p-value

pearsonr(human_scores, bleu_scores) # bleu_scores or rouge_scores

(0.3901802069640419, 1.0390848166845472e-37)

In [17]:
# correlation between human annotated scores and semantic similarity scores

pearsonr(human_scores, semantic_scores) # expected to be higher(more correlated) than with Bleu or ROUGE scores

(0.5313644463150785, 5.9808048153502184e-74)