# Average word embeddings

Applying the strength of word embeddings or vectors to larger text formats, such as documents or sentences, is a very basic technique in NLP.

Suppose we have a sentence **T** , which is composed of words $w_{1}$, $w_{2}$, ⋯, $w_{n}$. Each word has a embedding $uw_{1}$, $uw_{2}$, ⋯, $uw_{n}$. So we define the sentence embedding as: $u_{\mathbf{T}}$:= $\frac{1}{n}$ $\sum_{i=1}^{n}$ ${w_{u_{i}}}$.

In [1]:
# imports

from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
import numpy as np
import pandas as pd
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()
from rouge.rouge import rouge_n_sentence_level # pip install easy-rouge
from scipy.stats import pearsonr

In [2]:
# imports for preprocessing
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>
[nltk_data] Error loading punkt: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


False

### Pretrained word embeddings

- Download fasttext embeddings [here](https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec)
- Download glove embeddings [here](http://nlp.stanford.edu/data/glove.840B.300d.zip)

Unzip the glove embeddings and save the embeddings in a folder Pretrained_embeddings.

In [5]:
# for convience convert the fasttext and glove embeddings to word2vec format

# for using fasttext embeddings

fasttext_model = KeyedVectors.load_word2vec_format('../Pretrained_embeddings/wiki.en.vec')

NameError: name 'KeyedVectors' is not defined

In [68]:
# for using glove embeddings

glove_file = '../Pretrained_embeddings/glove.840B.300d.txt'
tmp_file = '../Pretrained_embeddings/glove_word2vec.txt'
#_ = glove2word2vec(glove_file, tmp_file) # to convert glove to word2vec format and save it in tmp_file

glove_model = KeyedVectors.load_word2vec_format(tmp_file)

INFO:gensim.models.utils_any2vec:loading projection weights from ../pretrained_embeddings/glove_word2vec.txt
INFO:gensim.models.utils_any2vec:duplicate words detected, shrinking matrix size from 2196017 to 2196016
INFO:gensim.models.utils_any2vec:loaded (2196016, 300) matrix from ../pretrained_embeddings/glove_word2vec.txt


### Load testsets for evaluation

The Automatically generated candidate texts (predictions) from machine translation or text summarization are evaluated against their reference texts. <br> Below are the testsets to be used for evaluation. 

- For **DE-EN** translation, <br> **Candidate-**   '../Testsets/DE-EN/multi30k.test.pred.en.atok'  **Reference-**      '../Testsets/DE-EN/test2016.en.atok'    <br>


- For **RO-EN** translation, <br> **Candidate-**-   '../Testsets/RO-EN/newstest2016_output_1000.en'  **Reference-**    '../Testsets/RO-EN/newstest2016_ref_1000.en'  <br>


- For **CNN-DM** summariation, <br> **Candidate-**   '../Testsets/CNN-DM/preprocessed_1000.pred'  **Reference-** '../Testsets/CNN-DM/preprocessed_1000.ref'  


- For **DUC2003** summarization, <br> **Candidate-**  '../Testsets/DUC2003/duc2003.10_300000-500.txt'  **Reference-** '../Testsets/DUC2003/task1_ref0_duc2003-500.txt'  


- For **Gigaword** summarization (titles), <br>  **Candidate-**  '../Testsets/Gigaword/giga.10_300000_500.txt'  **Reference-** '../Testsets/Gigaword/task1_ref0_giga_500.txt' 

In [2]:
candidate_doc =  '../Testsets/DE-EN/multi30k.test.pred.en.atok'  
reference_doc = '../Testsets/DE-EN/test2016.en.atok' 

with  open( candidate_doc ,'r') as cand, open( reference_doc ,'r') as ref:
    candidate_en = cand.readlines()
    reference_en = ref.readlines()    

In [3]:
candidate_en[:5]

['A man in an orange hat presenting something .\n',
 'A Boston traveler runs across lush , green fence in front of a white fence .\n',
 'A girl in a karate uniform is blocking a board with a kick .\n',
 'Five people in winter jackets and helmets are standing in the snow with vials in the background .\n',
 'People moving off the roof of a house .\n']

In [4]:
reference_en[:5]

['A man in an orange hat starring at something .\n',
 'A Boston Terrier is running on lush green grass in front of a white fence .\n',
 'A girl in karate uniform breaking a stick with a front kick .\n',
 'Five people wearing winter jackets and helmets stand in the snow , with snowmobiles in the background .\n',
 'People are fixing the roof of a house .\n']

###  Optional preprocessing

In [155]:
def preprocessing(doc, stop_words_remove=False):
    remove_punctuation = []
    preprocessed_doc = []
    # keep only alphanumeric characters(remove punctuations)
    remove_punctuation = [re.sub(r"[^\w]", " ", sent).lower().strip() for sent in doc] 
    
    if stop_words_remove == True:
        # remove stop words requires lower cased tokens
        stop_words = set(stopwords.words("english"))
        for sent in remove_punctuation:
            filtered_sentence = [word for word in word_tokenize(sent.lower()) if not word in stop_words]
            preprocessed_doc.append(' '.join(filtered_sentence))
        return preprocessed_doc
    else:
        return remove_punctuation  

In [156]:
# use only if you want to preprocess the sentences

candidate_en = preprocessing(candidate_en, False) # True to remove stopwords, default only removes punctuation
reference_en = preprocessing(reference_en, False) 

### Semantic similarity scores

In [146]:
def document_vector(model, doc):
    # remove out-of-vocabulary words  
    doc_tokenize = [word for word in doc.lower().split() if word in model.vocab]
     
    return np.mean(model[doc_tokenize], axis=0) #mean of the word embeddings

In [147]:
cand_embedding = []
ref_embedding = []

for doc in candidate_en:
    cand_embedding.append(document_vector(fasttext_model, doc)) # glove_model for using glove embeddings
for doc in reference_en:
    ref_embedding.append(document_vector(fasttext_model, doc)) 

In [148]:
semantic_scores =[]
for i in range(len(cand_embedding)):
    semantic_scores.append(np.dot(cand_embedding[i],ref_embedding[i]) / (np.linalg.norm(cand_embedding[i])*(np.linalg.norm(ref_embedding[i]))))

### BLEU or ROUGE scores

Use BLEU scores for machine translation evaluation and ROUGE for text summarization evaluation.

In [9]:
# for machine translation evaluation
bleu_scores =[]
for i in range(len(reference_en)):
    bleu_scores.append(sentence_bleu(candidate_en[i],reference_en[i], smoothing_function=smoother.method4))

In [18]:
rouge_scores

[0.7172413793103449,
 0.6540880503144655,
 0.8275862068965517,
 0.6308724832214765,
 0.7272727272727272,
 0.8266666666666667,
 0.7058823529411765,
 0.7200000000000001,
 0.7804878048780488,
 0.7407407407407407,
 0.5964912280701755,
 0.8181818181818182,
 0.676470588235294,
 0.7305389221556886,
 0.7672955974842768,
 0.6666666666666666,
 0.6758620689655171,
 0.6944444444444443,
 0.8235294117647058,
 0.7462686567164178,
 0.6716417910447762,
 0.543046357615894,
 0.64,
 0.7445255474452555,
 0.7704918032786884,
 0.6201550387596899,
 0.6229508196721311,
 0.672,
 0.6065573770491803,
 0.8051948051948051,
 0.7857142857142857,
 0.7424242424242424,
 0.684931506849315,
 0.4142857142857143,
 0.6805555555555557,
 0.7468354430379746,
 0.681159420289855,
 0.7230769230769231,
 0.5095541401273885,
 0.5045045045045046,
 0.75,
 0.7246376811594202,
 0.5801526717557253,
 0.7195121951219512,
 0.6890756302521008,
 0.5755395683453238,
 0.6708860759493671,
 0.7042253521126761,
 0.704225352112676,
 0.67647058823529

In [17]:
# for text summarization evaluation
rouge_scores = []
for i in range(len(reference_en)):
    *pr, f = rouge_n_sentence_level(candidate_en[i], reference_en[i], 1) # 2 for ROUGE-2. ROUGE-N, ROUGE-L and ROUGE-W scores can also be obtained.
    rouge_scores.append(f)

### Human annotation scores

Load the human annotation scores from the respective excel files as below,

- For **DE-EN** translation, '../Human annotations/DE-EN.xlsx'


- For **RO-EN** translation, '../Human annotations/RO-EN.xlsx'


- For **CNN-DM** summariation, '../Human annotations/CNN_1000.xlsx'


- For **DUC2003** summarization,  '../Human annotations/DUC2003.xlsx'


- For **Gigaword** summarization (titles),  '../Human annotations/Gigaword.xlsx'


In [6]:
human_annotation = pd.read_excel('../human annotated/DE-EN.xlsx')

In [7]:
human_scores = human_annotation.iloc[:, 2].tolist()

### Pearson correlation coefficient

In [8]:
# correlation between human annotated scores and Bleu or ROUGE scores

#pearson correlation value, p-value
pearsonr(human_scores, rouge_scores) #bleu_scores or rouge_scores

NameError: name 'rouge_scores' is not defined

In [149]:
# correlation between human annotated scores and semantic similarity scores

pearsonr(human_scores, semantic_scores) # expected to be higher(more correlated) than with Bleu or ROUGE scores

(0.20064031473151583, 6.146642639596484e-06)