# Average word embeddings

Applying the strength of word embeddings or vectors to larger text formats, such as documents or sentences, is a very basic technique in NLP.

Suppose we have a sentence **T** , which is composed of words $w_{1}$, $w_{2}$, ⋯, $w_{n}$. Each word has a embedding $uw_{1}$, $uw_{2}$, ⋯, $uw_{n}$. So we define the sentence embedding as: $u_{\mathbf{T}}$:= $\frac{1}{n}$ $\sum_{i=1}^{n}$ ${w_{u_{i}}}$.

In [89]:
# imports
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
import numpy as np
import pandas as pd
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
smoother = SmoothingFunction()
from rouge.rouge import rouge_n_sentence_level # pip install easy-rouge
from scipy.stats import pearsonr

In [90]:
# imports for preprocessing
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\d072726\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\d072726\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Pretrained word embeddings

- Download fasttext embeddings [here](https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec)
- Download glove embeddings [here](http://nlp.stanford.edu/data/glove.840B.300d.zip)

Unzip the glove embeddings and save the embeddings in a folder pretrained_embeddings.

In [2]:
# for convience convert the fasttext and glove embeddings to word2vec format

# for using fasttext embeddings
fasttext_model = KeyedVectors.load_word2vec_format('pretrained_embeddings/wiki.en.vec')

In [260]:
# for using glove embeddings
glove_file = 'pretrained_embeddings/glove.840B.300d.txt'
tmp_file = 'pretrained_embeddings/glove_word2vec.txt'
_ = glove2word2vec(glove_file, tmp_file)

glove_model = KeyedVectors.load_word2vec_format(tmp_file)

INFO:gensim.scripts.glove2word2vec:converting 2196017 vectors from pretrained_embeddings/glove.840B.300d.txt to pretrained_embeddings/glove_word2vec.txt
INFO:gensim.models.utils_any2vec:loading projection weights from pretrained_embeddings/glove_word2vec.txt
INFO:gensim.models.utils_any2vec:duplicate words detected, shrinking matrix size from 2196017 to 2196016
INFO:gensim.models.utils_any2vec:loaded (2196016, 300) matrix from pretrained_embeddings/glove_word2vec.txt


### Load testsets for evaluation

The Automatically generated texts (predictions) from machine translation or text summarization are evaluated against their reference texts. <br> Below are the testsets to be used for evaluation. 

- For **DE-EN** translation, <br>  **reference-** 'testsets/de-en/test2016.en.atok'   **prediction-** 'testsets/de-en/multi30k.test.pred.en.atok' <br>


- For **RO-EN** translation, <br>  **reference-** 'testsets/ro-en/newstest2016_ref_1000.en'  **prediction-**- 'testsets/ro-en/newstest2016_output_1000.en'<br>


- For **giga word** summarization(titles), <br>  **reference-** 'testsets/giga/task1_ref0_giga_450.txt'  **prediction-**'testsets/giga/giga.10_300000_450.txt'


- For **CNN-DM** summariation, <br>  **reference-** 'testsets/cnn/preprocessed.ref'  **prediction-** 'testsets/cnn/preprocessed.pred'


- For **Duc 2003** summarization, <br>  **reference-** 'testsets/duc/task1_ref0_duc2003.txt'  **prediction-** 'testsets/duc/duc2003.10_300000.txt'

In [313]:
reference_doc = 'testsets/de-en/test2016.en.atok'
prediction_doc =  'testsets/de-en/multi30k.test.pred.en.atok'  

with open( reference_doc ,'r') as ref, open( prediction_doc ,'r') as pred:
    reference_en = ref.readlines()
    prediction_en = pred.readlines()

In [225]:
reference_en[:5]

['UN Chief Says There Is No Military Solution in Syria\n',
 'Secretary-General Ban Ki-moon says his response to Russia\'s stepped up military support for Syria is that "there is no military solution" to the nearly five-year conflict and more weapons will only worsen the violence and misery for millions of people .\n',
 'The U .N . chief again urged all parties, including the divided U .N . Security Council, to unite and support inclusive negotiations to find a political solution .\n',
 "Ban told a news conference Wednesday that he plans to meet with foreign ministers of the five permanent council nations - the U .S ., Russia, China, Britain and France - on the sidelines of the General Assembly's ministerial session later this month to discuss Syria .\n",
 'He expressed regret that divisions in the council and among the Syrian people and regional powers "made this situation unsolvable ."\n']

In [226]:
prediction_en[:5]

['UN chief says no military solutions to Syria\n',
 "Secretary General Ban Ki moon says Russia's response to Russia's military support for Syria is that there is no military solution to the conflict lasting nearly five years and more weapons would only exacerbate violence and suffering millions of people .\n",
 'The UN chief urged all parties again , including the divided UN Security Council to unify and support negotiations in order to find a political solution .\n',
 "Ban said at a conference Wednesday that he planned to meet with foreign ministers from five permanent countries permanently present on the council this month - the US , Russia , China , England and France - on the edge of the General Assembly's ministerial session to discuss Syria .\n",
 'Ban voiced regret that divisions within the council and between Syrian people and regional powers have made this intractable situation .\n']

###  Optional preprocessing

In [314]:
def preprocessing(doc, stop_words_remove=False):
    remove_punctuation = []
    preprocessed_doc = []
    # keep only alphanumeric characters(remove punctuations)
    remove_punctuation = [re.sub(r"[^\w]", " ", sent).lower().strip() for sent in doc] 
    
    if stop_words_remove == True:
        # remove stop words requires lower cased tokens
        stop_words = set(stopwords.words("english"))
        for sent in remove_punctuation:
            filtered_sentence = [word for word in word_tokenize(sent.lower()) if not word in stop_words]
            preprocessed_doc.append(' '.join(filtered_sentence))
        return preprocessed_doc
    else:
        return remove_punctuation  

In [315]:
# use only if you want to preprocess the sentences

reference_en = preprocessing(reference_en, True) # True to remove stopwords, default only removes punctuation
prediction_en = preprocessing(prediction_en, True)

### Semantic similarity scores

In [316]:
def document_vector(model, doc):
    
    # remove out-of-vocabulary words  
    doc_tokenize = [word for word in doc.lower().split() if word in model.vocab]
     
    return np.mean(model[doc_tokenize], axis=0) #mean of the word embeddings

In [317]:
ref_embedding = []
pred_embedding = []
for doc in reference_en:
    ref_embedding.append(document_vector(glove_model, doc)) #glove_model for using glove embeddings
for doc in prediction_en:
    pred_embedding.append(document_vector(glove_model, doc))

In [318]:
semantic_scores =[]
for i in range(len(ref_embedding)):
    semantic_scores.append(np.dot(ref_embedding[i],pred_embedding[i]) / (np.linalg.norm(ref_embedding[i])*(np.linalg.norm(pred_embedding[i]))))

### BLEU or ROUGE scores

Use BLEU scores for machine translation evaluation and ROUGE for text summarization evaluation.

In [320]:
# for machine translation evaluation
bleu_scores =[]
for i in range(len(reference_en)):
    bleu_scores.append(sentence_bleu(reference_en[i],prediction_en[i], smoothing_function=smoother.method4))

In [57]:
# for text summarization evaluation
rouge_scores = []
for i in range(len(reference_en)):
    *pr, f = rouge_n_sentence_level(prediction_en[i], reference_en[i], 2) # 2 for ROUGE-2. ROUGE-N, ROUGE-L and ROUGE-W scores can also be obtained.
    rouge_scores.append(f)

### Human annotation scores

Load the human annotation scores from the respective excel files as below,

- For **DE-EN** translation, 'human annotated/DE-EN.xlsx'


- For **RO-EN** translation, 'human annotated/RO-EN.xlsx'


- For **giga word** summarization(titles),'human annotated/giga.xlsx'


- For **CNN-DM** summariation, 'human annotated/CNN_900.xlsx'


- For **Duc 2003** summarization,  'human annotated/duc2003.xlsx'


In [304]:
human_annotation = pd.read_excel('human annotated/DE-EN.xlsx')

In [305]:
human_scores = human_annotation.iloc[:, 3].tolist()

### Pearson correlation coefficient

In [321]:
# correlation between human annotated scores and Bleu or ROUGE scores

#pearson correlation value, p-value
pearsonr(human_scores, bleu_scores) #bleu_scores or rouge_scores

(0.32502479700810627, 4.910746911787776e-26)

In [319]:
# correlation between human annotated scores and semantic similarity scores

pearsonr(human_scores, semantic_scores) # expected to be higher(more correlated) than with Bleu or ROUGE scores

(0.6281325519832321, 7.251607375115274e-111)