# <b><u>Evaluation using Word Embedding : Word2Vec</u></b>

One of the main drawbacks of BLEU is that it does not check grammar or whether words are placed and used appropriately. <br>
The aim here is to address that by introducing features from word embeddings, which let us judge how well a word is placed and how it relates to the other words in the sentence. <br>
The reference sentences are correct translations, so a word embedding model is trained on the corpus of all reference sentences.<br>
This gives us a vocabulary of words from the reference corpus, along with a feature vector for each word that represents its relations with other words.<br>
Using these vectors, we look for an evaluation score that may work better than BLEU in terms of semantics.<br>

In [1]:
import pandas as pd
import gensim
import re
import nltk
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
df = pd.read_csv("eng-french.csv")
df.columns = ["English","French"]
df = df["English"]
df

0                                                       Hi.
1                                                      Run!
2                                                      Run!
3                                                      Who?
4                                                      Wow!
                                ...                        
175616    Top-down economics never works, said Obama. "T...
175617    A carbon footprint is the amount of carbon dio...
175618    Death is something that we're often discourage...
175619    Since there are usually multiple websites on a...
175620    If someone who doesn't know your background sa...
Name: English, Length: 175621, dtype: object

### <u><b>Preprocessing:</b></u>

- All words are converted to lower case, so that words are not treated as distinct merely because of case. Example: the word "school" is written as "School" at the start of a sentence, but "School" and "school" mean the same thing.
- All punctuation is removed, as it does not play a vital role in English (this might not be the case for other languages).
- Stopwords are NOT removed, as they matter here. A sentence translated with stopwords missing is a badly translated sentence, so stopwords play a considerable role in the quality of translation.
- Words are lemmatized to reduce the number of unique words, for convenience in training.

In [3]:
lemmatizer = nltk.stem.WordNetLemmatizer()

def preprocess(x):
    s = x.lower() # convert all to lower case for convenience
    s = re.sub(r'[^a-z0-9\s]+', '', s) # remove punctuation (text is already lower case)
    s = s.split() # split on any whitespace; avoids empty tokens from repeated spaces
    s = [lemmatizer.lemmatize(word) for word in s] # lemmatize to base words
    # stop words are not removed: their presence (and position) matters for translation quality
    return s

preprocessed_sentences = df.apply(lambda x: preprocess(x))
preprocessed_sentences

0                                                      [hi]
1                                                     [run]
2                                                     [run]
3                                                     [who]
4                                                     [wow]
                                ...                        
175616    [topdown, economics, never, work, said, obama,...
175617    [a, carbon, footprint, is, the, amount, of, ca...
175618    [death, is, something, that, were, often, disc...
175619    [since, there, are, usually, multiple, website...
175620    [if, someone, who, doesnt, know, your, backgro...
Name: English, Length: 175621, dtype: object

### <u><b>Training the Word2Vec model:</b></u>

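- `window` => The maximum distance, in words to the left and to the right, between the current word and the context words it is related to. A larger window captures broader context, but is not automatically better.
- `vector_size` => The number of `features` (dimensions) that the word embedding model generates for each word. Larger vectors can encode more, but need more data and training time.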
- `min_count` => Minimum number of times a word must occur in the corpus to be included in the vocabulary; words that occur less often are ignored

In [4]:
model = gensim.models.Word2Vec(window=10, vector_size=2000, min_count=2, workers=4)
model.build_vocab(preprocessed_sentences)
model.train(preprocessed_sentences, total_examples=model.corpus_count, epochs=model.epochs)

(3754136, 5410400)

In [5]:
print("Number of words in vocabulary: ", len(model.wv))
print("Size(number of features) of each word vector: ", model.vector_size) 

Number of words in vocabulary:  8810
Size(number of features) of each word vector:  2000


### <u><b>Some points to note:</b></u>

- The number of words in `translation` may not be equal to the `number of words` in `reference`. So if we simply place the 1D vectors of the words side by side, the 2D matrix generated for `translation` may not have the same `number of columns` as the 2D matrix generated for `reference`. However, the `number of rows` in both matrices will be equal, namely the `number of features` the word embedding generated.
- Not every word in the `translation` is necessarily present in the `reference` corpus's `vocabulary`, even if it is a valid word of the language (dictionary).
- The position of a word in the sentence needs to be emphasized. Languages place parts of speech differently: some at the beginning, some in the middle, some at the end. For example, English uses the order `Subject-Verb-Object`, whereas Japanese uses `Subject-Object-Verb`.
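The first two points can be made concrete with a small sketch. The embeddings here are random stand-ins (an assumption for illustration; in the notebook they come from `model.wv`), and `sentence_matrix` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

n_features = 4  # stand-in for model.vector_size
rng = np.random.default_rng(0)

# Hypothetical toy embeddings for a five-word vocabulary
toy_vectors = {w: rng.normal(size=n_features)
               for w in ["i", "go", "to", "the", "school"]}

def sentence_matrix(tokens, vectors):
    """Stack known word vectors as columns; skip out-of-vocabulary tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.empty((n_features, 0))
    return np.column_stack(known)

ref_matrix = sentence_matrix(["i", "go", "to", "the", "school"], toy_vectors)
# "college" is not in the toy vocabulary, so it contributes no column
tran_matrix = sentence_matrix(["i", "go", "college"], toy_vectors)

print(ref_matrix.shape)   # (4, 5)
print(tran_matrix.shape)  # (4, 2)
```

Both matrices have `n_features` rows, but the column counts differ, so they cannot be compared element by element without first being reduced or padded to a common shape.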

### <u><b>Some test data used to compare the performances:</b></u>

In [6]:
reference = "I go to the school."
translation1 = "I go to school."
translation2 = "I go school."
translation3 = "I go."
translation4 = "I school the go."

# Pre-process all
reference = preprocess(reference)
translation1 = preprocess(translation1)
translation2 = preprocess(translation2)
translation3 = preprocess(translation3)
translation4 = preprocess(translation4)

### <u><b>Evaluation Logic 1:</b></u>

- To address the varying number of columns, we can add up the feature values of all words, thereby generating a single `n_feature x 1` vector for each of translation and reference
- A simple sum would not account for a word's position, so take a sum weighted by position instead
- So, for each of translation and reference, find `sentence_vector = sum(position * word_vector)`, with positions counted from 1
- Finally, compare the two resulting `n_feature x 1` dimensional `sentence_vectors` using any method of comparing vectors. Here, `cosine similarity` is used.
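To see why the position weighting matters, the sketch below compares a plain sum with a position-weighted sum on three hand-made, mutually orthogonal toy vectors. The vectors and helper names are assumptions for illustration only; real Word2Vec vectors are dense and not orthogonal.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def plain_sum(tokens, vectors):
    """Unweighted sum of word vectors: ignores word order."""
    return sum(vectors[t] for t in tokens)

def weighted_sum(tokens, vectors):
    """Position-weighted sum: the word at 1-based position i is scaled by i."""
    return sum((i + 1) * vectors[t] for i, t in enumerate(tokens))

# Hypothetical toy embeddings
toy_vectors = {
    "i": np.array([1.0, 0.0, 0.0]),
    "go": np.array([0.0, 1.0, 0.0]),
    "home": np.array([0.0, 0.0, 1.0]),
}
ref = ["i", "go", "home"]
scrambled = ["home", "go", "i"]

plain_score = cosine(plain_sum(ref, toy_vectors), plain_sum(scrambled, toy_vectors))
weighted_score = cosine(weighted_sum(ref, toy_vectors), weighted_sum(scrambled, toy_vectors))

print(plain_score)     # 1.0 up to rounding: same words, order is invisible
print(weighted_score)  # 10/14, about 0.714: the reordering lowers the score
```

One consequence of this weighting is that words later in the sentence are scaled by larger factors and therefore dominate the sentence vector.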

In [7]:
f_count = model.vector_size

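def find_score_1(reference, translation):
    # Keep only in-vocabulary words; looking up an unknown word in model.wv raises KeyError
    ref = [word for word in reference if word in model.wv.key_to_index]
    tran = [word for word in translation if word in model.wv.key_to_index]
    ref_val = np.zeros(f_count)
    tran_val = np.zeros(f_count)
    # Position-weighted sum: the word at (1-based) position i contributes i * word_vector
    for i, word in enumerate(ref, start=1):
        ref_val += i * model.wv[word]
    for i, word in enumerate(tran, start=1):
        tran_val += i * model.wv[word]

    denom = np.linalg.norm(ref_val) * np.linalg.norm(tran_val)
    if denom == 0: # no in-vocabulary words on one side
        return 0.0
    cos_sim = np.dot(ref_val, tran_val) / denom
    return cos_sim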


In [8]:
print("For translation1:")
print("BLEU Score: ", nltk.translate.bleu_score.sentence_bleu(reference,translation1))
print("Experimental Score: ", find_score_1(reference,translation1), "\n")

print("For translation2:")
print("BLEU Score: ", nltk.translate.bleu_score.sentence_bleu(reference,translation2))
print("Experimental Score: ", find_score_1(reference,translation2), "\n")

print("For translation3:")
print("BLEU Score: ", nltk.translate.bleu_score.sentence_bleu(reference,translation3))
print("Experimental Score: ", find_score_1(reference,translation3), "\n")

print("For translation4:")
print("BLEU Score: ", nltk.translate.bleu_score.sentence_bleu(reference,translation4))
print("Experimental Score: ", find_score_1(reference,translation4), "\n")

For translation1:
BLEU Score:  1.2882297539194154e-231
Experimental Score:  0.8900996736071406 

For translation2:
BLEU Score:  1.384292958842266e-231
Experimental Score:  0.7969980684799112 

For translation3:
BLEU Score:  1.5319719891192393e-231
Experimental Score:  0.5344254095203471 

For translation4:
BLEU Score:  1.2882297539194154e-231
Experimental Score:  0.8125226141641856 



The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


### <u>Observations for this evaluation Score:</u>

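- When words are dropped (`translation1` to `translation3`), the reported BLEU value wrongly rises (from about 1.29e-231 to 1.53e-231, due to the reduction in the number of non-matching n-grams), whereas the new evaluation score correctly falls (from 0.89 to 0.53). Note that all BLEU values here are effectively zero: as the warnings above state, no higher-order n-gram overlaps were counted.
- `translation4` is more scrambled than `translation1`, but BLEU gives both the same score (about 1.29e-231), as both contain an equal number of words in common with `reference`. The new evaluation score separates them, giving the scrambled version a lower score (0.81) than the correctly ordered one (0.89). The scrambled `translation4` does, however, still score slightly above `translation2` (0.80).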
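NLTK's `sentence_bleu` expects its first argument to be a list of reference sentences, each of which is itself a list of tokens. The calls in this notebook pass the single token list `reference`, so each token is treated as a separate reference. Together with the missing higher-order n-gram overlaps reported in the warnings, this is why the BLEU values are effectively zero. A minimal sketch of a call that wraps the reference in a list and applies smoothing, as the warning suggests, might look like this. The token lists are written out by hand (assumed to match the output of `preprocess`), and `method1` is just one of several available smoothing options.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hand-written token lists, assumed to match what preprocess() returns
ref_tokens = ["i", "go", "to", "the", "school"]
candidates = {
    "translation1": ["i", "go", "to", "school"],
    "translation2": ["i", "go", "school"],
    "translation3": ["i", "go"],
    "translation4": ["i", "school", "the", "go"],
}

# Smoothing avoids a zero score when some n-gram order has no overlap
smooth = SmoothingFunction().method1

bleu_scores = {
    # Note the extra brackets: a list containing one reference sentence
    name: sentence_bleu([ref_tokens], tokens, smoothing_function=smooth)
    for name, tokens in candidates.items()
}

for name, score in bleu_scores.items():
    print(name, round(score, 4))
```

The scores produced this way are not comparable with the BLEU values printed above, so the observations in this section refer to the original output only.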

### <u><b>Evaluation Logic 2:</b></u>

- To address both the varying number of columns and the importance of position, treat the columns missing from the `translation`'s sentence matrix as zeroes
- Finally, flatten the two resulting `number of features x number of words in reference sentence` dimensional matrices into `sentence_vectors` and compare them using any method of comparing vectors. Here, `cosine similarity` is used.
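Because each position is compared only with the same position in the other sentence, this scheme is sensitive to shifts. The sketch below illustrates that with mutually orthogonal toy vectors; the vectors and the helper `padded_cosine` are assumptions for this example. With real embeddings, different words are not orthogonal, so a shifted sentence would score low rather than exactly zero.

```python
import numpy as np

def padded_cosine(ref_vecs, tran_vecs, n_features):
    """Zero-pad the shorter sequence, flatten both, return cosine similarity."""
    length = max(len(ref_vecs), len(tran_vecs))

    def pad(vecs):
        return list(vecs) + [np.zeros(n_features)] * (length - len(vecs))

    ref_flat = np.hstack(pad(ref_vecs))
    tran_flat = np.hstack(pad(tran_vecs))
    denom = np.linalg.norm(ref_flat) * np.linalg.norm(tran_flat)
    return float(np.dot(ref_flat, tran_flat) / denom) if denom else 0.0

# Hypothetical toy embeddings for three different words
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
c = np.array([0.0, 0.0, 1.0])

reference_vecs = [a, b, c]
last_word_missing = [a, b]    # remaining words keep their positions
first_word_missing = [b, c]   # every remaining word shifts left by one

score_last = padded_cosine(reference_vecs, last_word_missing, 3)
score_first = padded_cosine(reference_vecs, first_word_missing, 3)

print(score_last)   # 2/sqrt(6), about 0.816
print(score_first)  # 0.0: no position holds the same word as the reference
```

Both candidates drop exactly one word, yet the one that shifts the remaining words out of alignment loses all credit in this toy setting.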

In [9]:
f_count = model.vector_size

def find_score_2(reference, translation):
    # One vector per position; assumes every reference word is in the model vocabulary
    ref_vecs = [model.wv[word] for word in reference]
    # A translation word that does not occur in the reference contributes a zero vector
    tran_vecs = [model.wv[word] if word in reference else np.zeros(f_count)
                 for word in translation]

    # Zero-pad the shorter sentence so both flatten to the same length
    # (this also covers translations that are longer than the reference)
    n_words = max(len(ref_vecs), len(tran_vecs))
    ref_vecs += [np.zeros(f_count)] * (n_words - len(ref_vecs))
    tran_vecs += [np.zeros(f_count)] * (n_words - len(tran_vecs))

    # Flatten
    ref_vec = np.hstack(ref_vecs)
    tran_vec = np.hstack(tran_vecs)

    denom = np.linalg.norm(ref_vec) * np.linalg.norm(tran_vec)
    if denom == 0: # e.g. no translation word occurs in the reference
        return 0.0
    cos_sim = np.dot(ref_vec, tran_vec) / denom
    return cos_sim


In [10]:
print("For translation1:")
print("BLEU Score: ", nltk.translate.bleu_score.sentence_bleu(reference,translation1))
print("Experimental Score: ", find_score_2(reference,translation1), "\n")

print("For translation2:")
print("BLEU Score: ", nltk.translate.bleu_score.sentence_bleu(reference,translation2))
print("Experimental Score: ", find_score_2(reference,translation2), "\n")

print("For translation3:")
print("BLEU Score: ", nltk.translate.bleu_score.sentence_bleu(reference,translation3))
print("Experimental Score: ", find_score_2(reference,translation3), "\n")

print("For translation4:")
print("BLEU Score: ", nltk.translate.bleu_score.sentence_bleu(reference,translation4))
print("Experimental Score: ", find_score_2(reference,translation4), "\n")

For translation1:
BLEU Score:  1.2882297539194154e-231
Experimental Score:  0.7842435191132129 

For translation2:
BLEU Score:  1.384292958842266e-231
Experimental Score:  0.6434229198926371 

For translation3:
BLEU Score:  1.5319719891192393e-231
Experimental Score:  0.7127484462590815 

For translation4:
BLEU Score:  1.2882297539194154e-231
Experimental Score:  0.23147301424872077 



### <u>Observations for this evaluation Score:</u>

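- Scrambling is penalised heavily: `translation4` gets by far the lowest score (0.23), since a word only matches fully when it sits at the same position as in `reference`.
- Dropping words is penalised less consistently: `translation3` (0.71) scores higher than `translation2` (0.64), even though `translation2` contains one more correct word. In `translation2`, `school` sits at the position of `to` in `reference`, so it is compared against a different word.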

### <u><b>Evaluation Logic 3:</b></u>

- For each word (word_vector), take the sum of its cosine similarities with the adjacent words and store it at the position of the word, thus constructing a vector for the entire sentence.
- Find the similarity between the sentence vectors

In [11]:
f_count = model.vector_size

def find_score_3(reference, translation):
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    def adjacency_profile(sentence):
        # For each word, sum its cosine similarity with the previous and the next word
        # (out-of-vocabulary words are dropped first, as model.wv would raise KeyError)
        vecs = [model.wv[word] for word in sentence if word in model.wv.key_to_index]
        profile = []
        for i in range(len(vecs)):
            total_sim = 0
            if i-1 >= 0:
                total_sim += cos(vecs[i-1], vecs[i])
            if i+1 < len(vecs):
                total_sim += cos(vecs[i], vecs[i+1])
            profile.append(total_sim)
        return profile

    ref_vec = adjacency_profile(reference)
    tran_vec = adjacency_profile(translation)

    # Zero-pad the shorter profile so both have the same length
    n_words = max(len(ref_vec), len(tran_vec))
    ref_vec = np.array(ref_vec + [0] * (n_words - len(ref_vec)))
    tran_vec = np.array(tran_vec + [0] * (n_words - len(tran_vec)))

    denom = np.linalg.norm(ref_vec) * np.linalg.norm(tran_vec)
    if denom == 0: # e.g. a one-word sentence has no adjacent words
        return 0.0
    cos_sim = np.dot(ref_vec, tran_vec) / denom
    return cos_sim


In [12]:
print("For translation1:")
print("BLEU Score: ", nltk.translate.bleu_score.sentence_bleu(reference,translation1))
print("Experimental Score: ", find_score_3(reference,translation1), "\n")

print("For translation2:")
print("BLEU Score: ", nltk.translate.bleu_score.sentence_bleu(reference,translation2))
print("Experimental Score: ", find_score_3(reference,translation2), "\n")

print("For translation3:")
print("BLEU Score: ", nltk.translate.bleu_score.sentence_bleu(reference,translation3))
print("Experimental Score: ", find_score_3(reference,translation3), "\n")

print("For translation4:")
print("BLEU Score: ", nltk.translate.bleu_score.sentence_bleu(reference,translation4))
print("Experimental Score: ", find_score_3(reference,translation4), "\n")

For translation1:
BLEU Score:  1.2882297539194154e-231
Experimental Score:  0.8768803745721149 

For translation2:
BLEU Score:  1.384292958842266e-231
Experimental Score:  0.7817830949725829 

For translation3:
BLEU Score:  1.5319719891192393e-231
Experimental Score:  0.8196272693811495 

For translation4:
BLEU Score:  1.2882297539194154e-231
Experimental Score:  0.7913911273644312 



### <u>Observations for this evaluation Score:</u>

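- `translation1` again scores highest (0.88), but the remaining scores lie close together (0.78 to 0.82).
- The ordering does not match the expected quality: `translation3` (0.82) and the scrambled `translation4` (0.79) both score higher than `translation2` (0.78).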

### <u><b>Moving Parts:</b></u>

- The accuracy of the word embedding model and scheme determines the word vectors, so it plays a crucial role.
    - Try different word embedding techniques (Word2Vec, fastText, GloVe) and see which performs best
    - Try experimenting with different parameters of each model
- How we use the word vectors to compute an evaluation score is important
    - Try ways of measuring similarity between vectors other than cosine similarity
    - Compare the results of the different similarity measures
    - Find other evaluation methods on 1D vectors as well as 2D matrices
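As a starting point for comparing similarity measures, the sketch below puts cosine similarity next to two distance measures on a pair of toy vectors. The function names and the example vectors are assumptions for illustration; none of these alternatives has been evaluated in this notebook.

```python
import numpy as np

def cosine_similarity_1d(a, b):
    """Angle-based similarity in [-1, 1]; ignores vector length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Straight-line distance; 0 means identical, sensitive to vector length."""
    return float(np.linalg.norm(a - b))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences; 0 means identical."""
    return float(np.abs(a - b).sum())

# Two toy vectors pointing in the same direction but with different lengths
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

cos_ab = cosine_similarity_1d(a, b)
euc_ab = euclidean_distance(a, b)
man_ab = manhattan_distance(a, b)

print(cos_ab)  # 1.0 up to rounding: same direction
print(euc_ab)  # sqrt(14), about 3.742
print(man_ab)  # 6.0
```

Cosine similarity treats the two vectors as identical while both distances do not. This matters for sentence vectors built by summation, whose length grows with the number of words.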