<a href="https://colab.research.google.com/github/ARBML/adawat/blob/main/notebooks/DiaLex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Disclaimer: This notebook is a modified version of the one at https://github.com/UBC-NLP/dialex



# Evaluating Word Embeddings [Description]

*The favored method for measuring the performance of our models is the analogy task. The benchmark files were generated by forming combinations of the w1 and w2 words of each relation in every dialect CSV file. Each benchmark file consists of analogies, where lines are 4-tuples of words, split into sections by “: SECTION NAME” lines. The Gensim library was used to evaluate the models, and accuracy was reported for each section separately, with an aggregate summary at the end. The chosen method for solving analogies is the vector offset (3CosAdd), which falls under the pair-based methods. A proportional analogy holds between two word pairs: a:a\* :: b:b\* (a is to a\* as b is to b\*). For example, Tokyo is to Japan as Paris is to France. With the pair-based methods, given a:a\* :: b:?, the task is to find b\*. To better capture a model’s accuracy, the final aggregated report includes top-1, top-5, and top-10 accuracy for each section, where the b\* word is searched for among the model’s top-k most similar words. Furthermore, the accuracies are computed both with dummy4unknown=true, which produces zero accuracies for 4-tuples with out-of-vocabulary words, and with dummy4unknown=false, in which such tuples are skipped entirely and not used in the evaluation.*
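The vector-offset idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the 4-dimensional vectors are invented toy values (not taken from any trained model), and `unit`, `three_cos_add` and `hit_at_k` are hypothetical helper names, not part of Gensim.

```python
import math

# Toy embeddings invented for illustration: [is_country, japan, france, germany]
vectors = {
    "tokyo":   [0.0, 1.0, 0.0, 0.0],
    "japan":   [1.0, 1.0, 0.0, 0.0],
    "paris":   [0.0, 0.0, 1.0, 0.0],
    "france":  [1.0, 0.0, 1.0, 0.0],
    "berlin":  [0.0, 0.0, 0.0, 1.0],
    "germany": [1.0, 0.0, 0.0, 1.0],
}

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def three_cos_add(a, a_star, b, vectors, topn=1):
    """Rank candidates for b* by cosine similarity to (a* - a + b)."""
    ua, ua_star, ub = unit(vectors[a]), unit(vectors[a_star]), unit(vectors[b])
    target = unit([y - x + z for x, y, z in zip(ua, ua_star, ub)])
    scores = {
        word: sum(p * q for p, q in zip(unit(vec), target))
        for word, vec in vectors.items()
        if word not in (a, a_star, b)  # the three query words are excluded
    }
    return sorted(scores, key=scores.get, reverse=True)[:topn]

def hit_at_k(a, a_star, b, b_star, vectors, topn):
    """True if the expected word b* is among the top-k candidates."""
    return b_star in three_cos_add(a, a_star, b, vectors, topn)

print(three_cos_add("tokyo", "japan", "paris", vectors, topn=2))  # ['france', 'germany']
print(hit_at_k("tokyo", "japan", "paris", "france", vectors, topn=1))  # True
```

Top-k accuracy for a section is then the fraction of its 4-tuples for which `hit_at_k` is true.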


# Gensim Function Signature --> evaluate_word_analogies()

**evaluate_word_analogies(analogies, restrict_vocab=300000, case_insensitive=True, dummy4unknown=False)**
*Compute performance of the model on an analogy test set.*

    This is a modern variant of accuracy(), see discussion on GitHub #1935. This method corresponds to the compute-accuracy script of the original C word2vec. See also Analogy (State of the art).

**Parameters**:	

- **analogies (str)** – Path to file, where lines are 4-tuples of words, split into sections by “: SECTION NAME” lines. See gensim/test/test_data/questions-words.txt as example.

- **restrict_vocab (int, optional)** – Ignore all 4-tuples containing a word not in the first restrict_vocab words. This may be meaningful if you’ve sorted the model vocabulary by descending frequency (which is standard in modern word embedding models).

- **case_insensitive (bool, optional)** – If True - convert all words to their uppercase form before evaluating the performance. Useful to handle case-mismatch between training tokens and words in the test set. In case of multiple case variants of a single word, the vector for the first occurrence (also the most frequent if vocabulary is sorted) is taken.

- **dummy4unknown (bool, optional)** – If True - produce zero accuracies for 4-tuples with out-of-vocabulary words. Otherwise, these tuples are skipped entirely and not used in the evaluation.
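To make the file layout and the effect of `dummy4unknown` concrete, here is a minimal sketch. It assumes nothing beyond the description above: `read_analogies` and `section_accuracy` are hypothetical helpers (not Gensim functions), the sample words are invented, and returning `None` for a section with no scored tuples is an assumption of this sketch.

```python
def read_analogies(lines):
    """Group 4-tuples by section; a ': SECTION NAME' line starts a new section."""
    sections = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(": "):
            current = line[2:]
            sections[current] = []
        elif current is None:
            raise ValueError("4-tuple found before any ': SECTION NAME' line")
        else:
            sections[current].append(tuple(line.split()))
    return sections

def section_accuracy(outcomes, dummy4unknown):
    """outcomes: True (correct), False (incorrect) or None (tuple has an OOV word)."""
    if dummy4unknown:
        scored = [bool(o) for o in outcomes]            # OOV tuples count as wrong
    else:
        scored = [o for o in outcomes if o is not None]  # OOV tuples are skipped
    return sum(scored) / len(scored) if scored else None

# Invented sample in the benchmark layout
sample = """\
: plural
book books car cars
car cars book books
: opposite
hot cold big small
"""
sections = read_analogies(sample.splitlines())
print({name: len(rows) for name, rows in sections.items()})  # {'plural': 2, 'opposite': 1}

# One correct, one incorrect, two out-of-vocabulary tuples
outcomes = [True, False, None, None]
print(section_accuracy(outcomes, dummy4unknown=False))  # 0.5
print(section_accuracy(outcomes, dummy4unknown=True))   # 0.25
```

The same four outcomes give 50% when OOV tuples are skipped and 25% when they are penalized, which is why the report lists both settings.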

# Importing Libraries

In [1]:
import logging
import glob 
import pandas as pd 
import pickle
from smart_open import open
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from gensim.models import keyedvectors
import xlwt
import numpy as np

# Defining Constants

**MODEL_PATH** - Path to the pre-trained model

**BENCHMARKS_PATH** – Glob pattern matching the benchmark files. In each file, lines are 4-tuples of words, split into sections by “: SECTION NAME” lines. See gensim/test/test_data/questions-words.txt as example.

**TOP_K** - List of k values. The final aggregated report includes a top-1, top-5, and top-10 accuracy result for each section, in which the b* word is searched for in the model’s top-k most similar words.

**DUMMY4UKNOWN** – List of values passed as dummy4unknown, one evaluation per value. If True, produce zero accuracies for 4-tuples with out-of-vocabulary words. Otherwise, these tuples are skipped entirely and not used in the evaluation.
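The benchmark files matched by BENCHMARKS_PATH are described as combinations of the word pairs within each relation. One way such a section could be produced is sketched below. This is an assumption for illustration, not the original generation script: `build_section` is a hypothetical helper, the word pairs are invented, and whether the original benchmarks used ordered or unordered combinations is not stated in this notebook.

```python
from itertools import permutations

def build_section(name, pairs):
    """Turn (w1, w2) pairs of one relation into a ': NAME' header plus 4-tuples."""
    lines = [f": {name}"]
    # Every ordered choice of two distinct pairs yields one analogy a:a* :: b:b*
    for (a, a_star), (b, b_star) in permutations(pairs, 2):
        lines.append(f"{a} {a_star} {b} {b_star}")
    return lines

# Invented (w1, w2) pairs for a single relation
pairs = [("book", "books"), ("car", "cars"), ("house", "houses")]
section = build_section("plural", pairs)
print("\n".join(section[:3]))
```

With n pairs in a relation, this ordered variant yields n * (n - 1) analogies, so three pairs give six 4-tuples.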


In [2]:
!git clone https://github.com/UBC-NLP/dialex

Cloning into 'dialex'...
remote: Enumerating objects: 22, done.
remote: Counting objects: 100% (22/22), done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 22 (delta 6), reused 12 (delta 2), pack-reused 0
Unpacking objects: 100% (22/22), done.


In [24]:
MODEL_PATH = "full_grams_sg_100_twitter.mdl"
BENCHMARKS_PATH = "dialex/benchmarks/*.txt"
TOP_K = [1,5,10]
DUMMY4UKNOWN = [True,False]
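
With these constants, each benchmark file is evaluated once per (k, dummy4unknown) combination, six runs in total. The sketch below enumerates the runs and maps each one to a result column, following the sheet layout used by the workbook code in this notebook (columns 1-3 without OOV penalty, 4-6 with it). `sheet_column` is a hypothetical helper introduced only for this illustration, and the constants are repeated so the snippet runs on its own.

```python
from itertools import product

TOP_K = [1, 5, 10]
DUMMY4UKNOWN = [True, False]

def sheet_column(k, dummy):
    """Column of the result sheet for one (top-k, OOV penalty) run."""
    base = 4 if dummy else 1  # columns 1-3: no penalty, columns 4-6: penalty
    return base + TOP_K.index(k)

runs = list(product(TOP_K, DUMMY4UKNOWN))
for k, dummy in runs:
    print(f"top-{k:<2} dummy4unknown={dummy!s:<5} -> column {sheet_column(k, dummy)}")
```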

# Evaluation + Writing to XLS Workbook

In [21]:
!wget https://bakrianoo.ewr1.vultrobjects.com/aravec/full_grams_sg_100_twitter.zip
!unzip full_grams_sg_100_twitter.zip

--2022-10-05 16:45:08--  https://bakrianoo.ewr1.vultrobjects.com/aravec/full_grams_sg_100_twitter.zip
Resolving bakrianoo.ewr1.vultrobjects.com (bakrianoo.ewr1.vultrobjects.com)... 108.61.0.122, 2001:19f0:0:22::100
Connecting to bakrianoo.ewr1.vultrobjects.com (bakrianoo.ewr1.vultrobjects.com)|108.61.0.122|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1132025304 (1.1G) [application/zip]
Saving to: ‘full_grams_sg_100_twitter.zip’


2022-10-05 16:45:13 (242 MB/s) - ‘full_grams_sg_100_twitter.zip’ saved [1132025304/1132025304]



In [25]:
# A log file will be created as "logger_{modelname}" (no file extension is added)
logging.basicConfig(filename="logger_"+MODEL_PATH.split("/")[-1].split(".")[0], format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Load a pre-trained model

In [26]:
# Loading pre-trained model 

# model = KeyedVectors.load_word2vec_format(MODEL_PATH)

model = Word2Vec.load(MODEL_PATH)

# OR 
# model = KeyedVectors.load(MODEL_PATH)

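# OR (binary word2vec format; Word2Vec.load_word2vec_format is deprecated in favor of KeyedVectors)
# model = KeyedVectors.load_word2vec_format(MODEL_PATH, binary=True)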

# OR 
#model = KeyedVectors.load_word2vec_format(fname=MODEL_PATH,fvocab=None,binary=True, encoding='utf-8', unicode_errors='ignore')

In [None]:
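# `analogies` is a required argument; as a quick check, evaluate the first benchmark file
model.wv.evaluate_word_analogies(sorted(glob.glob(BENCHMARKS_PATH))[0])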

In [None]:
style0 = xlwt.easyxf('font: name Times New Roman, bold on')
# Creating an Excel workbook that will contain one sheet per dialect benchmark file
wb = xlwt.Workbook()

# Iterating over each benchmark file to compute accuracy
for benchmark in glob.iglob(BENCHMARKS_PATH): 
    
    # Creating a spreadsheet for every benchmark file
    ws = wb.add_sheet(benchmark.split("/")[-1].split(".")[0])

########################### XLS Formatting ##############################################################

    ws.write(0, 0, "OOV Penalty", style0)
    for i in range(1,4):
        ws.write(0, i, "false", style0)
    for i in range(4,7):
        ws.write(0, i, "true", style0)

    ws.write(1, 0, "Relation / Top-K", style0)
    
    
    ws.write(1, 1, "Top-1", style0)
    ws.write(1, 2, "Top-5", style0)
    ws.write(1, 3, "Top-10", style0)
    ws.write(1, 4, "Top-1", style0)
    ws.write(1, 5, "Top-5", style0)
    ws.write(1, 6, "Top-10", style0)
    
    ws.write(2, 0, "Double", style0)
    ws.write(3, 0, "Plural", style0)
    ws.write(4, 0, "Genitive Past Tense", style0)
    ws.write(5, 0, "Opposite", style0)
    ws.write(6, 0, "Comparative", style0)
    ws.write(7, 0, "Man-Woman", style0)
    ws.write(8, 0, "Total Accuracy", style0)

###########################################################################################################

    n_false = 1
    n_true = 4 
    
    # Computing accuracy over top-k most similar words
    for k in TOP_K:
        #model = keyedvectors.WordEmbeddingsKeyedVectors.load(MODEL_PATH)
        print(MODEL_PATH.split("/")[-1].split(".")[0] + "_" + benchmark.split("/")[-1].split(".")[0])
        
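        # Computing accuracy over top-k most similar words when penalizing and not penalizing out-of-vocabulary words
        for dummy in DUMMY4UKNOWN:     
            print("Top",k,"---> dummy4unknown="+str(dummy))
            # NOTE: stock gensim's evaluate_word_analogies takes no `topn` argument and returns a
            # (score, sections) tuple; the `topn` keyword and the dict keyed by section name used
            # below assume a modified gensim build that supports them.
            sections_accuracy = model.wv.evaluate_word_analogies(benchmark,topn=int(k),dummy4unknown=dummy)
            print(sections_accuracy,end="\n\n")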
            
            # Column for this run: n_false (columns 1-3) without OOV penalty, n_true (columns 4-6) with it
            col = n_true if dummy else n_false
            section_keys = ('double', 'plural', 'genitive_past_tense', 'opposite',
                            'comparative', 'man_woman', 'total')
            # Rows 2-8 follow the row labels written above; accuracies are stored as percentages
            for row, key in enumerate(section_keys, start=2):
                accuracy = sections_accuracy[key]
                ws.write(row, col, "None" if accuracy is None else accuracy * 100)
            # Advance the column counter that was just used
            if dummy:
                n_true = n_true + 1
            else:
                n_false = n_false + 1
    print(50*"*")
    
# Saving workbook as : "{modelname}.xls"
wb.save(MODEL_PATH.split("/")[-1].split(".")[0]+".xls") 