![MSE Logo](https://moodle.msengineering.ch/pluginfile.php/1/core_admin/logo/0x150/1643104191/logo-mse.png)

# AdvNLP Lab 4 GRADED: Testing a pretrained word2vec model on analogy tasks

**Objectives:**  experiment with *word vectors* from word2vec: test them on analogy tasks; use *accuracy and MRR* (Mean Reciprocal Rank) scores.

**Useful documentation:** the [section on KeyedVectors in Gensim](https://radimrehurek.com/gensim/models/keyedvectors.html) and possibly the [section on word2vec](https://radimrehurek.com/gensim/models/word2vec.html).

## 1. Word2vec model trained on Google News
**1a.** Please install the latest version of Gensim, preferably in a Conda environment. 

In [1]:
# !pip install --upgrade gensim
# You can run the following verification:
!pip show gensim

Name: gensim
Version: 4.3.3
Summary: Python framework for fast Vector Space Modelling
Home-page: https://radimrehurek.com/gensim/
Author: Radim Rehurek
Author-email: me@radimrehurek.com
License: LGPL-2.1-only
Location: /Users/jaron/workspace/mse_advnlp/.venv/lib/python3.12/site-packages
Requires: numpy, scipy, smart-open
Required-by: 


In [2]:
import gensim, os, random
from gensim import downloader
from gensim.test.utils import datapath
from gensim.models import KeyedVectors
from gensim import utils
# help(gensim.models.word2vec) # take a look if needed
import time
import itertools
import psutil
import logging

**1b.** Please download from Gensim the `word2vec-google-news-300` model, upon your first use.  Then, please write code to answer the following questions:
* Where is the model stored on your computer and what is the file name?  You can store the absolute path in a variable called `path_to_model_file`.
* What is the size of the corresponding file?  Please display the size in gigabytes with two decimals.

In [3]:
# Download the model from Gensim (needed only the first time)
# gensim.downloader.load("word2vec-google-news-300")
# No need to store the returned value (uses a lot of memory).

In [4]:
# Please write your Python code below and execute it.
def find_word2vec_model(start_dir='/', filename='word2vec-google-news-300.gz'):
    for root, dirs, files in os.walk(start_dir):
        if filename in files:
            return os.path.join(root, filename)
    return None

path_to_model_file = find_word2vec_model(start_dir=os.path.expanduser('~'))

if path_to_model_file:
    file_size = os.path.getsize(path_to_model_file)
    print(f'File path: {path_to_model_file}')
    print(f'File size: {file_size / (1024 * 1024 * 1024):.2f} GB')
else:
    print('File not found.')

File path: /Users/jaron/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz
File size: 1.62 GB


**1c.** Please load the word2vec model as an instance of the class `KeyedVectors`, and store it in a variable called `wv_model`. 
What is, at this point, the memory size of the process corresponding to this notebook?  Simply write the value you obtain from any OS-specific utility that you like.

In [5]:
# Please write your Python code below and execute it.  Write the memory size on a commented line.
try:
    wv_model = KeyedVectors.load_word2vec_format(path_to_model_file, binary=True)
    print('Model loaded successfully')
except Exception as e:
    print(f'Error loading model: {e}.')

# Resident set size of the current process, converted from bytes to gigabytes
process = psutil.Process(os.getpid())
mem_usage_gb = process.memory_info().rss / (1024 * 1024 * 1024)

print(f'Memory usage: {mem_usage_gb:.2f} GB')

Model loaded successfully
Memory usage: 3.80 GB


**1d.** Please write the instructions that generate the answers to the following questions.
* What is the size of the vocabulary of the `wv_model` model?  
* What is the dimensionality of each word vector?  
* What is the word corresponding to the vector in position 1234?  
* What are the first 10 coefficients of the word vector for the word *pyramid*?  

In [6]:
# Please write your Python code below and execute it.
print(f'Vocab size: {len(wv_model.key_to_index):,}')
print(f'Dimension of each word vector: {wv_model.vector_size}')
print(f'Word corresponding to idx=1234: {wv_model.index_to_key[1234]}')
print(f'The first 10 coefficients for the word pyramid are:\n{wv_model.get_vector("pyramid")[:10]}')

Vocab size: 3,000,000
Dimension of each word vector: 300
Word corresponding to idx=1234: learn
The first 10 coefficients for the word pyramid are:
[ 0.00402832 -0.00260925  0.04296875  0.19433594 -0.03979492 -0.06445312
  0.42773438 -0.18359375 -0.27148438 -0.12890625]
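
As a rough cross-check of the memory figure from 1c, the vocabulary size and dimensionality shown above give a back-of-the-envelope estimate of the size of the embedding matrix. The sketch below assumes that each coefficient is stored as a 32-bit float (which, as far as we know, is Gensim's default when loading the word2vec format); the variable names are illustrative only.

```python
# Back-of-the-envelope estimate of the embedding matrix size.
# Assumption: one 32-bit float (4 bytes) per coefficient.
vocab_size = 3_000_000      # vocabulary size reported above
vector_size = 300           # dimensionality reported above
bytes_per_float32 = 4

matrix_bytes = vocab_size * vector_size * bytes_per_float32
matrix_gb = matrix_bytes / (1024 * 1024 * 1024)  # same conversion as in 1c

print(f'Estimated matrix size: {matrix_gb:.2f} GB')
```

Under this assumption the matrix alone accounts for most of the 3.80 GB measured for the process; the remainder is plausibly taken up by the vocabulary index and the Python runtime.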


## 2. Solving analogies using word2vec trained on Google News
In this section, you are going to use word vectors to solve analogy tasks provided with Gensim, such as "What is to France what Rome is to Italy?".  The predefined function in Gensim that evaluates a model on this task does not provide enough details, so you will need to make modifications to it.
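
To make the idea concrete, here is a minimal sketch of the vector offset method on hand-made toy vectors. The three-dimensional vectors, the word list, and the helper names `cosine` and `solve_analogy` are all invented for illustration; they are not real word2vec values, and Gensim's own implementation differs in its details (for example, it normalizes the vectors first).

```python
import math

# Hand-made toy vectors (NOT real word2vec values).
# Informal meaning of the dimensions: (capital-ness, Italy-ness, France-ness).
toy_vectors = {
    "Italy":  [0.0, 1.0, 0.0],
    "Rome":   [1.0, 1.0, 0.0],
    "France": [0.0, 0.0, 1.0],
    "Paris":  [1.0, 0.0, 1.0],
    "Berlin": [1.0, 0.0, 0.0],
    "pizza":  [0.0, 0.9, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def solve_analogy(a, b, c, vectors):
    """Return the word closest to b - a + c, ignoring the three input words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    scores = {w: cosine(target, v) for w, v in vectors.items() if w not in {a, b, c}}
    return max(scores, key=scores.get)

# "Italy is to Rome what France is to ...?"
answer = solve_analogy("Italy", "Rome", "France", toy_vectors)
print(answer)
```

The input words are excluded from the candidates because the vector `b - a + c` usually stays very close to `b` and `c` themselves.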

**2a.** The analogy tasks are stored in a text file called `questions-words.txt` which is typically found in `C:\Users\YourNameHere\.conda\envs\YourEnvNameHere\Lib\site-packages\gensim\test\test_data`.  You can access it from here with Gensim as `datapath('questions-words.txt')`.  

Please create a file called `questions-words-100.txt` with the first 100 lines from the original file.  Please run the evaluation task on this file, using the [documentation of the KeyedVectors class](https://radimrehurek.com/gensim/models/keyedvectors.html), then answer the following questions:
* How many analogy tasks are there in your `questions-words-100.txt` file?
* How many analogies were solved correctly and how many incorrectly?
* What is the accuracy returned by `evaluate_word_analogies`?
* How much time did it take to solve the analogies?

In [7]:
# Please write your Python code below and execute it.
file_path = datapath('questions-words.txt')
new_file_path = datapath('questions-words-100.txt')

with open(file_path, 'r') as file:
    first_100_lines = [file.readline() for _ in range(100)]

with open(new_file_path, 'w') as new_file:
    new_file.writelines(first_100_lines)

print(f'The first 100 lines were successfully written to {new_file_path}')

The first 100 lines were successfully written to /Users/jaron/workspace/mse_advnlp/.venv/lib/python3.12/site-packages/gensim/test/test_data/questions-words-100.txt


In [8]:
start_time = time.time()
analogy_scores = wv_model.evaluate_word_analogies(datapath('questions-words-100.txt'))
end_time = time.time()
elapsed_time = end_time - start_time

In [None]:
correct_analogies = len(analogy_scores[1][-1].get('correct'))
incorrect_analogies = len(analogy_scores[1][-1].get('incorrect'))
all_analogies = correct_analogies + incorrect_analogies

print(f'Total analogies: {all_analogies}')
print(f'Correct analogies: {correct_analogies}')
print(f'Incorrect analogies: {incorrect_analogies}')
print(f'Accuracy: {analogy_scores[0]:.4f}')
print(f'It took the algorithm {elapsed_time:.2f} seconds to solve all {all_analogies} analogies')

Total analogies: 99
Correct analogies: 80
Incorrect analogies: 19
Accuracy: 0.8081
It took the algorithm 4.45 seconds to solve all 99 analogies


**2b.** Please answer in writing the following questions:
* What is the meaning of the first line of `questions-words-100.txt`?
* How many analogies are there in the original `questions-words.txt`?
* How much time would it take to solve the original set of analogies?
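
For orientation, the file mixes section headers (lines starting with `:`) with analogy lines made of four words, which is why a 100-line excerpt with one header holds 99 analogies. The sketch below counts both kinds of lines on a small in-memory sample; the second section name and its analogy line are hypothetical placeholders, not content from the real file.

```python
import io

# Small in-memory sample that imitates the file format.
# ": toy-section" and "a b c d" are hypothetical placeholders.
sample = io.StringIO(
    ": capital-common-countries\n"
    "Beijing China Tokyo Japan\n"
    "Bangkok Thailand London England\n"
    ": toy-section\n"
    "a b c d\n"
)

n_sections = 0
n_analogies = 0
for line in sample:
    if line.startswith(":"):
        n_sections += 1          # section header
    elif len(line.split()) == 4:
        n_analogies += 1         # analogy: a b c expected

print(f'{n_sections} sections, {n_analogies} analogies')
```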

In [10]:
# Please write your answers here.
with open(datapath('questions-words-100.txt')) as file:
    print(file.readline())

# Q: What is the meaning of the first line of 'questions-words-100.txt'?
# A: It indicates that the analogies in this section relate to the capitals of common countries.

: capital-common-countries



In [11]:
with open(file_path, 'r') as file:
    num_analogies = sum(1 for line in file if not line.startswith(':'))

estimated_time = (elapsed_time / all_analogies * num_analogies)
print(f'There are {num_analogies:,} analogies in "questions-words.txt".')
print(f'It would approx. take {(estimated_time / 60):.2f} minutes to solve all analogies')

There are 19,544 analogies in "questions-words.txt".
It would approx. take 14.64 minutes to solve all analogies


**2c.** The built-in function from Gensim has several weaknesses, which you will address here.  Please copy the source code of the function `evaluate_word_analogies` from the file `gensim\models\keyedvectors.py` and create here a new function which will improve the built-in one as follows.  The function will be called `my_evaluate_word_analogies` and you will also pass it the model as the first argument.  Overall, please proceed gradually and only make minimal modifications, to ensure you don't break the function.  It is important to first understand the structure of the returned values, `analogies_score` and `sections`. 

* Modify the line where `section['incorrect']` is assembled in order to also add to each analogy the *incorrect guess* (i.e. the answer that the model ranked first, which turned out to be wrong).

* Modify the code so that when `section['incorrect']` is assembled, you also add the *rank of the correct answer* among the candidates returned by the system (after the incorrect guess).  If the correct answer is not present at all, then code the rank as 0.
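
The rank lookup described in the second point can be sketched in isolation as follows. This is only an illustration: the helper name `rank_of_expected`, the candidate words, and the similarity scores are made up, and the candidate list merely imitates the `(word, similarity)` pairs returned by `most_similar`.

```python
def rank_of_expected(candidates, expected, case_insensitive=True):
    """Return the 1-based rank of `expected` among the candidates, or 0 if absent."""
    for position, (word, _similarity) in enumerate(candidates, start=1):
        key = word.upper() if case_insensitive else word
        if key == expected:
            return position
    return 0

# Hypothetical candidates with made-up similarity scores.
candidates = [("Britain", 0.71), ("England", 0.68), ("UK", 0.64)]

print(rank_of_expected(candidates, "ENGLAND"))   # correct answer is the 2nd candidate
print(rank_of_expected(candidates, "SCOTLAND"))  # correct answer is not among the candidates
```

Because the rank is searched only among the candidates that were requested, the value of `topn` determines how deep a correct answer can be found before the rank falls back to 0.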

In [12]:
logger = logging.getLogger(__name__)

def my_evaluate_word_analogies(model, analogies, restrict_vocab=300000, case_insensitive=True):
    dummy4unknown = False

    ok_keys = model.index_to_key[:restrict_vocab]
    if case_insensitive:
        ok_vocab = {k.upper(): model.get_index(k) for k in reversed(ok_keys)}
    else:
        ok_vocab = {k: model.get_index(k) for k in reversed(ok_keys)}
    oov = 0
    logger.info("Evaluating word analogies for top %i words in the model on %s", restrict_vocab, analogies)
    sections, section = [], None
    quadruplets_no = 0
    with utils.open(analogies, 'rb') as fin:
        for line_no, line in enumerate(fin):
            line = utils.to_unicode(line)
            if line.startswith(': '):
                # a new section starts => store the old section
                if section:
                    sections.append(section)
                    model._log_evaluate_word_analogies(section)
                section = {'section': line.lstrip(': ').strip(), 'correct': [], 'incorrect': []}
            else:
                if not section:
                    raise ValueError("Missing section header before line #%i in %s" % (line_no, analogies))
                try:
                    if case_insensitive:
                        a, b, c, expected = [word.upper() for word in line.split()]
                    else:
                        a, b, c, expected = [word for word in line.split()]
                except ValueError:
                    logger.info("Skipping invalid line #%i in %s", line_no, analogies)
                    continue
                quadruplets_no += 1
                if a not in ok_vocab or b not in ok_vocab or c not in ok_vocab or expected not in ok_vocab:
                    oov += 1
                    if dummy4unknown:
                        logger.debug('Zero accuracy for line #%d with OOV words: %s', line_no, line.strip())
                        # keep the six-term layout: no guess available, rank coded as 0
                        section['incorrect'].append((a, b, c, expected, None, 0))
                    else:
                        logger.debug("Skipping line #%i with OOV words: %s", line_no, line.strip())
                    continue
                original_key_to_index = model.key_to_index
                model.key_to_index = ok_vocab
                ignore = {a, b, c}  # input words to be ignored
                predicted = None
                # find the most likely prediction using 3CosAdd (vector offset) method

                sims = model.most_similar(positive=[b, c], negative=[a], topn=5, restrict_vocab=restrict_vocab)
                model.key_to_index = original_key_to_index
                for element in sims:
                    predicted = element[0].upper() if case_insensitive else element[0]
                    if predicted in ok_vocab and predicted not in ignore:
                        if predicted != expected:
                            logger.debug("%s: expected %s, predicted %s", line.strip(), expected, predicted)
                        break
                if predicted == expected:
                    section['correct'].append((a, b, c, expected))
                else:
                    rank = next((i+1 for i, (word, _) in enumerate(sims) if (word.upper() if case_insensitive else word) == expected), 0)
                    section['incorrect'].append((a, b, c, expected, predicted, rank))
    if section:
        # store the last section, too
        sections.append(section)
        model._log_evaluate_word_analogies(section)

    total = {
        'section': 'Total accuracy',
        'correct': list(itertools.chain.from_iterable(s['correct'] for s in sections)),
        'incorrect': list(itertools.chain.from_iterable(s['incorrect'] for s in sections)),
    }

    oov_ratio = float(oov) / quadruplets_no * 100
    logger.info('Quadruplets with out-of-vocabulary words: %.1f%%', oov_ratio)
    if not dummy4unknown:
        logger.info(
            'NB: analogies containing OOV words were skipped from evaluation! '
            'To change this behavior, use "dummy4unknown=True"'
        )
    analogies_score = model._log_evaluate_word_analogies(total)
    sections.append(total)
    # Return the overall score and the full lists of correct and incorrect analogies
    return analogies_score, sections

**2d.** Please run the `my_evaluate_word_analogies` function on `questions-words-100.txt` and then write instructions to display, from the results stored in `analogy_scores`:
* one incorrectly-solved analogy (selected at random), including also the error made by the model and the rank of the correct answer, thus adding:
  - a fifth word, which is the incorrect one found by the model
  - a sixth term, which is the integer indicating the rank (or 0)
* one correctly-solved analogy selected at random (in principle, four terms).
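
As a reminder of the expected layout, the sketch below builds a small mock result with the same shape as the pair returned by the evaluation function and then samples from it. The mock score and the single-section content are invented for illustration; only the two tuples reuse analogies that appear in this notebook's outputs.

```python
import random

# Mock result shaped like (analogies_score, sections); the score is made up.
mock_result = (
    0.5,
    [
        {
            "section": "capital-common-countries",
            "correct": [("BEIJING", "CHINA", "TOKYO", "JAPAN")],
            "incorrect": [("BANGKOK", "THAILAND", "LONDON", "ENGLAND", "BRITAIN", 0)],
        },
        {
            "section": "Total accuracy",
            "correct": [("BEIJING", "CHINA", "TOKYO", "JAPAN")],
            "incorrect": [("BANGKOK", "THAILAND", "LONDON", "ENGLAND", "BRITAIN", 0)],
        },
    ],
)

score, sections = mock_result
total = sections[-1]  # the last dictionary aggregates all sections

# Incorrect analogies carry six terms: a, b, c, expected, wrong guess, rank.
a, b, c, expected, guess, rank = random.choice(total["incorrect"])
print(f'{a} : {b} :: {c} : ?  expected {expected}, guessed {guess}, rank {rank}')

# Correct analogies carry four terms.
a, b, c, expected = random.choice(total["correct"])
print(f'{a} : {b} :: {c} : {expected}')
```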

In [13]:
# Please write your Python code below and execute it.
analogy_scores = my_evaluate_word_analogies(wv_model, datapath('questions-words-100.txt'))

In [14]:
incorrect_sample = random.choice(analogy_scores[1][-1].get('incorrect'))
print(f'A random incorrectly-solved analogy is: {incorrect_sample}')

A random incorrectly-solved analogy is: ('BANGKOK', 'THAILAND', 'LONDON', 'ENGLAND', 'BRITAIN', 0)


In [15]:
correct_sample = random.choice(analogy_scores[1][-1].get('correct'))
print(f'A random correctly-solved analogy is: {correct_sample}')

A random correctly-solved analogy is: ('BEIJING', 'CHINA', 'TOKYO', 'JAPAN')


**2e.** Please write a function to compute the MRR score given a structure with correctly and incorrectly solved analogies, such as the one that is found in the results from `evaluate_word_analogies`.  The structure is not divided into categories.

The Mean Reciprocal Rank (please use the [formula here](https://en.wikipedia.org/wiki/Mean_reciprocal_rank)) gives some credit for incorrectly solved analogies, in inverse proportion to the rank of the correct answer among the candidates.  The reciprocal rank is 1 for correctly solved analogies (full credit), and 1/k for incorrectly solved ones, where k is the rank of the correct answer (or 0 if the correct answer is not among the candidates).
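
A small worked example may help to see how MRR differs from accuracy. The sketch below works directly on a list of ranks rather than on the result structure, and the four ranks are made up for illustration; `mean_reciprocal_rank` is a hypothetical helper name, not the function requested in this exercise.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all items; a rank of 0 (answer absent) contributes 0."""
    if not ranks:
        return 0.0
    return sum(1.0 / r for r in ranks if r > 0) / len(ranks)

# Made-up ranks: two analogies solved correctly (rank 1),
# one with the correct answer at rank 2, one where it is absent (rank 0).
ranks = [1, 1, 2, 0]

accuracy = sum(1 for r in ranks if r == 1) / len(ranks)
mrr = mean_reciprocal_rank(ranks)

print(f'Accuracy: {accuracy:.3f}, MRR: {mrr:.3f}')
```

Here the MRR is (1 + 1 + 1/2 + 0) / 4 = 0.625, while the accuracy is 2 / 4 = 0.5, so MRR can never be lower than accuracy on the same results.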

In [16]:
# Please define here the function that computes MRR from the information stored in analogy_scores
def myMRR(analogies):
    correct = analogies.get('correct', [])
    incorrect = analogies.get('incorrect', [])

    n_corr = len(correct)
    n_incorr = len(incorrect)

    rep_rank = sum(1 / sample[-1] for sample in incorrect if sample[-1] >= 1)
    
    rep_rank += n_corr

    return rep_rank / (n_corr + n_incorr)

In [None]:
# Please test your MRR function by running the following code, which displays the total number of analogy tasks,
# the number of different categories (sections), the accuracy of the results (proportion of correctly
# solved analogies), and the MRR score of the results:
print(f'Total number of analogies: {len(analogy_scores[1][-1].get("correct")) + len(analogy_scores[1][-1].get("incorrect"))}')  # The last dictionary is the total
print(f'Total number of categories: {len(analogy_scores[1]) - 1}') # the "total" is excluded 
print(f'Overall accuracy: {analogy_scores[0]:.2f} and MRR: {myMRR(analogy_scores[1][-1]):.2f}')

Total number of analogies: 99
Total number of categories: 1
Overall accuracy: 0.81 and MRR: 0.86


**2f.** Please compute now the accuracy and MRR and the total time for the entire `questions-words.txt` file.  Is the timing compatible with your estimate from (2b)?  What do you think about the difference between accuracy and MRR? 

In [18]:
# Please write your Python code below and execute it.
start_time = time.time()
analogy_scores = my_evaluate_word_analogies(wv_model, datapath('questions-words.txt'))
end_time = time.time()
actual_time = end_time - start_time

In [20]:
# Please write your answer here.
all_analogies = len(analogy_scores[1][-1].get('correct')) + len(analogy_scores[1][-1].get('incorrect'))

print(f'It took the algorithm {(actual_time/60):.2f} minutes to solve all {all_analogies:,} analogies')
print(f'With an accuracy of {analogy_scores[0]:.2f} and MRR of {myMRR(analogy_scores[1][-1]):.2f}')
print()
print(f'1. No, our estimate was off by {(estimated_time - actual_time):.2f} seconds. We estimated that the algorithm would take {estimated_time:.2f} seconds. However, the algorithm only took {actual_time:.2f} seconds to process all the analogies.')
print('2. Accuracy is a strict metric that only considers the top prediction as correct, whereas MRR also considers the rank if the correct answer is in the top 5 predictions (in our case), giving higher scores to answers that are closer to the top.')

It took the algorithm 3.67 minutes to solve all 19,330 analogies
With an accuracy of 0.74 and MRR of 0.79

1. No, our estimate was off by 658.22 seconds. We estimated that the algorithm would take 878.64 seconds. However, the algorithm only took 220.43 seconds to process all the analogies.
2. Accuracy is a strict metric that only considers the top prediction as correct, whereas MRR also considers the rank if the correct answer is in the top 5 predictions (in our case), giving higher scores to answers that are closer to the top.


## End of AdvNLP Lab 4
Please submit your lab report as a .ipynb file after you have fully run and checked it in Google Colab; then upload it to Moodle.
Please submit one notebook per group only and do not forget to put the last names of all team members in the filename.