![MSE Logo](https://moodle.msengineering.ch/pluginfile.php/1/core_admin/logo/0x150/1643104191/logo-mse.png)

# AdvNLP Lab 4 GRADED: Testing a pretrained word2vec model on analogy tasks

**Objectives:**  experiment with *word vectors* from word2vec: test them on analogy tasks; use *accuracy and MRR* (Mean Reciprocal Rank) scores.

**Useful documentation:** the [section on KeyedVectors in Gensim](https://radimrehurek.com/gensim/models/keyedvectors.html) and possibly the [section on word2vec](https://radimrehurek.com/gensim/models/word2vec.html).

## 1. Word2vec model trained on Google News
**1a.** Please install the latest version of Gensim, preferably in a Conda environment. 

In [None]:
pip install gensim

In [1]:
# !pip install --upgrade gensim
# You can run the following verification:
!pip show gensim

Name: gensim
Version: 4.3.3
Summary: Python framework for fast Vector Space Modelling
Home-page: https://radimrehurek.com/gensim/
Author: Radim Rehurek
Author-email: me@radimrehurek.com
License: LGPL-2.1-only
Location: /home/shilpi/Documents/sem3/Adv_NLP/.venv/lib/python3.12/site-packages
Requires: numpy, scipy, smart-open
Required-by: 


In [None]:
import gensim, os, random
from gensim import downloader
from gensim.test.utils import datapath
from gensim.models import KeyedVectors
from gensim import utils
# help(gensim.models.word2vec) # take a look if needed
import time
import itertools

import psutil

**1b.** Please download from Gensim the `word2vec-google-news-300` model, upon your first use.  Then, please write code to answer the following questions:
* Where is the model stored on your computer and what is the file name?  You can store the absolute path in a variable called `path_to_model_file`.
* What is the size of the corresponding file?  Please display the size in gigabytes with two decimals.

In [None]:
pip install torch


In [6]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Optionally, if a GPU is available, print its name:
if device.type == 'cuda':
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")


Using device: cuda
GPU Name: NVIDIA TITAN RTX


In [3]:
# Download the model from Gensim (needed only the first time)
# gensim.downloader.load("word2vec-google-news-300")
# No need to store the returned value (uses a lot of memory).

In [3]:
# Please write your Python code below and execute it.
model = gensim.downloader.load("word2vec-google-news-300")



In [27]:
# store the path in variable
path_to_model_file = '/home/shilpi/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz'

In [29]:
# Get the file size in bytes
file_size_bytes  = os.path.getsize(path_to_model_file)
print(file_size_bytes)

1743563840


In [30]:

# Convert bytes to gigabytes (1 GB = 1024^3 bytes)
file_size_gb = file_size_bytes / (1024 ** 3)

# Print the file size in gigabytes with two decimal places
print(f"Model file size: {file_size_gb:.2f} GB")

Model file size: 1.62 GB


**1c.** Please load the word2vec model as an instance of the class `KeyedVectors`, and store it in a variable called `wv_model`. 
What is, at this point, the memory size of the process corresponding to this notebook?  Simply write the value you obtain from any OS-specific utility that you like.

In [None]:
# Please write your Python code below and execute it.  Write the memory size on a commented line.
# Load the Word2Vec model
wv_model = KeyedVectors.load_word2vec_format(
    path_to_model_file, 
    binary=True
)

print("Model loaded successfully!")

process = psutil.Process(os.getpid())  # Get current process info
memory_usage_gb = process.memory_info().rss / (1024 ** 3)  # Convert bytes to GB

print(f"Memory used by this notebook: {memory_usage_gb:.2f} GB")


Model loaded successfully!
Memory used by this notebook: 11.59 GB


**1d.** Please write the instructions that generate the answers to the following questions.
* What is the size of the vocabulary of the `wv_model` model?  
* What is the dimensionality of each word vector?  
* What is the word corresponding to the vector in position 1234?  
* What are the first 10 coefficients of the word vector for the word *pyramid*?  

In [36]:
# Please write your Python code below and execute it.
# size of the vocabulary of the `wv_model` model

vocab_size = len(wv_model)
print(f"Vocabulary size: {vocab_size}")


Vocabulary size: 3000000


In [37]:
# the dimensionality of each word vector

vector_dim = wv_model.vector_size
print(f"Dimensionality of each word vector: {vector_dim}")


Dimensionality of each word vector: 300


In [38]:
# the word corresponding to the vector in position 1234
word_at_1234 = wv_model.index_to_key[1234]
print(f"Word at position 1234: {word_at_1234}")


Word at position 1234: learn


In [39]:
# the first 10 coefficients of the word vector for the word *pyramid*

word = "pyramid"

if word in wv_model:
    vector = wv_model[word]  # Get the word vector
    first_10_coefficients = vector[:10]  # Extract the first 10 coefficients
    print(f"First 10 coefficients of the word vector for '{word}':\n{first_10_coefficients}")
else:
    print(f"'{word}' is not in the vocabulary.")


First 10 coefficients of the word vector for 'pyramid':
[ 0.00402832 -0.00260925  0.04296875  0.19433594 -0.03979492 -0.06445312
  0.42773438 -0.18359375 -0.27148438 -0.12890625]


## 2. Solving analogies using word2vec trained on Google News
In this section, you are going to use word vectors to solve analogy tasks provided with Gensim, such as "What is to France what Rome is to Italy?".  The predefined function in Gensim that evaluates a model on this task does not provide enough details, so you will need to make modifications to it.

**2a.** The analogy tasks are stored in a text file called `questions-words.txt` which is typically found in `C:\Users\YourNameHere\.conda\envs\YourEnvNameHere\Lib\site-packages\gensim\test\test_data`.  You can access it from here with Gensim as `datapath('questions-words.txt')`.  

Please create a file called `questions-words-100.txt` with the first 100 lines from the original file.  Please run the evaluation task on this file, using the [documentation of the KeyedVectors class](https://radimrehurek.com/gensim/models/keyedvectors.html), then answer the following questions:
* How many analogy tasks are there in your `questions-words-100.txt` file?
* How many analogies were solved correctly and how many incorrectly?
* What is the accuracy returned by `evaluate_word_analogies`?
* How much time did it take to solve the analogies?

In [41]:
# Please write your Python code below and execute it.
# to save first 100 line in other file

# Define file paths
input_file = "/home/shilpi/Documents/sem3/Adv_NLP/.venv/lib/python3.12/site-packages/gensim/test/test_data/questions-words.txt"
output_file = "questions_words-100.txt"  # New file to save the output

# Read and save the first 100 lines
with open(input_file, "r") as infile, open(output_file, "w") as outfile:
    for _ in range(100):
        line = infile.readline()
        if not line:
            break  # Stop if there are fewer than 100 lines
        outfile.write(line)  # Write the line to the new file

print(f"First 100 lines saved to: {output_file}")



First 100 lines saved to: questions_words-100.txt


In [55]:
# evaluations task 
# Record start time
start_time = time.time()

analogy_scores, sections = wv_model.evaluate_word_analogies('questions_words-100.txt')

# Record end time
end_time = time.time()

# accuracy returened by evalute_word_analogies
analogy_scores


0.8080808080808081

In [52]:
sections

[{'section': 'capital-common-countries',
  'correct': [('ATHENS', 'GREECE', 'BANGKOK', 'THAILAND'),
   ('ATHENS', 'GREECE', 'BEIJING', 'CHINA'),
   ('ATHENS', 'GREECE', 'BERLIN', 'GERMANY'),
   ('ATHENS', 'GREECE', 'BERN', 'SWITZERLAND'),
   ('ATHENS', 'GREECE', 'CAIRO', 'EGYPT'),
   ('ATHENS', 'GREECE', 'CANBERRA', 'AUSTRALIA'),
   ('ATHENS', 'GREECE', 'HAVANA', 'CUBA'),
   ('ATHENS', 'GREECE', 'HELSINKI', 'FINLAND'),
   ('ATHENS', 'GREECE', 'ISLAMABAD', 'PAKISTAN'),
   ('ATHENS', 'GREECE', 'MADRID', 'SPAIN'),
   ('ATHENS', 'GREECE', 'MOSCOW', 'RUSSIA'),
   ('ATHENS', 'GREECE', 'OSLO', 'NORWAY'),
   ('ATHENS', 'GREECE', 'OTTAWA', 'CANADA'),
   ('ATHENS', 'GREECE', 'PARIS', 'FRANCE'),
   ('ATHENS', 'GREECE', 'ROME', 'ITALY'),
   ('ATHENS', 'GREECE', 'STOCKHOLM', 'SWEDEN'),
   ('ATHENS', 'GREECE', 'TEHRAN', 'IRAN'),
   ('ATHENS', 'GREECE', 'TOKYO', 'JAPAN'),
   ('BAGHDAD', 'IRAQ', 'BANGKOK', 'THAILAND'),
   ('BAGHDAD', 'IRAQ', 'BEIJING', 'CHINA'),
   ('BAGHDAD', 'IRAQ', 'BERLIN', 'GERMA

In [56]:
# Calculate the time taken
time_taken = end_time - start_time

print(f"Time taken to solve the analogies: {time_taken:.2f} seconds")

Time taken to solve the analogies: 2.37 seconds


In [60]:

# Count analogy tasks
num_tasks = 0

with open(output_file, "r") as file:
    for line in file:
        if not line.startswith(":"):  # Ignore category headers
            num_tasks += 1

print(f"Total number of analogy tasks: {num_tasks}")


Total number of analogy tasks: 99


In [51]:
# Count correct and incorrect analogies
correct = sum(len(section["correct"]) for section in sections)
incorrect = sum(len(section["incorrect"]) for section in sections)

print(f"Correctly solved analogies: {correct}")
print(f"Incorrectly solved analogies: {incorrect}")


Correctly solved analogies: 160
Incorrectly solved analogies: 38


**2b.** Please answer in writing the following questions:
* What is the meaning of the first line of `questions-words-100.txt`?
* How many analogies are there in the original `questions-words.txt`?
* How much time would it take to solve the original set of analogies?

### What is the meaning of the first line of `questions-words-100.txt`?

The first line of the questions-words-100.txt file is typically a category header that indicates the type of analogy questions that follow.  
': capital-common-countries' refers to a category of analogy tasks where the analogy is between capital cities of countries.  

Example analogy: "Paris is to France as Berlin is to Germany."

In [59]:
# Please write your answers here.
# analogies in the original `questions-words.txt`

# Count analogy tasks
num_tasks_original = 0

with open(input_file, "r") as file:
    for line in file:
        if not line.startswith(":"):  # Ignore category headers
            num_tasks_original += 1

print(f"Total number of analogy tasks in original file: {num_tasks_original}")


Total number of analogy tasks in original file: 19544


In [72]:
# evaluations task on original file
# Record start time
start_time_og = time.time()

analogy_scores_og, sections_og = wv_model.evaluate_word_analogies(input_file)

# Record end time
end_time_og = time.time()

# accuracy returened by evalute_word_analogies
print(f'Analogy score: {analogy_scores_og:.4f}')


Analogy score: 0.7401


In [64]:
# Calculate the time taken
time_taken_og = end_time_og - start_time_og

print(f"Time taken to solve the analogies: {time_taken_og/60:.2f} minutes")

Time taken to solve the analogies: 6.83 minutes


**2c.** The built-in function from Gensim has several weaknesses, which you will address here.  Please copy the source code of the function `evaluate_word_analogies` from the file `gensim\models\keyedvectors.py` and create here a new function which will improve the built-in one as follows.  The function will be called `my_evaluate_word_analogies` and you will also pass it the model as the first argument.  Overall, please proceed gradually and only make minimal modifications, to ensure you don't break the function.  It is important to first understand the structure of the result, `analogies_scores` and `sections`. 

* Modify the line where `section[incorrect]` is assembled in order to also add to each analogy the *incorrect guess* (i.e. what the model thought was the good answer, but got it wrong).

* Modify the code so that when `section[incorrect]` is assembled, you also add the *rank of the correct answer* among the candidates returned by the system (after the incorrect guess).  If the correct answer is not present at all, then code the rank as 0.

In [66]:
# def my_evaluate_word_analogies(model, analogies, restrict_vocab=300000, case_insensitive=True):
def my_evaluate_word_analogies(
            wv_model, analogies, restrict_vocab=300000, case_insensitive=True):
    
    """
    analogies : str
        Path to file, where lines are 4-tuples of words, split into sections by ": SECTION NAME" lines.
        See `gensim/test/test_data/questions-words.txt` as example.
    restrict_vocab : int, optional
        Ignore all 4-tuples containing a word not in the first `restrict_vocab` words.
        This may be meaningful if you've sorted the model vocabulary by descending frequency (which is standard
        in modern word embedding models).
    case_insensitive : bool, optional
        If True - convert all words to their uppercase form before evaluating the performance.
        Useful to handle case-mismatch between training tokens and words in the test set.
        In case of multiple case variants of a single word, the vector for the first occurrence
        (also the most frequent if vocabulary is sorted) is taken.
    

    Returns
    -------
    score : float
        The overall evaluation score on the entire evaluation set
        sections : list of dict of {str : str or list of tuple of (str, str, str, str)}
        Results broken down by each section of the evaluation set. Each dict contains the name of the section
        under the key 'section', and lists of correctly and incorrectly predicted 4-tuples of words under the
        keys 'correct' and 'incorrect'.

    """
    ok_keys = wv_model.index_to_key[:restrict_vocab]

    if case_insensitive:
        ok_vocab = {k.upper(): wv_model.get_index(k) for k in reversed(ok_keys)}
    else:
        ok_vocab = {k: wv_model.get_index(k) for k in reversed(ok_keys)}
    oov = 0

    sections, section = [], None
    quadruplets_no = 0

    with utils.open(analogies, 'rb') as fin:
        for line_no, line in enumerate(fin):
            line = utils.to_unicode(line)

            if line.startswith(': '):
                if section:
                    sections.append(section)
                section = {'section': line.lstrip(': ').strip(), 'correct': [], 'incorrect': []}
            else:
                if not section:
                    raise ValueError("Missing section header before line #%i in %s" % (line_no, analogies))
                
                try:
                    if case_insensitive:
                        a, b, c, expected = [word.upper() for word in line.split()]
                    else:
                        a, b, c, expected = [word for word in line.split()]
                except ValueError:
                    continue

                quadruplets_no += 1
                if a not in ok_vocab or b not in ok_vocab or c not in ok_vocab or expected not in ok_vocab:
                    oov += 1
                    section['incorrect'].append((a, b, c, expected))
                    continue

                original_key_to_index = wv_model.key_to_index
                wv_model.key_to_index = ok_vocab

                ignore = {a, b, c}  # input words to be ignored
                predicted = None
                rank = 0

                sims = wv_model.most_similar(positive=[b, c], negative=[a], topn=5, restrict_vocab=restrict_vocab)
                wv_model.key_to_index = original_key_to_index

                for i, element in enumerate(sims, start=1):
                    candidate = element[0].upper() if case_insensitive else element[0]
                    if candidate in ok_vocab and candidate not in ignore:
                        predicted = candidate
                        if predicted == expected:
                            rank = i
                        break

                if predicted == expected:
                    section['correct'].append((a, b, c, expected))
                else:
                    correct_rank = next((i for i, elem in enumerate(sims, start=1) if elem[0] == expected), 0)
                    section['incorrect'].append((a, b, c, expected, predicted, correct_rank))
    if section:
        # store the last section, too
        sections.append(section)

    total = {
        'section': 'Total accuracy',
        'correct': list(itertools.chain.from_iterable(s['correct'] for s in sections)),
        'incorrect': list(itertools.chain.from_iterable(s['incorrect'] for s in sections)),
    }

    oov_ratio = float(oov) / quadruplets_no * 100 if quadruplets_no > 0 else 0

    analogies_score = len(total['correct']) / (len(total['correct']) + len(total['incorrect'])) if (len(total['correct']) + len(total['incorrect'])) > 0 else 0
   
    sections.append(total)
    # Return the overall score and the full lists of correct and incorrect analogies
    return analogies_score, sections

**2d.** Please run the `my_evaluate_word_analogies` function on `questions-words-100.txt` and then write instructions to display, from the results stored in `analogy_scores`:
* one incorrectly-solved analogy (selected at random), including also the error made by the model and the rank of the correct answer, thus adding:
  - a fifth word, which is the incorrect one found by the model
  - a sixth term, which is the integer indicating the rank (or 0)
* one correctly-solved analogy selected at random (in principle, four terms).

In [74]:
# Please write your Python code below and execute it.
# running my_evaluate_word_analogies function on output text file

scores, results = my_evaluate_word_analogies(wv_model, output_file)

# Print results
print(f"Analogy Score: {scores:.4f}")
print(f"Total correct: {len(results[-1]['correct'])}")
print(f"Total incorrect: {len(results[-1]['incorrect'])}")

Analogy Score: 0.8081
Total correct: 80
Total incorrect: 19


In [77]:
""" one incorrectly-solved analogy (selected at random), including also the error made by the model and the rank of the correct answer, thus adding:
  - a fifth word, which is the incorrect one found by the model
  - a sixth term, which is the integer indicating the rank (or 0) """

# Extract incorrect analogies
incorrect_analogies = results[-1]['incorrect']

# Select one randomly (if any exist)
if incorrect_analogies:
    random_incorrect = random.choice(incorrect_analogies)
    print(f"Incorrect Analogy: {random_incorrect[0]}, {random_incorrect[1]}, {random_incorrect[2]} → {random_incorrect[3]} (Predicted: {random_incorrect[4]}, Rank: {random_incorrect[5]})")
else:
    print("No incorrect analogies found.")


Incorrect Analogy: BAGHDAD, IRAQ, OTTAWA → CANADA (Predicted: PRIME_MINISTER_JEAN_CHRÉTIEN, Rank: 0)


In [79]:
# Extract one correctly-solved analogy selected at random (in principle, four terms).
correct_analogies = results[-1]['correct']

# Select one randomly (if any exist)
if correct_analogies:
    random_correct = random.choice(correct_analogies)
    print(f"Correct Analogy: {random_correct[0]}, {random_correct[1]}, {random_correct[2]} → {random_correct[3]}")
else:
    print("No correct analogies found.")


Correct Analogy: BANGKOK, THAILAND, PARIS → FRANCE


**2e.** Please write a function to compute the MRR score given a structure with correctly and incorrectly solved analogies, such as the one that is found in the results from `evaluate_word_analogies`.  The structure is not divided into categories.

The Mean Reciprocal Rank (please use the [formula here](https://en.wikipedia.org/wiki/Mean_reciprocal_rank)) gives some credit for incorrectly solved analogies, in inverse proportion to the rank of the correct answer among the candidates.  This rank is 1 for correctly solved analogies (full credit), and 1/k (or 0) for incorrectly solved ones.

In [80]:
# Please define here the function that computes MRR from the information stored in analogy_scores
def myMRR(analogies):
    """
    Compute the Mean Reciprocal Rank (MRR) from the analogy evaluation results.

    Parameters:
    - analogies: A dictionary containing 'correct' and 'incorrect' analogies.

    Returns:
    - MRR score (float)
    """
    ranks = []

    # Process correct analogies (full credit, rank = 1)
    for analogy in analogies['correct']:
        ranks.append(1.0)

    # Process incorrect analogies (credit depends on rank)
    for analogy in analogies['incorrect']:
        rank = analogy[5]  # The rank of the correct answer
        ranks.append(1.0 / rank if rank > 0 else 0.0)

    # Compute MRR
    return sum(ranks) / len(ranks) if ranks else 0.0  # Avoid division by zero


In [84]:
analogy_scores = (scores, results)

In [85]:
# Please test your MRR function by running the following code, which  displays the total number of analogy tasks, 
# the number of different categories (sections), the accuracy of the results (total number of correctly 
# solved analogies), and the MRR score of the results:
print("Total number of analogies:",  # The last dictionary is the total
      len(analogy_scores[1][-1]['correct']) + 
      len(analogy_scores[1][-1]['incorrect']))
print("Total number of categories:", len(analogy_scores[1]) - 1) # the "total" is excluded 
print(f"Overall accuracy: {analogy_scores[0]:.2f} and MRR: {myMRR(analogy_scores[1][-1]):.2f}")

Total number of analogies: 99
Total number of categories: 1
Overall accuracy: 0.81 and MRR: 0.81


**2f.** Please compute now the accuracy and MRR and the total time for the entire `questions-words.txt` file.  Is the timing compatible with your estimate from (2b)?  What do you think about the difference between accuracy and MRR? 

In [91]:
# Please write your Python code below and execute it.

# Function to compute Mean Reciprocal Rank (MRR)
def compute_mrr(results):
    reciprocal_ranks = []
    for analogy in results['correct']:
        reciprocal_ranks.append(1)  # Correct answers have rank 1
        
    # Handle incorrectly solved analogies
    for analogy in results['incorrect']:
        if len(analogy) >= 6:
            rank = analogy[5] # Extract the rank of the correct answer
            reciprocal_ranks.append(1 / rank if rank > 0 else 0)
        else:
            reciprocal_ranks.append(0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0

In [92]:

#Start timing
start_time_og_mrr = time.time()

# Run evaluation on the full analogy dataset
analogy_scores_og_mrr = my_evaluate_word_analogies(wv_model, input_file)

# End timing
end_time_og_mrr = time.time()
total_time_og_mrr = end_time_og_mrr - start_time_og_mrr  # Compute total execution time

# Extract accuracy and MRR
accuracy_score_og_mrr, sections_og_mrr = analogy_scores_og_mrr
total_section_og_mrr = sections_og_mrr[-1]  # The last section contains total results
mrr_score = compute_mrr(total_section_og_mrr)

# Print results
print("Total number of analogies:", len(total_section_og_mrr['correct']) + len(total_section_og_mrr['incorrect']))
print("Total number of categories:", len(sections_og_mrr) - 1)  # Excluding the "Total accuracy" section
print(f"Overall accuracy: {accuracy_score_og_mrr:.4f}")
print(f"MRR Score: {mrr_score:.4f}")
print(f"Total time taken: {total_time_og_mrr:.2f} seconds")


Total number of analogies: 19544
Total number of categories: 14
Overall accuracy: 0.7320
MRR Score: 0.7320
Total time taken: 412.27 seconds


In [94]:
print(f'total time taken: {total_time_og_mrr/60:.2f} minutes') 

total time taken: 6.87 minutes


The total time for 'questions-words.txt' when using the function 'evaluate_word_analogies' is 6.83 minutes,  
while total time for same original file when using function 'my_evaluate_word_analogies' and using MRR score is 6.87 minutes. Hence there is not much difference.  
Overall accuracy: 0.7320 and MRR Score: 0.7320 are also same so no difference.

## End of AdvNLP Lab 4
Please submit your lab report as a .ipynb file after you have fully run and checked it in Google Colab; then upload it to Moodle.
Please submit one notebook per group only and do not forget to put the last names of all team members in the filename.