<a href="https://colab.research.google.com/github/Saputoa21/ADS_2024_Saputoa/blob/master/exercises/HomeExercise2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Home Exercise 2: Word Embeddings
In this second home exercise, you will use the knowledge from Tutorial 3 to perform a more systematic evaluation of embeddings based on a small analogy dataset.

In this notebook, complete every instruction marked with 👋 ⚒, either in the code cell that follows the sign or, where an analysis is requested, in the text cell that follows it.

## **Word2Vec Analogy-based Evaluation**

We first need to load the pretrained embeddings and the dataset. The dataset can be found on [GitHub](https://github.com/dgromann/cl_intro_ws2024/blob/main/exercises/HomeExercise2.txt) and will be loaded directly from there.

In [213]:
!wget https://github.com/dgromann/cl_intro_ws2024/raw/main/word2vec_embeddings.bin
!wget https://raw.githubusercontent.com/dgromann/cl_intro_ws2024/master/exercises/HomeExercise2.txt

--2024-11-24 17:58:34--  https://github.com/dgromann/cl_intro_ws2024/raw/main/word2vec_embeddings.bin
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dgromann/cl_intro_ws2024/main/word2vec_embeddings.bin [following]
--2024-11-24 17:58:34--  https://raw.githubusercontent.com/dgromann/cl_intro_ws2024/main/word2vec_embeddings.bin
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 96769269 (92M) [application/octet-stream]
Saving to: ‘word2vec_embeddings.bin.5’


2024-11-24 17:58:35 (153 MB/s) - ‘word2vec_embeddings.bin.5’ saved [96769269/96769269]

Then we need to load the model with gensim so that we can access the embeddings.

In [214]:
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format("word2vec_embeddings.bin", binary=True)

And we need to open the HomeExercise2.txt file that contains analogy pairs.

In [215]:
# Read all lines; the with-block closes the file afterwards
with open("HomeExercise2.txt", "r") as analogy:
  analogy_lines = analogy.readlines()

The following code prints the first few lines. The analogies are grouped by category; each category is named on a line starting with a colon : that precedes its analogies. The fourth and last element of each analogy line is the true result against which we will evaluate the embedding model.

In [216]:
line_no = 0
for line in analogy_lines:
  line_no += 1
  print(f"Line number {line_no} with analogy {line}")
  if line_no == 5:
    break

Line number 1 with analogy : capital-common-countries

Line number 2 with analogy Athens Greece Baghdad Iraq

Line number 3 with analogy Athens Greece Berlin Germany

Line number 4 with analogy Athens Greece Cairo Egypt

Line number 5 with analogy Athens Greece Canberra Australia



In [217]:
print(analogy_lines[:3])

[': capital-common-countries\n', 'Athens Greece Baghdad Iraq\n', 'Athens Greece Berlin Germany\n']


👋 ⚒ Systematically evaluate this simple word embedding model based on the entire analogy dataset. To do this:


*   Use the analogy function from Tutorial 3 to obtain 'd'
*   Compare 'd' with the true result from the `HomeExercise2.txt` file
*   Calculate the accuracy for all analogies (how many times out of all attempts did the embedding model provide the correct result)
*   Calculate the accuracy for each analogy category separately

When parsing the file, note that the lines starting with a colon : name the analogy categories and are not analogies themselves.


In [218]:
# Your code here
# Example: Athens is to Greece as Baghdad is to ?
# True result from file: Iraq
# Model result: Iraqi (see output below), so this example is not an exact match

def analogy_function(a, b, c):
  result = model.most_similar(positive=[b,c], negative=[a])
  return result[0][0]

print(analogy_function("Athens", "Greece", "Baghdad"))

Iraqi
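To make the arithmetic behind `most_similar` more concrete, here is a minimal sketch on hand-made 2-d vectors. It assumes the usual additive formulation of the analogy task: normalize the vectors, compute b - a + c, and return the nearest remaining word by cosine similarity while leaving out the three query words. The names `unit`, `toy_analogy`, and `toy_vectors` and all vector values are illustrative assumptions, not part of gensim or Tutorial 3.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1 so that dot products are cosine similarities."""
    return v / np.linalg.norm(v)

def toy_analogy(vectors, a, b, c):
    """Return the word closest to b - a + c, excluding the three query words."""
    target = unit(unit(vectors[b]) - unit(vectors[a]) + unit(vectors[c]))
    scores = {
        word: float(np.dot(unit(vec), target))
        for word, vec in vectors.items()
        if word not in (a, b, c)  # query words are never returned as the answer
    }
    return max(scores, key=scores.get)

# Hypothetical 2-d vectors chosen by hand, only for illustration:
# dimension 0 loosely means "is a country", dimension 1 loosely means "region".
toy_vectors = {
    "Athens":  np.array([0.0, 1.0]),
    "Greece":  np.array([1.0, 1.0]),
    "Baghdad": np.array([0.0, -1.0]),
    "Iraq":    np.array([1.0, -1.0]),
    "Egypt":   np.array([1.0, 0.0]),
}

print(toy_analogy(toy_vectors, "Athens", "Greece", "Baghdad"))
```

On the toy data this prints `Iraq`. With real embeddings, close neighbours such as inflected or derived forms compete for the top position, which is one reason why exact-match accuracy can be strict.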


In [219]:
# counting all lines in the file (the category header lines are included)
total_lines = len(analogy_lines)
print(total_lines)

9009


In [220]:
# selecting and counting categories (header lines start with a colon)
categories_list = []
for line in analogy_lines:
  if line.startswith(':'):
    categories_list.append(line)

print(categories_list)
print(len(categories_list))

[': capital-common-countries\n', ': capital-world\n', ': currency\n', ': city-in-state\n', ': family\n', ': gram1-adjective-to-adverb\n', ': gram2-opposite\n', ': gram3-comparative\n', ': gram4-superlative\n', ': gram5-present-participle\n', ': gram6-nationality-adjective\n', ': gram7-past-tense\n', ': gram8-plural\n', ': gram9-plural-verbs\n']
14


In [221]:
def parse_analogies(analogies_list):
    categories = {}
    current_category = None
    for line in analogies_list:
        line = line.strip()
        if line.startswith(":"):  # Category marker
            current_category = line[2:]
            categories[current_category] = []
        else:
            parts = line.split()
            if len(parts) == 4:  # Ensure the analogy format is correct
                categories[current_category].append(parts)
    return categories

parsed_analogies = parse_analogies(analogy_lines)
print(parsed_analogies)

{'capital-common-countries': [['Athens', 'Greece', 'Baghdad', 'Iraq'], ['Athens', 'Greece', 'Berlin', 'Germany'], ['Athens', 'Greece', 'Cairo', 'Egypt'], ['Athens', 'Greece', 'Canberra', 'Australia'], ['Athens', 'Greece', 'Helsinki', 'Finland'], ['Athens', 'Greece', 'London', 'England'], ['Athens', 'Greece', 'Madrid', 'Spain'], ['Athens', 'Greece', 'Moscow', 'Russia'], ['Athens', 'Greece', 'Ottawa', 'Canada'], ['Athens', 'Greece', 'Paris', 'France'], ['Athens', 'Greece', 'Rome', 'Italy'], ['Athens', 'Greece', 'Stockholm', 'Sweden'], ['Athens', 'Greece', 'Tehran', 'Iran'], ['Athens', 'Greece', 'Tokyo', 'Japan'], ['Baghdad', 'Iraq', 'Berlin', 'Germany'], ['Baghdad', 'Iraq', 'Cairo', 'Egypt'], ['Baghdad', 'Iraq', 'Canberra', 'Australia'], ['Baghdad', 'Iraq', 'Helsinki', 'Finland'], ['Baghdad', 'Iraq', 'London', 'England'], ['Baghdad', 'Iraq', 'Madrid', 'Spain'], ['Baghdad', 'Iraq', 'Moscow', 'Russia'], ['Baghdad', 'Iraq', 'Ottawa', 'Canada'], ['Baghdad', 'Iraq', 'Paris', 'France'], ['Bagh

In [222]:
# checking whether the categorisation was done correctly
for key, value in parsed_analogies.items():
    print(f"{key}: {value}")

capital-common-countries: [['Athens', 'Greece', 'Baghdad', 'Iraq'], ['Athens', 'Greece', 'Berlin', 'Germany'], ['Athens', 'Greece', 'Cairo', 'Egypt'], ['Athens', 'Greece', 'Canberra', 'Australia'], ['Athens', 'Greece', 'Helsinki', 'Finland'], ['Athens', 'Greece', 'London', 'England'], ['Athens', 'Greece', 'Madrid', 'Spain'], ['Athens', 'Greece', 'Moscow', 'Russia'], ['Athens', 'Greece', 'Ottawa', 'Canada'], ['Athens', 'Greece', 'Paris', 'France'], ['Athens', 'Greece', 'Rome', 'Italy'], ['Athens', 'Greece', 'Stockholm', 'Sweden'], ['Athens', 'Greece', 'Tehran', 'Iran'], ['Athens', 'Greece', 'Tokyo', 'Japan'], ['Baghdad', 'Iraq', 'Berlin', 'Germany'], ['Baghdad', 'Iraq', 'Cairo', 'Egypt'], ['Baghdad', 'Iraq', 'Canberra', 'Australia'], ['Baghdad', 'Iraq', 'Helsinki', 'Finland'], ['Baghdad', 'Iraq', 'London', 'England'], ['Baghdad', 'Iraq', 'Madrid', 'Spain'], ['Baghdad', 'Iraq', 'Moscow', 'Russia'], ['Baghdad', 'Iraq', 'Ottawa', 'Canada'], ['Baghdad', 'Iraq', 'Paris', 'France'], ['Baghdad

In [223]:
def calucalte_accuracy(categories, analogy_function):
    # Returns (overall accuracy, {category: (correct, attempts, accuracy)})
    total_correct = 0
    total_attempts = 0
    category_accuracies = {}
    for category, analogies in categories.items():
        correct = 0
        attempts = len(analogies)
        total_attempts += attempts
        for analogy in analogies:
            a, b, c, true_d = analogy
            predicted_d = analogy_function(a, b, c)
            # Exact string match against the top-1 prediction only
            if predicted_d == true_d:
                correct += 1
        total_correct += correct
        category_accuracies[category] = (correct, attempts, correct / attempts if attempts > 0 else 0)
    # Overall accuracy pools all analogies, so large categories weigh more
    overall_accuracy = total_correct / total_attempts if total_attempts > 0 else 0
    return overall_accuracy, category_accuracies

accuracy = calucalte_accuracy(parsed_analogies, analogy_function)
print(accuracy)

(0.7545302946081156, {'capital-common-countries': (183, 210, 0.8714285714285714), 'capital-world': (283, 316, 0.8955696202531646), 'currency': (0, 10, 0.0), 'city-in-state': (862, 1095, 0.7872146118721461), 'family': (169, 182, 0.9285714285714286), 'gram1-adjective-to-adverb': (246, 812, 0.30295566502463056), 'gram2-opposite': (271, 506, 0.5355731225296443), 'gram3-comparative': (1215, 1332, 0.9121621621621622), 'gram4-superlative': (239, 272, 0.8786764705882353), 'gram5-present-participle': (769, 992, 0.7752016129032258), 'gram6-nationality-adjective': (608, 634, 0.9589905362776026), 'gram7-past-tense': (921, 1332, 0.6914414414414415), 'gram8-plural': (613, 702, 0.8732193732193733), 'gram9-plural-verbs': (408, 600, 0.68)})
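The raw tuple above is hard to read, so the following sketch turns it into a ranked summary. It copies the `(correct, attempts)` counts from the output above and reports two overall figures: the pooled (micro) accuracy, which is what the function above returns, and the unweighted mean over categories (macro), which treats the 10 currency analogies like any larger category. The names `category_counts` and `summarize` are illustrative.

```python
# (correct, attempts) per category, copied from the output above
category_counts = {
    "capital-common-countries": (183, 210),
    "capital-world": (283, 316),
    "currency": (0, 10),
    "city-in-state": (862, 1095),
    "family": (169, 182),
    "gram1-adjective-to-adverb": (246, 812),
    "gram2-opposite": (271, 506),
    "gram3-comparative": (1215, 1332),
    "gram4-superlative": (239, 272),
    "gram5-present-participle": (769, 992),
    "gram6-nationality-adjective": (608, 634),
    "gram7-past-tense": (921, 1332),
    "gram8-plural": (613, 702),
    "gram9-plural-verbs": (408, 600),
}

def summarize(counts):
    """Return micro accuracy, macro accuracy, and categories sorted by accuracy."""
    total_correct = sum(correct for correct, _ in counts.values())
    total_attempts = sum(attempts for _, attempts in counts.values())
    micro = total_correct / total_attempts
    per_category = {name: c / n for name, (c, n) in counts.items() if n > 0}
    macro = sum(per_category.values()) / len(per_category)
    ranked = sorted(per_category.items(), key=lambda item: item[1])
    return micro, macro, ranked

micro, macro, ranked = summarize(category_counts)
print(f"micro accuracy: {micro:.4f}")
print(f"macro accuracy: {macro:.4f}")
for name, acc in ranked:  # weakest category first
    correct, attempts = category_counts[name]
    print(f"{name:<30} {correct:>5}/{attempts:<5} {acc:.1%}")
```

Sorting from weakest to strongest puts `currency` and `gram1-adjective-to-adverb` at the top of the list, which is where a closer look at individual predictions is likely to be most informative.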


## **Comparison: GloVe Analogy-based Evaluation**

The next step is to compare this very small word2vec embedding model with a different, still small but more powerful model available in gensim.

All models and corpora available in gensim can be found [here](https://github.com/piskvorky/gensim-data).

Since this model is considerably bigger than the tiny word2vec model, it takes some time to load when you run the following code cell.

In [224]:
import gensim.downloader as api
from gensim.models import KeyedVectors

model_glove = api.load("glove-wiki-gigaword-100")
print(type(model_glove))

<class 'gensim.models.keyedvectors.KeyedVectors'>


The model can then be used in exactly the same way as the word2vec model, since gensim standardizes model access.

In [225]:
model_glove["bread"]

array([-0.66146  ,  0.94335  , -0.72214  ,  0.17403  , -0.42524  ,
        0.36303  ,  1.0135   , -0.14802  ,  0.25817  , -0.20326  ,
       -0.64338  ,  0.16632  ,  0.61518  ,  1.397    , -0.094506 ,
        0.0041843, -0.18976  , -0.55421  , -0.39371  , -0.22501  ,
       -0.34643  ,  0.32076  ,  0.34395  , -0.7034   ,  0.23932  ,
        0.69951  , -0.16461  , -0.31819  , -0.34034  , -0.44906  ,
       -0.069667 ,  0.35348  ,  0.17498  , -0.95057  , -0.2209   ,
        1.0647   ,  0.23231  ,  0.32569  ,  0.47662  , -1.1206   ,
        0.28168  , -0.75172  , -0.54654  , -0.66337  ,  0.34804  ,
       -0.69058  , -0.77092  , -0.40167  , -0.069351 , -0.049238 ,
       -0.39351  ,  0.16735  , -0.14512  ,  1.0083   , -1.0608   ,
       -0.87314  , -0.29339  ,  0.68278  ,  0.61634  , -0.088844 ,
        0.88094  ,  0.099809 , -0.27161  , -0.58026  ,  0.50364  ,
       -0.93814  ,  0.67576  , -0.43124  , -0.10517  , -1.2404   ,
       -0.74353  ,  0.28637  ,  0.29012  ,  0.89377  ,  0.6740

👋 ⚒  Run the same systematic analysis for this gensim model as for the word2vec model above. Which model performs better overall and in specific categories?

In [231]:
vocab = model_glove.key_to_index
print(vocab)



In [235]:
if 'Athens' in vocab:
  print('True')
else:
  print('False')

False


In [237]:
if 'athens' in vocab:
  print('True')
else:
  print('False')

True


In [226]:
print("Length of the vocabulary",len(model_glove.key_to_index))

# Print the embedding of a specific word
print("Embedding for the word good: ", model_glove["good"])

Length of the vocabulary 400000
Embedding for the word good:  [-0.030769   0.11993    0.53909   -0.43696   -0.73937   -0.15345
  0.081126  -0.38559   -0.68797   -0.41632   -0.13183   -0.24922
  0.441      0.085919   0.20871   -0.063582   0.062228  -0.051234
 -0.13398    1.1418     0.036526   0.49029   -0.24567   -0.412
  0.12349    0.41336   -0.48397   -0.54243   -0.27787   -0.26015
 -0.38485    0.78656    0.1023    -0.20712    0.40751    0.32026
 -0.51052    0.48362   -0.0099498 -0.38685    0.034975  -0.167
  0.4237    -0.54164   -0.30323   -0.36983    0.082836  -0.52538
 -0.064531  -1.398     -0.14873   -0.35327   -0.1118     1.0912
  0.095864  -2.8129     0.45238    0.46213    1.6012    -0.20837
 -0.27377    0.71197   -1.0754    -0.046974   0.67479   -0.065839
  0.75824    0.39405    0.15507   -0.64719    0.32796   -0.031748
  0.52899   -0.43886    0.67405    0.42136   -0.11981   -0.21777
 -0.29756   -0.1351     0.59898    0.46529   -0.58258   -0.02323
 -1.5442     0.01901   -0.0158

In [227]:
# Your code here

def analogy_function(a, b, c):
  result = model_glove.most_similar(positive=[b,c], negative=[a])
  return result[0][0]

print(analogy_function("good", "best", "bad"))

worst
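The cell above reuses the name `analogy_function`, so from this point on any call to it queries GloVe, and re-running an earlier word2vec cell would silently evaluate the wrong model. One possible way around this, sketched below, is to build one function per model. `make_analogy_function` and `StubModel` are hypothetical helpers; the stub only mimics the `most_similar(positive=..., negative=..., topn=...)` call shape so that the example runs without downloading any embeddings.

```python
def make_analogy_function(embedding_model, topn=1):
    """Build an analogy function bound to one model instead of a global variable."""
    def analogy(a, b, c):
        result = embedding_model.most_similar(positive=[b, c], negative=[a], topn=topn)
        return result[0][0]  # best-ranked word only
    return analogy

class StubModel:
    """Hypothetical stand-in that always answers with a fixed word."""
    def __init__(self, answer):
        self.answer = answer

    def most_similar(self, positive, negative, topn=10):
        return [(self.answer, 1.0)][:topn]

# In the notebook these would be make_analogy_function(model) and
# make_analogy_function(model_glove); the stubs are placeholders.
analogy_w2v = make_analogy_function(StubModel("Iraqi"))
analogy_glove = make_analogy_function(StubModel("worst"))

print(analogy_w2v("Athens", "Greece", "Baghdad"))
print(analogy_glove("good", "best", "bad"))
```

Both functions can then be passed to the accuracy function side by side, and neither depends on the order in which the cells were run.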


In [228]:
parsed_analogies_glove = {k: v for k, v in parsed_analogies.items() if k != 'capital-common-countries' and k != 'capital-world' and k != 'currency'}
print(parsed_analogies_glove)

{'city-in-state': [['Chicago', 'Illinois', 'Houston', 'Texas'], ['Chicago', 'Illinois', 'Philadelphia', 'Pennsylvania'], ['Chicago', 'Illinois', 'Phoenix', 'Arizona'], ['Chicago', 'Illinois', 'Dallas', 'Texas'], ['Chicago', 'Illinois', 'Indianapolis', 'Indiana'], ['Chicago', 'Illinois', 'Austin', 'Texas'], ['Chicago', 'Illinois', 'Detroit', 'Michigan'], ['Chicago', 'Illinois', 'Memphis', 'Tennessee'], ['Chicago', 'Illinois', 'Boston', 'Massachusetts'], ['Chicago', 'Illinois', 'Seattle', 'Washington'], ['Chicago', 'Illinois', 'Denver', 'Colorado'], ['Chicago', 'Illinois', 'Baltimore', 'Maryland'], ['Chicago', 'Illinois', 'Louisville', 'Kentucky'], ['Chicago', 'Illinois', 'Milwaukee', 'Wisconsin'], ['Chicago', 'Illinois', 'Portland', 'Oregon'], ['Chicago', 'Illinois', 'Tucson', 'Arizona'], ['Chicago', 'Illinois', 'Fresno', 'California'], ['Chicago', 'Illinois', 'Sacramento', 'California'], ['Chicago', 'Illinois', 'Atlanta', 'Georgia'], ['Chicago', 'Illinois', 'Omaha', 'Nebraska'], ['Chic

In [229]:
accuracy_glove = calucalte_accuracy(parsed_analogies_glove, analogy_function)
print(accuracy_glove)

KeyError: "Key 'Illinois' not present in vocabulary"
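The `KeyError` is consistent with the vocabulary checks above: `'Athens'` is missing from the GloVe vocabulary while `'athens'` is present, which suggests the vocabulary is lowercased. Dropping the capital and currency categories therefore does not help, because `'Illinois'` is capitalized as well. A minimal sketch of one possible fix follows: lowercase every word (including the expected answer, so that the comparison still works) and set aside analogies that contain a word missing from the vocabulary. `prepare_for_lowercase_vocab` and the toy data are hypothetical; in the notebook the arguments would be `parsed_analogies` and `model_glove.key_to_index`.

```python
def prepare_for_lowercase_vocab(categories, vocab):
    """Lowercase every analogy and separate usable ones from out-of-vocabulary ones."""
    usable, skipped = {}, {}
    for category, analogies in categories.items():
        usable[category], skipped[category] = [], []
        for analogy in analogies:
            lowered = [word.lower() for word in analogy]
            if all(word in vocab for word in lowered):
                usable[category].append(lowered)
            else:
                skipped[category].append(analogy)  # keep the original for inspection
    return usable, skipped

# Tiny stand-ins for parsed_analogies and model_glove.key_to_index;
# "made-up" and "Zzyzx" are invented to show the skipping behaviour.
toy_categories = {
    "city-in-state": [["Chicago", "Illinois", "Houston", "Texas"]],
    "made-up": [["Chicago", "Illinois", "Zzyzx", "Texas"]],
}
toy_vocab = {"chicago": 0, "illinois": 1, "houston": 2, "texas": 3}

usable, skipped = prepare_for_lowercase_vocab(toy_categories, toy_vocab)
print(usable)
print(skipped)
```

Skipping analogies changes the number of attempts per category, so for a fair comparison it is worth reporting how many were skipped, or counting them as errors instead.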

In [None]:
accuracy = calucalte_accuracy(parsed_analogies, analogy_function)
print(accuracy)

## **Visual Comparison**

As a final step, use the visualization from Tutorial 3 to plot the embeddings of the following words for both models.

👋 ⚒  ❓ Do the clusters (groupings of embeddings) in the GloVe visualization differ substantially from the clusters in the word2vec visualization from Tutorial 3?

In [None]:
import numpy as np

from sklearn.decomposition import PCA
from matplotlib import pyplot as plt

def display_pca_scatterplot(model, words):

    word_vectors = np.array([model[w] for w in words])

    twodim = PCA().fit_transform(word_vectors)[:,:2]

    plt.figure(figsize=(6,6))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.05, y+0.05, word)
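`display_pca_scatterplot` keeps only the first two principal components, so before reading much into the clusters it can help to check how much of the variance those two components retain. The sketch below does the projection with a plain NumPy SVD on random stand-in vectors; `project_2d` is an illustrative name, and the random data is only a placeholder for real word vectors.

```python
import numpy as np

def project_2d(word_vectors):
    """Project row vectors onto their first two principal components via SVD."""
    centered = word_vectors - word_vectors.mean(axis=0)
    # Rows of vt are the principal directions; s holds the singular values
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    explained = (s ** 2) / np.sum(s ** 2)  # share of variance per component
    return coords, explained[:2]

rng = np.random.default_rng(0)
toy = rng.normal(size=(8, 100))  # stand-in for 8 word vectors of dimension 100

coords, explained = project_2d(toy)
print(coords.shape)
print(f"variance kept by two components: {explained.sum():.1%}")
```

The axes may come out mirrored compared with scikit-learn's `PCA`, since the sign of each component is arbitrary; distances between points are unaffected. A low retained share means that apparent clusters in the plot should be interpreted with caution.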

In [None]:
display_pca_scatterplot(model_glove,
                        ['boy', 'girl', 'brother', 'sister',
                        'bad', 'worse', 'big', 'bigger'])

In [None]:
display_pca_scatterplot(model,
                        ['boy', 'girl', 'brother', 'sister',
                        'bad', 'worse', 'big', 'bigger'])

**Provide your answer to the question on the clusters here.**

## **Bias in Embeddings**

Language models and embedding models tend to reflect the biases present in the textual data they were trained on. These biases can be analyzed with embeddings by explicitly testing biased analogies.

For instance, man is to doctor as woman is to ?

The bias here is that professions tend to be assigned a specific gender, e.g. men are doctors and women are nurses.

The same holds for cultural stereotypes, e.g. associating Bratwurst or Sauerkraut with Germany.



In [None]:
result1 = model_glove.most_similar(positive=["doctor", "woman"], negative=["man"], topn=3)
print(f"man is to doctor as woman is to {result1}")
result2 = model_glove.most_similar(positive=["bratwurst", "france"], negative=["germany"], topn=3)
print(f"Germany is to Bratwurst as France is to {result2}")
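One caveat when reading analogy results: `most_similar` leaves the query words out of its answers, so "man is to doctor as woman is to ?" can never return "doctor" itself, even if that were the nearest vector. A complementary check, sketched below under that assumption, compares cosine similarities directly. The vectors are hand-made hypothetical values, not measurements from either model, and `cosine` and `lean` are illustrative names; with the real models, `model_glove.similarity("doctor", "woman")` would supply the similarities.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def lean(vectors, word, pole_a, pole_b):
    """Positive if `word` is closer to pole_a, negative if closer to pole_b."""
    return cosine(vectors[word], vectors[pole_a]) - cosine(vectors[word], vectors[pole_b])

# Hypothetical 2-d vectors, chosen by hand to illustrate the computation only
toy_vectors = {
    "man":    np.array([1.0, 0.0]),
    "woman":  np.array([0.0, 1.0]),
    "doctor": np.array([0.8, 0.6]),
    "nurse":  np.array([0.6, 0.8]),
}

for word in ("doctor", "nurse"):
    print(word, round(lean(toy_vectors, word, "man", "woman"), 3))
```

A value near zero would indicate that the word sits roughly equally close to both poles; the further from zero, the stronger the association in that embedding space.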

👋 ⚒ Try to come up with two biased analogies yourself and test whether the GloVe and word2vec models suffer from this type of bias. Please try to be creative and do not just change woman to girl and man to boy or something similar.

In [None]:
# Test your biased analogies on both models here