<a href="https://colab.research.google.com/github/Saputoa21/ADS_2024_Saputoa/blob/master/Bonus_Exercise_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bonus Exercises 1: Aligning Multilingual Embedding Spaces**



This notebook is the first bonus exercise for the lecture Multilingual and Crosslingual Methods and Language Resources (2024W 340168-1). Each successfully completed bonus exercise earns up to three points, which are added to the points of the final exam. The tasks to be completed in this notebook are marked with 👋 ⚒.



In this notebook, you will perform and evaluate a supervised method for aligning the embedding spaces of two languages. The examples in the notebook rely on the language pair English-German; however, feel free to change this pair to languages of your choice from the available embeddings and dictionaries (see below).

-----------
## **Preparing the Embeddings and Data**

In this notebook, we will be using fastText embeddings, which represent a character-based version of the word2vec skipgram method. Details on the method can be found in the [original publication](https://aclanthology.org/Q17-1010.pdf) and on [this website](https://fasttext.cc/).
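
To make "character-based" more concrete, the following minimal sketch extracts the character n-grams of a word after adding the boundary markers `<` and `>`, as described in the original publication. The function name `char_ngrams` and its parameters are illustrative assumptions and not part of the fastText API; the real implementation additionally keeps the whole word as its own unit and hashes the n-grams into buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of a word, using < and > as boundary markers.

    Hypothetical helper for illustration only; not the fastText implementation.
    """
    marked = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the marked word
        for start in range(len(marked) - n + 1):
            ngrams.append(marked[start:start + n])
    return ngrams

# Trigrams of "where"
print(char_ngrams("where", 3, 3))
# → ['<wh', 'whe', 'her', 'ere', 're>']
```

A word vector is then composed from the vectors of its n-grams, which is why related word forms tend to end up close to each other.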

Pretrained fastText embeddings are available in [157 languages](https://fasttext.cc/docs/en/crawl-vectors.html). The following code cell loads the fastText embeddings for English and German.

👋 ⚒ Please change the following download command if you wish to align other languages than English and German.

Before you decide on a final language pair, please make sure that:
1.   There are pretrained embeddings for this language (see [here](https://fasttext.cc/docs/en/crawl-vectors.html))
2.   There is a bilingual word list available (see the [MUSE GitHub](https://github.com/facebookresearch/MUSE/tree/main) section "Ground-truth bilingual dictionaries")

If the embeddings are available, change the two-letter ISO code in `cc.en.300.vec.gz` and `cc.de.300.vec.gz` to the language(s) of your choice.
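
As a small sketch of this substitution, the download URL can be assembled from the language code. The helper name `fasttext_vectors_url` is an assumption made for illustration; the URL pattern is the one used in the download commands of this notebook.

```python
# Base location of the pretrained crawl vectors (taken from the wget commands in this notebook)
BASE_URL = "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl"

def fasttext_vectors_url(lang_code):
    """Build the download URL for a two-letter language code (hypothetical helper)."""
    return f"{BASE_URL}/cc.{lang_code}.300.vec.gz"

for code in ["en", "de"]:
    print(fasttext_vectors_url(code))
```

Whether a file actually exists for a given code still has to be checked against the list of available languages.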

In [3]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz  # English
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.de.300.vec.gz  # German

--2025-01-22 14:28:03--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 18.238.176.115, 18.238.176.19, 18.238.176.126, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|18.238.176.115|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1325960915 (1.2G) [binary/octet-stream]
Saving to: ‘cc.en.300.vec.gz’


2025-01-22 14:28:14 (116 MB/s) - ‘cc.en.300.vec.gz’ saved [1325960915/1325960915]

--2025-01-22 14:28:14--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.de.300.vec.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 18.238.176.115, 18.238.176.19, 18.238.176.126, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|18.238.176.115|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1278030050 (1.2G) [binary/octet-stream]
Saving to: ‘cc.de.300.vec.gz’


2025-01-22 14:28:29 (83.8 MB/s) - ‘cc.de.300.vec.gz’ sa

### Loading the Embeddings

As a next step, we will unzip and load the embeddings. For this alignment task, we will only use the top 100,000 words of each language to speed up processing. The choice of 100,000 words also depends on the length of the available bilingual word lists.
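
The `.vec` format is plain text: a header line with the vocabulary size and the dimensionality, followed by one word and its vector per line. The sketch below illustrates this on a tiny in-memory file; the toy content and the variable names are assumptions made for illustration, and the real files are far larger.

```python
import gzip
import io
import numpy as np

# Toy file in the .vec layout: header "<vocabulary size> <dimensions>", then one word per line
toy_vec = "3 4\nthe 0.1 0.2 0.3 0.4\nand 0.5 0.1 0.0 0.2\nof 0.3 0.3 0.1 0.9\n"
buffer = io.BytesIO(gzip.compress(toy_vec.encode("utf-8")))

with gzip.open(buffer, "rt", encoding="utf-8") as f:
    # Read the header line first
    n_words, dim = map(int, f.readline().split())
    top_words = {}
    for line in f:
        # Stop early once the desired number of words (here: 2) is loaded
        if len(top_words) >= 2:
            break
        word, *values = line.rstrip().split(" ")
        top_words[word] = np.asarray(values, dtype=np.float32)

print(n_words, dim)
print(list(top_words))
```

Stopping early matters for the real files, since reading all lines of a 1.2 GB archive takes noticeably longer than reading the first 100,000.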

In [4]:
import gzip
import numpy as np

def load_fasttext_embeddings(file_path, top_n):
    embeddings = {}
    with gzip.open(file_path, 'rb') as f:
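        for i, line in enumerate(f):
            # Line 0 is a header line (vocabulary size and dimensionality)
            if i == 0:
                continue
            # Stop reading once the top_n most frequent words are loaded
            if i > top_n:
                break
            tokens = line.decode('utf-8').strip().split(' ')
            word = tokens[0]
            vector = np.array(tokens[1:], dtype=np.float32)
            # Normalize each vector to unit length
            vector = vector / np.linalg.norm(vector)
            embeddings[word] = vector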
    return embeddings

# Load the English and German embeddings for the 100,000 most frequent words
# fastText sorts the embeddings by decreasing word frequency by default
en_embeddings = load_fasttext_embeddings('cc.en.300.vec.gz', 100000)
de_embeddings = load_fasttext_embeddings('cc.de.300.vec.gz', 100000)

print(f"Loaded {len(en_embeddings)} English embeddings")
print(f"Loaded {len(de_embeddings)} German embeddings")


Loaded 100000 English embeddings
Loaded 100000 German embeddings


Let us explore the format of the downloaded and loaded embeddings.

In [5]:
print(f'The loaded embeddings represent a {type(en_embeddings)} datatype.\n')
print(f'Each entry represents the word and the related embedding.\n')
print(f'We can query the word as a key and obtain the embedding, e.g. for good the embedding is {en_embeddings["good"]}.\n')
print(f'The dimensionality of these embeddings corresponds to {len(en_embeddings["good"])}.')

The loaded embeddings represent a <class 'dict'> datatype.

Each entry represents the word and the related embedding.

We can query the word as a key and obtain the embedding, e.g. for good the embedding is [-0.08404064 -0.05785208  0.00155124  0.1233691  -0.05985956  0.00565746
  0.11506542 -0.01505614  0.01587738 -0.00118624 -0.0886031   0.02126109
  0.00912493  0.00419747  0.01450865  0.0062962   0.07829193 -0.01815862
 -0.0549321  -0.02126109  0.01076742  0.07500695  0.01359615  0.00821244
  0.00638745 -0.05867332  0.03056853 -0.01916236  0.0617758   0.0275573
  0.06569952 -0.05192087 -0.03987596  0.00583996  0.04005846  0.05520585
 -0.00556621 -0.11187168 -0.03221102 -0.02463732 -0.01879736  0.0068437
 -0.0062962   0.03312351 -0.03020353  0.05292461  0.00757369 -0.05785208
 -0.05274212  0.00994618 -0.08440564  0.0142349  -0.03722973  0.00611371
 -0.05812582  0.05365461  0.06579077 -0.04918339 -0.13377152 -0.03695598
 -0.02290358 -0.04516842 -0.04763215 -0.06250579  0.04261344  0.0

### Downloading and Loading the Bilingual Word List

To perform this alignment, we will use a bilingual word list that is provided by the Multilingual Unsupervised and Supervised Embeddings (MUSE) project (see [here](https://github.com/facebookresearch/MUSE/tree/main) for all languages).

👋 ⚒ Please change the following downloading command to the language pair of your choice (as long as available on MUSE).


In [6]:
!wget https://dl.fbaipublicfiles.com/arrival/dictionaries/en-de.txt

--2025-01-22 14:30:04--  https://dl.fbaipublicfiles.com/arrival/dictionaries/en-de.txt
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 3.171.22.118, 3.171.22.33, 3.171.22.68, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|3.171.22.118|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1742131 (1.7M) [text/x-c++]
Saving to: ‘en-de.txt’


2025-01-22 14:30:05 (5.88 MB/s) - ‘en-de.txt’ saved [1742131/1742131]



### Creating a bilingual word list

As a next step, we will create a bilingual word list from the downloaded text file.

👋 ⚒ Create a list of tuples `[(en_word1, de_word1), (en_word2, de_word2),...]` from the downloaded text file in the following code cell. To solve this task, fill in the provided function `load_bilingual_word_list` where it says `Your code here`.

For English-German, the first ten tuples of the list look like this:

```
[('the', 'die'), ('the', 'der'), ('the', 'dem'), ('the', 'den'), ('the', 'das'), ('and', 'sowie'), ('and', 'und'), ('was', 'war'), ('was', 'wurde'), ('for', 'für')]
```
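
Dictionary files may separate the two words with a space or with a tab. A minimal sketch that handles both cases, assuming one word pair per line, could look as follows; the function name `parse_word_pairs` and the sample lines are illustrative assumptions.

```python
def parse_word_pairs(lines):
    """Turn lines of the form 'source target' into (source, target) tuples."""
    pairs = []
    for line in lines:
        # str.split() without arguments splits on any run of whitespace (spaces or tabs)
        tokens = line.split()
        # Skip empty or malformed lines
        if len(tokens) == 2:
            pairs.append((tokens[0], tokens[1]))
    return pairs

sample_lines = ["the die\n", "and\tund\n", "\n"]
print(parse_word_pairs(sample_lines))
# → [('the', 'die'), ('and', 'und')]
```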



In [7]:
'''
Create a list of tuples that contain word translations

Parameters:
Text file with one bilingual word pair per line

Returns:
A list of tuples that each contains one bilingual word pair
'''
def load_bilingual_word_list(file_path):
    bilingual_dict = []
    with open(file_path, 'r', encoding='utf-8') as f:
      # Your code here
      for line in f:
          tokens = str(line).strip().split(' ')
          en_word = tokens[0]
          de_word = tokens[1]
          word_pair = (en_word, de_word)
          bilingual_dict.append(word_pair)
    return bilingual_dict

# Load English-German word pairs
en_de_pairs = load_bilingual_word_list('en-de.txt')

print(en_de_pairs[:10])

[('the', 'die'), ('the', 'der'), ('the', 'dem'), ('the', 'den'), ('the', 'das'), ('and', 'sowie'), ('and', 'und'), ('was', 'war'), ('was', 'wurde'), ('for', 'für')]


### Getting the Embeddings for our Word List

As a next step, we need to see which words from the word list have a vector representation in the embedding space for both languages and create a list of corresponding embeddings for both languages.
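
Since only pairs with a vector in both languages can be used, it may help to check how much of the word list is covered. The following sketch uses toy vocabularies; the function name `dictionary_coverage` and the example data are assumptions made for illustration.

```python
def dictionary_coverage(pairs, src_vocab, tgt_vocab):
    """Return the covered pairs and the fraction of pairs covered by both vocabularies."""
    covered = [(s, t) for s, t in pairs if s in src_vocab and t in tgt_vocab]
    ratio = len(covered) / len(pairs) if pairs else 0.0
    return covered, ratio

toy_pairs = [("the", "die"), ("cat", "Katze"), ("zyx", "qwv")]
covered, ratio = dictionary_coverage(toy_pairs, {"the", "cat"}, {"die", "Katze"})
print(covered)
print(f"{ratio:.2f}")
```

A low coverage would suggest loading more than the top 100,000 words or choosing a different word list.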


In [8]:
import numpy as np

'''
Function to create a list of word embeddings that is parallel to a bilingual list of words

Parameters:
Bilingual list of words, embeddings in the first language, embeddings in the second language

Returns:
Two numpy arrays of embeddings that correspond to the bilingual word list
'''
def extract_word_embeddings(bilingual_pairs, en_embeddings, de_embeddings):
    en_vecs = []
    de_vecs = []

    for en_word, de_word in bilingual_pairs:
        if en_word in en_embeddings and de_word in de_embeddings:
            en_vecs.append(en_embeddings[en_word])
            de_vecs.append(de_embeddings[de_word])

    # Convert lists to numpy arrays
    en_vecs = np.array(en_vecs)
    de_vecs = np.array(de_vecs)

    return en_vecs, de_vecs

# Extract English and German embeddings for the bilingual lexicon
en_vecs, de_vecs = extract_word_embeddings(en_de_pairs, en_embeddings, de_embeddings)

print(f"Extracted {en_vecs.shape[0]} aligned word vectors in English.")
print(f"Extracted {de_vecs.shape[0]} aligned word vectors in German.\n")

print(de_vecs[0],"\n")
print(len(de_vecs[0]))

Extracted 22546 aligned word vectors in English.
Extracted 22546 aligned word vectors in German.

[-5.72045334e-03  1.09642027e-02  9.52727944e-02 -2.60144435e-02
 -2.86022667e-03  1.80534795e-01  2.54015401e-02  3.55485342e-02
  6.81006350e-03 -9.54089984e-02 -1.16452100e-02 -5.85665507e-03
 -6.66705295e-02 -2.73083560e-02  8.17207620e-03 -1.15771091e-03
 -3.94983683e-03 -9.94269364e-03  1.40968319e-02 -6.12905715e-04
  5.92475571e-03 -3.13262939e-02  6.81006408e-04  7.20504746e-02
 -4.89643589e-02  1.15771091e-03  8.64878111e-03 -3.07814889e-02
 -2.79212627e-03 -8.80541205e-02  4.79428507e-02 -2.42438260e-02
 -3.42546217e-02 -1.54111743e-01  7.49107043e-04  8.98928382e-03
 -4.69894381e-03  7.30038807e-02  1.37699485e-01 -1.71613600e-02
 -6.12905715e-04 -1.25305178e-02 -3.51399295e-02 -4.28353027e-02
 -4.42654174e-03  6.50361106e-02  8.85308348e-03  2.60144435e-02
  1.28710205e-02 -2.56807506e-01  4.76704445e-04 -1.54588455e-02
 -5.24374889e-03 -2.99642817e-03 -3.38460170e-02  4.33801

-----------
## **Embedding Alignment**

We will now use the dictionary and embeddings to align the two vector spaces. The English vector space will be aligned to the German vector space using the [Procrustes](https://en.wikipedia.org/wiki/Orthogonal_Procrustes_problem) alignment method.

Given two matrices, Procrustes finds an orthogonal matrix which most closely maps one input matrix to the other. As a first step, we need to compute this orthogonal transformation matrix.  
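
To illustrate the idea on synthetic data, the sketch below rotates a random matrix `X` with a known orthogonal matrix `Q` and then recovers that rotation from `X` and `Y` alone. With row vectors, the solution that minimizes the Frobenius norm of `XW - Y` is `W = U Vᵀ`, where `U S Vᵀ` is the singular value decomposition of `XᵀY`. All names and sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "source" space: 50 points in 4 dimensions
X = rng.normal(size=(50, 4))

# A random orthogonal matrix, obtained from a QR decomposition
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))

# Synthetic "target" space: the source space rotated by Q
Y = X @ Q

# Orthogonal Procrustes: SVD of X^T Y, then W = U V^T
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

print(np.allclose(W @ W.T, np.eye(4)))  # W is orthogonal
print(np.allclose(X @ W, Y))            # W maps X onto Y
```

With real embeddings the two spaces are not exact rotations of each other, so `X @ W` only approximates `Y`.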



In [9]:
"""
Function to perform orthogonal Procrustes alignment to learn a mapping from X to Y.

Parameters:
X (numpy array): Source language word embeddings (English)
Y (numpy array): Target language word embeddings (German)

Returns:
W (numpy array): Orthogonal transformation matrix
"""
def orthogonal_procrustes(X, Y):
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)

    # Compute matrix product of X^T and Y
    M = np.dot(X.T, Y)

    # Perform Singular Value Decomposition (SVD) on the matrix M
    U, _, Vt = np.linalg.svd(M)

    # Compute the orthogonal transformation matrix W
    W = np.dot(U, Vt)

    return W

W = orthogonal_procrustes(en_vecs, de_vecs)

print("Orthogonal mapping matrix learned.")

print(W)
print(len(W))

Orthogonal mapping matrix learned.
[[ 0.04907615  0.02510272 -0.10426025 ... -0.12945467  0.08390459
   0.02807453]
 [ 0.0238353  -0.0221569  -0.00390244 ... -0.00869184  0.11116306
  -0.0604249 ]
 [ 0.0155552   0.11592178  0.06893986 ... -0.12564601 -0.00331625
  -0.01412949]
 ...
 [ 0.02709159 -0.03738086  0.03347469 ... -0.07060888  0.04953825
   0.03611384]
 [ 0.2748923  -0.0502222   0.07045681 ...  0.07736033  0.01274888
  -0.04421226]
 [-0.02611824 -0.01001144  0.10607699 ...  0.02234203 -0.03937639
   0.01735137]]
300


In a second step, the learned orthogonal matrix is applied to the English vector space to map it onto the German vector space. Here we can transform the entire vector space of 100,000 embeddings.
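
Because an orthogonal matrix preserves lengths and angles, mapping unit vectors yields unit vectors again, and the similarities within the mapped space stay the same. The following sketch checks this on toy data and applies the mapping to all vectors in one matrix product; `W_toy`, `E`, and `vocab` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# A random orthogonal matrix standing in for the learned mapping
W_toy, _ = np.linalg.qr(rng.normal(size=(5, 5)))

# Three toy unit-length embeddings, stacked row-wise
vocab = ["a", "b", "c"]
E = rng.normal(size=(3, 5))
E /= np.linalg.norm(E, axis=1, keepdims=True)

# Map all vectors at once instead of looping over the dictionary
mapped = E @ W_toy
mapped_dict = dict(zip(vocab, mapped))

norms = np.linalg.norm(mapped, axis=1)
print(np.allclose(norms, 1.0))                    # lengths are preserved
print(np.allclose(E @ E.T, mapped @ mapped.T))    # pairwise similarities are preserved
```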

In [10]:
"""
Apply the learned orthogonal mapping to the source language embeddings.

Parameters:
embeddings (dict): Source language embeddings (English)
W (numpy array): Orthogonal transformation matrix

Returns:
mapped_embeddings (dict): Transformed embeddings
"""
def apply_mapping(embeddings, W):
    mapped_embeddings = {}
    for word, vec in embeddings.items():
        mapped_vec = np.dot(vec, W)
        # Normalize the mapped vector
        mapped_vec = mapped_vec / np.linalg.norm(mapped_vec)
        mapped_embeddings[word] = mapped_vec
    return mapped_embeddings

aligned_en_embeddings = apply_mapping(en_embeddings, W)

print(f"Aligned {len(aligned_en_embeddings)} English embeddings into the German space.\n")
print(aligned_en_embeddings['good'])

Aligned 100000 English embeddings into the German space.

[-0.02754474 -0.00419841  0.02682907 -0.08604427 -0.02118373 -0.05984071
 -0.08273644  0.04499711  0.00253932 -0.01150167  0.06392089 -0.08514164
  0.00175307 -0.04745492 -0.00339004 -0.0455097  -0.0571388   0.00212817
  0.10568504 -0.12549932 -0.02589734 -0.05530401 -0.02861719  0.03910202
 -0.00954981 -0.05215105  0.00966217  0.01890708 -0.04717619 -0.04405127
  0.05568903  0.03706347 -0.07151513 -0.00591891  0.05165812  0.02123319
  0.00586968  0.16810946  0.12827452 -0.061162   -0.06618541  0.01465025
 -0.01445273 -0.02892472 -0.02132701  0.05696031  0.03586195  0.05359138
  0.00192775 -0.10935646  0.0332997  -0.03852529 -0.07921883 -0.01483064
  0.10750414  0.03537637  0.0513382   0.03002121 -0.05640733 -0.11594495
 -0.01558273  0.03063875 -0.01620898 -0.00310953 -0.01153558 -0.02249036
  0.05014528 -0.01126666  0.01947777 -0.10998777  0.1124676   0.02874888
  0.00449159  0.01496598 -0.01910733  0.01091445 -0.0195179  -0.02

-----------
## **Evaluation**

In this part, you will explore two different tasks for evaluating the final vector space:


1.   Word Translation
2.   Cross-Lingual Analogy Completion



### Word Translation

We will now use the bilingual word list downloaded from MUSE to evaluate the ability of our newly created aligned embedding space to translate words from English to German.

A function that takes an English word as input and outputs its nearest neighbors in the German vector space is already provided for your convenience.
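
When many words are queried, it can be faster to build the matrix of target vectors once and reuse it for every query. For unit-length vectors, cosine similarity reduces to a dot product. The sketch below shows this idea on toy data; `build_index` and `nearest_words` are hypothetical helpers and not part of the provided notebook code.

```python
import numpy as np

def build_index(embeddings):
    """Stack all vectors of a {word: vector} dict into one row-normalized matrix."""
    words = list(embeddings.keys())
    matrix = np.array([embeddings[w] for w in words], dtype=np.float32)
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    return words, matrix

def nearest_words(query_vec, words, matrix, top_k):
    """Return the top_k words whose vectors have the highest cosine similarity to query_vec."""
    query = query_vec / np.linalg.norm(query_vec)
    similarities = matrix @ query
    best = np.argsort(-similarities)[:top_k]
    return [words[i] for i in best]

toy_de = {
    "der": np.array([1.0, 0.0]),
    "Haus": np.array([0.0, 1.0]),
    "die": np.array([0.9, 0.1]),
}
words, matrix = build_index(toy_de)
print(nearest_words(np.array([1.0, 0.0]), words, matrix, 2))
# → ['der', 'die']
```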

In [11]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def get_nn(word, aligned_en_embeddings, de_embeddings, top_k):
    #print("Nearest neighbors of \"%s\":" % word)
    en_vec = aligned_en_embeddings[word]
    de_words = list(de_embeddings.keys())
    de_vecs = np.array(list(de_embeddings.values()))

    # Compute cosine similarity between the English word vector and all German word vectors
    en_vec = en_vec / np.linalg.norm(en_vec)
    de_vecs_norm = de_vecs / np.linalg.norm(de_vecs, axis=1, keepdims=True)
    similarities = cosine_similarity([en_vec], de_vecs_norm).flatten()

    # Get top_k most similar German words
    nearest_idxs = similarities.argsort()[-top_k:][::-1]
    nearest_words = [de_words[i] for i in nearest_idxs]

    return nearest_words

en_word = 'the'
nearest_neighbors = get_nn(en_word, aligned_en_embeddings, de_embeddings, 5)
print(nearest_neighbors)

['der', 'die', 'den', 'dem', 'besagten']


👋 ⚒ Use the already downloaded bilingual word list to evaluate the ability of our aligned vector space to translate from English to German. The output of this task should be the **accuracy** calculated on **1000 words** from the word list, i.e., for how many of the first 1000 entries the German translation from the MUSE word list appears among the five nearest German neighbors of the English word.

Use the provided function `get_nn` to obtain the *k* nearest words in the vector space in German, given an English input word.
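
One possible variant of this metric, sketched here under the assumption that a source word counts as correct if any of its listed translations appears among the *k* neighbors, groups the word list by source word first. The function `precision_at_k` and the toy neighbor lookup are hypothetical; in the notebook, the lookup would be a call to `get_nn`.

```python
from collections import defaultdict

def precision_at_k(pairs, neighbor_fn, k=5):
    """Fraction of source words with at least one gold translation among the top k neighbors."""
    gold = defaultdict(set)
    for src, tgt in pairs:
        gold[src].add(tgt)
    hits = sum(1 for src, targets in gold.items() if targets & set(neighbor_fn(src, k)))
    return hits / len(gold) if gold else 0.0

# Toy data: "the" has two valid translations, "and" is translated incorrectly
toy_pairs = [("the", "die"), ("the", "der"), ("and", "und"), ("was", "war")]
toy_neighbors = {"the": ["der", "den"], "and": ["oder", "sowie"], "was": ["war", "wurde"]}

score = precision_at_k(toy_pairs, lambda w, k: toy_neighbors[w][:k], k=2)
print(f"{score:.2f}")
```

This differs from counting every pair separately: a word with five listed translations can contribute at most one hit instead of up to five.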


In [12]:
print(en_de_pairs[3])
print(en_de_pairs[3][0])
print(en_de_pairs[3][1])
print(type(en_de_pairs))

('the', 'den')
the
den
<class 'list'>


In [44]:
def calculate_accuracy(aligned_en_embeddings, de_embeddings, en_de_pairs):
    correct_count = 0
    total_words = len(en_de_pairs)
    for en_word, correct_translation in en_de_pairs:
        # Pairs whose English word has no embedding count as incorrect
        if en_word in aligned_en_embeddings:
            nearest_neighbors = get_nn(en_word, aligned_en_embeddings, de_embeddings, top_k=5)
            if correct_translation in nearest_neighbors:
                correct_count += 1
    accuracy = correct_count / total_words
    return accuracy

first_1000_pairs = en_de_pairs[:1000]

accuracy = calculate_accuracy(aligned_en_embeddings, de_embeddings, first_1000_pairs)
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 43.70%


### Cross-Lingual Analogy Completion

An analogy compares two related pairs of words, e.g. *man is to woman as king is to queen*. This task can be extended to use analogies for translation, e.g. *man is to woman as Mann is to Frau*.


👋 ⚒ Create **twenty** examples of crosslingual analogies and see whether the aligned vector space is able to correctly complete analogies across languages, e.g. positive=(queen, König), negative=(king). You can use examples from the analogy text file in GitHub for this purpose.

Hints:


*   Multilingual Analogies: To create the examples, all you need is a translation of an existing analogy. You can use the already loaded bilingual word list to obtain the translations and the existing analogy list (anlogies.txt on Github) to obtain analogies.
*   Implementation: In the code below, you only need to change the embeddings to the German embeddings for `c` and provide the function with the German embeddings.
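
The second hint can be sketched on toy data as follows: `a` and `b` come from the aligned English space, `c` comes from the German space, and the result is searched among the German words. The three-dimensional toy vectors, the helper names `unit` and `complete_analogy`, and the plain `b + c - a` formulation are assumptions made for illustration.

```python
import numpy as np

def unit(vec):
    """Scale a vector to unit length."""
    return vec / np.linalg.norm(vec)

def complete_analogy(target_space, vec_a, vec_b, vec_c, top_k):
    """Return the top_k words in target_space closest to b + c - a."""
    words = list(target_space.keys())
    matrix = np.array([unit(target_space[w]) for w in words])
    query = unit(unit(vec_b) + unit(vec_c) - unit(vec_a))
    best = np.argsort(-(matrix @ query))[:top_k]
    return [words[i] for i in best]

# Toy dimensions: [person, royalty, female]; both spaces are assumed to be perfectly aligned
toy_en = {
    "man": np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 0.0, 1.0]),
}
toy_de = {
    "Mann": np.array([1.0, 0.0, 0.0]),
    "Frau": np.array([1.0, 0.0, 1.0]),
    "König": np.array([1.0, 1.0, 0.0]),
    "Königin": np.array([1.0, 1.0, 1.0]),
}

# man is to woman as König is to ?
result = complete_analogy(toy_de, toy_en["man"], toy_en["woman"], toy_de["König"], 1)
print(result)
# → ['Königin']
```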

In [13]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def norm(vec):
  return vec / np.linalg.norm(vec)

def get_target_words(embeddings, vec_a, vec_b, vec_c, top_k):
    words = list(embeddings.keys())
    vecs = np.array(list(embeddings.values()))

    # Compute analogy based on input vectors b+c-a (woman+king-man)
    positive = norm(vec_b+vec_c)
    target_vec = norm(positive - vec_a)
    vecs_norm = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    similarities = cosine_similarity([target_vec], vecs_norm).flatten()

    # Get top_k most similar words for the retrieved result vector d
    nearest_idxs = similarities.argsort()[-top_k:][::-1]
    nearest_words = [words[i] for i in nearest_idxs]

    return nearest_words

vec_a = norm(aligned_en_embeddings["man"])
vec_b = norm(aligned_en_embeddings["woman"])
vec_c = norm(aligned_en_embeddings["king"])

nearest_neighbors = get_target_words(aligned_en_embeddings, vec_a, vec_b, vec_c, 5)
print(nearest_neighbors)

['queen', 'king', 'Queen', 'queens', 'royal']


In [14]:
vec_a = norm(aligned_en_embeddings["man"])
vec_b = norm(aligned_en_embeddings["woman"])
vec_c = norm(aligned_en_embeddings["König"])

nearest_neighbors = get_target_words(aligned_en_embeddings, vec_a, vec_b, vec_c, 5)
print(nearest_neighbors)

['König', 'Prinz', 'Birgit', 'Amalia', 'Wien']


In [15]:
vec_a = norm(aligned_en_embeddings["Mann"])
vec_b = norm(aligned_en_embeddings["Frau"])
vec_c = norm(aligned_en_embeddings["king"])

nearest_neighbors = get_target_words(aligned_en_embeddings, vec_a, vec_b, vec_c, 5)
print(nearest_neighbors)

['king', 'queen', 'empress', 'consort', 'princess']


In [16]:
# Interesting: the lowercased German words are not in the vocabulary (lookups are case-sensitive)
vec_a = norm(aligned_en_embeddings["mann"])
vec_b = norm(aligned_en_embeddings["frau"])
vec_c = norm(aligned_en_embeddings["king"])

nearest_neighbors = get_target_words(aligned_en_embeddings, vec_a, vec_b, vec_c, 5)
print(nearest_neighbors)

KeyError: 'mann'

In [17]:
# Loading analogy file
!wget https://raw.githubusercontent.com/dgromann/cl_intro_ws2024/master/exercises/HomeExercise2.txt

--2025-01-22 14:30:44--  https://raw.githubusercontent.com/dgromann/cl_intro_ws2024/master/exercises/HomeExercise2.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 272272 (266K) [text/plain]
Saving to: ‘HomeExercise2.txt’


2025-01-22 14:30:44 (6.00 MB/s) - ‘HomeExercise2.txt’ saved [272272/272272]


In [18]:
# Read all lines and close the file automatically afterwards
with open("HomeExercise2.txt", 'r', encoding='utf-8') as analogy_file:
    analogy_lines = analogy_file.readlines()
print(analogy_lines[:5])

[': capital-common-countries\n', 'Athens Greece Baghdad Iraq\n', 'Athens Greece Berlin Germany\n', 'Athens Greece Cairo Egypt\n', 'Athens Greece Canberra Australia\n']


In [19]:
preprocessed_analogy_lines = []
for line in analogy_lines:
  # Skip section headers such as ': capital-common-countries'
  if ':' in line:
    continue
  preprocessed_analogy_lines.append(line.split())

print(preprocessed_analogy_lines[:5])
print(preprocessed_analogy_lines[0][1])

[['Athens', 'Greece', 'Baghdad', 'Iraq'], ['Athens', 'Greece', 'Berlin', 'Germany'], ['Athens', 'Greece', 'Cairo', 'Egypt'], ['Athens', 'Greece', 'Canberra', 'Australia'], ['Athens', 'Greece', 'Helsinki', 'Finland']]
Greece


In [20]:
# Check how to access values from the tuples
for en_word, de_word in en_de_pairs[:5]:
  print(en_word)
  print(de_word)

the
die
the
der
the
dem
the
den
the
das


In [22]:
# Analogy check for English only
for analogy in preprocessed_analogy_lines[:20]:
    vec_a = norm(aligned_en_embeddings[analogy[0]])
    vec_b = norm(aligned_en_embeddings[analogy[1]])
    vec_c = norm(aligned_en_embeddings[analogy[2]])
    nearest_neighbors = get_target_words(aligned_en_embeddings, vec_a, vec_b, vec_c, 5)
    print(f"{analogy[0]} is to {analogy[1]} as {analogy[2]} is to {nearest_neighbors}")

Athens is to Greece as Baghdad is to ['Iraq', 'Baghdad', 'Iraqis', 'Iraqi', 'Mosul']
Athens is to Greece as Berlin is to ['Germany', 'Berlin', 'GDR', 'German', 'Deutschland']
Athens is to Greece as Cairo is to ['Egypt', 'Cairo', 'Morocco', 'Egyptian', 'Sisi']
Athens is to Greece as Canberra is to ['Canberra', 'Australia', 'Zealand', 'Australian', 'Australians']
Athens is to Greece as Helsinki is to ['Finland', 'Finnish', 'Helsinki', 'Scandinavia', 'Sweden']
Athens is to Greece as London is to ['England', 'London', 'U.K', 'Britain', 'U.K.']
Athens is to Greece as Madrid is to ['Spain', 'Madrid', 'Portugal', 'Ronaldo', 'Benzema']
Athens is to Greece as Moscow is to ['Russia', 'Russian', 'Ukraine', 'Kremlin', 'Moscow']
Athens is to Greece as Ottawa is to ['Canada', 'Ottawa', 'Quebec', 'Canadians', 'Ontario']
Athens is to Greece as Paris is to ['France', 'Paris', 'Parisian', 'Hollande', 'Sarkozy']
Athens is to Greece as Rome is to ['Italy', 'Vatican', 'papacy', 'Papacy', 'pontiff']
Athens 

In [34]:
nearest_neighbors_list = []
for analogy in preprocessed_analogy_lines:
  # Stop once twenty cross-lingual analogies have been collected
  if len(nearest_neighbors_list) >= 20:
    break
  vec_a = norm(aligned_en_embeddings[analogy[0]])
  vec_b = norm(aligned_en_embeddings[analogy[1]])
  for en_word, de_word in en_de_pairs:
    # Capitalize the German translation (the word list entries appear in lowercase)
    de_word = de_word.capitalize()
    # Note: the German word is looked up in the aligned English vocabulary,
    # so only German words that also occur there are used
    if en_word == analogy[2] and de_word in aligned_en_embeddings:
      vec_c = norm(aligned_en_embeddings[de_word])
      nearest_neighbors = get_target_words(aligned_en_embeddings, vec_a, vec_b, vec_c, 5)
      nearest_neighbors_list.append(nearest_neighbors)
      print(f"{analogy[0]} is to {analogy[1]} as {de_word} is to {nearest_neighbors}")
    if len(nearest_neighbors_list) == 20:
      break

boy is to girl as Brothers is to ['Brothers', 'Sisters', 'Bros', 'Bros.', 'Associates']
boy is to girl as Dad is to ['Mom', 'Dad', 'Mum', 'Grandma', 'Husband']
boy is to girl as Papa is to ['Papa', 'Mama', 'Nana', 'Mamma', 'Caffe']
boy is to girl as Seine is to ['Seine', 'Loire', 'Meuse', 'Moselle', 'Montmartre']
boy is to girl as Sein is to ['Sein', 'Aung', 'Kyi', 'Frau', 'Zeit']
boy is to girl as Mann is to ['Mann', 'Whitney', 'Weil', 'McIntyre', 'Kerr']
boy is to girl as King is to ['King', 'Queen', 'Empress', 'Kings', 'Crown']
boy is to girl as König is to ['König', 'Prinz', 'Müller', 'Bernhard', 'Graf']
boy is to girl as Mann is to ['Mann', 'Whitney', 'Weil', 'McIntyre', 'Kerr']
boy is to girl as Man is to ['Woman', 'Man', 'Girl', 'Bitch', 'Lover']
boy is to girl as Sohn is to ['Sohn', 'Ahn', 'Hwang', 'Chairwoman', 'Blume']
boy is to girl as Sons is to ['Sons', 'Daughters', 'Girlfriends', 'Brides', 'Sisters']
brother is to sister as Brothers is to ['Sisters', 'Brothers', 'Bros', '

## **Another Language Pair**

I have chosen Spanish and Italian because the MUSE project provides a dictionary for this pair.

P.S. I am currently learning Italian and have never learnt Spanish; however, I want to try to align these languages and see their similarities.

In [24]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.es.300.vec.gz  # Spanish
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.it.300.vec.gz  # Italian

--2025-01-22 14:36:05--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.es.300.vec.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 3.171.22.68, 3.171.22.33, 3.171.22.13, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|3.171.22.68|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1285580896 (1.2G) [binary/octet-stream]
Saving to: ‘cc.es.300.vec.gz’


2025-01-22 14:36:24 (65.6 MB/s) - ‘cc.es.300.vec.gz’ saved [1285580896/1285580896]

--2025-01-22 14:36:24--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.it.300.vec.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 3.171.22.68, 3.171.22.33, 3.171.22.13, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|3.171.22.68|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1272825284 (1.2G) [binary/octet-stream]
Saving to: ‘cc.it.300.vec.gz’


2025-01-22 14:37:01 (33.9 MB/s) - ‘cc.it.300.vec.gz’ saved [1272825284/12728

In [25]:
es_embeddings = load_fasttext_embeddings('cc.es.300.vec.gz', 100000)
it_embeddings = load_fasttext_embeddings('cc.it.300.vec.gz', 100000)

print(f"Loaded {len(es_embeddings)} Spanish embeddings")
print(f"Loaded {len(it_embeddings)} Italian embeddings")

Loaded 100000 Spanish embeddings
Loaded 100000 Italian embeddings


In [37]:
!wget https://dl.fbaipublicfiles.com/arrival/dictionaries/es-it.0-5000.txt

--2025-01-22 15:06:34--  https://dl.fbaipublicfiles.com/arrival/dictionaries/es-it.0-5000.txt
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 3.167.112.66, 3.167.112.53, 3.167.112.51, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|3.167.112.66|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 87009 (85K) [text/plain]
Saving to: ‘es-it.0-5000.txt’


2025-01-22 15:06:35 (659 KB/s) - ‘es-it.0-5000.txt’ saved [87009/87009]



In [38]:
def load_bilingual_word_list_es_it(file_path):
    bilingual_dict = []
    with open(file_path, 'r', encoding='utf-8') as f:
      for line in f:
          tokens = str(line).strip().split('\t')
          es_word = tokens[0]
          it_word = tokens[1]
          word_pair = (es_word, it_word)
          bilingual_dict.append(word_pair)
    return bilingual_dict

# The delimiter in this dictionary is a tab ('\t'), so the function was adapted slightly

es_it_pairs = load_bilingual_word_list_es_it('es-it.0-5000.txt')

print(es_it_pairs[:10])

[('que', 'che'), ('del', 'del'), ('los', 'gli'), ('con', 'con'), ('las', 'le'), ('una', 'una'), ('categoría', 'categoria'), ('para', 'per'), ('como', 'come'), ('fue', 'fu')]


In [39]:
es_vecs, it_vecs = extract_word_embeddings(es_it_pairs, es_embeddings, it_embeddings)

print(f"Extracted {es_vecs.shape[0]} aligned word vectors in Spanish.")
print(f"Extracted {it_vecs.shape[0]} aligned word vectors in Italian.\n")

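# Note: this prints the first English vector from the earlier cell; use es_vecs[0] to inspect the first Spanish vector
print(en_vecs[0],"\n")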
print(len(en_vecs[0]))

Extracted 4559 aligned word vectors in Spanish.
Extracted 4559 aligned word vectors in Italian.

[-4.08684798e-02  5.84964715e-02 -1.03554558e-02  3.53350304e-02
 -2.71139033e-02  1.67584475e-02  5.45440055e-03 -1.28850332e-02
 -1.43079208e-02 -1.58098573e-03 -8.07093158e-02  4.66390792e-03
  2.03156658e-02 -2.05528131e-03 -4.63228822e-02 -2.98806280e-02
  1.28850332e-02  1.15411952e-02 -6.95633702e-03 -1.39126740e-02
 -6.71918970e-03 -6.16584392e-03 -1.44660193e-02  6.95633702e-03
  1.02764065e-03 -7.41482303e-02  1.09878499e-02  1.17783435e-02
 -3.11454181e-02 -2.32404899e-02  7.43063260e-03 -1.99204199e-02
 -8.22112523e-03 -1.74936071e-01 -1.81022864e-02 -7.03538628e-03
 -2.54538711e-02  6.49785101e-02  1.66003488e-03  2.22918987e-02
  5.69154834e-03 -7.19348527e-03 -2.78253481e-02 -1.40707726e-02
 -5.58087975e-02  4.98010479e-02 -7.27253407e-03 -1.76279899e-02
 -4.42675967e-03  4.07103822e-02 -2.42681298e-02  3.44654880e-02
 -8.69542081e-03 -4.38723527e-02  7.03538628e-03 -5.320016

In [40]:
W = orthogonal_procrustes(es_vecs, it_vecs)

print("Orthogonal mapping matrix learned.")

print(W)
print(len(W))

Orthogonal mapping matrix learned.
[[ 0.00455346  0.07381329  0.02808644 ...  0.0875571  -0.038165
   0.02927761]
 [-0.04569927  0.07067674  0.01982277 ... -0.08923655 -0.05170725
  -0.02111657]
 [ 0.00264618  0.0182407   0.06173337 ...  0.06884508 -0.02480747
   0.03904527]
 ...
 [ 0.00861881  0.09381516  0.00848095 ... -0.03586415  0.0121546
  -0.05274847]
 [ 0.06531589 -0.01833704 -0.00022117 ... -0.08379734 -0.01936114
   0.055649  ]
 [ 0.02092096 -0.01288716  0.03741272 ...  0.00573095  0.06037452
  -0.0322739 ]]
300


In [41]:
aligned_es_embeddings = apply_mapping(es_embeddings, W)

print(f"Aligned {len(aligned_es_embeddings)} Spanish embeddings into the Italian space.\n")
print(aligned_es_embeddings['que'])
print(len(aligned_es_embeddings['que']))

Aligned 100000 Spanish embeddings into the Italian space.

[ 8.65251385e-03 -6.30816817e-02 -7.19834119e-02 -1.57016739e-02
  1.87086724e-02 -3.86627987e-02 -1.98480673e-04 -1.18170762e-02
 -7.03878626e-02 -3.49873491e-02 -1.13263670e-02 -3.20260115e-02
  4.04466279e-02  5.63899167e-02  1.57423429e-02  2.25290563e-02
  1.27524678e-02 -2.09221663e-03  3.59075442e-02 -1.23399310e-04
  3.23215723e-02 -1.38323512e-02 -1.98801104e-02  2.69710496e-02
  2.57578529e-02  3.06262970e-02 -5.63047975e-02  1.08472211e-03
  4.57612276e-02 -1.08298048e-01  1.18853739e-02 -1.05023548e-01
  5.10125794e-02  2.77649276e-02 -5.94361722e-02  2.92340666e-02
  3.03509720e-02 -4.09239158e-02  2.63330359e-02  2.19540261e-02
  3.49285826e-02 -1.33261038e-02 -1.55799249e-02 -8.30226112e-04
 -5.41507676e-02  6.02692515e-02 -4.13471870e-02  4.14201915e-02
  8.74046143e-03  6.57625124e-02  2.57003140e-02 -2.51871049e-02
 -1.10416883e-03  3.52083594e-02  5.86096048e-02 -5.64280823e-02
 -3.66517645e-03  7.22173676e-0

In [42]:
# Finding nearest neighbors
es_word = 'que'
nearest_neighbors = get_nn(es_word, aligned_es_embeddings, it_embeddings, 5)
print(nearest_neighbors)

['che', 'invece', 'non', 'però', 'perché']


In [67]:
# Note: 'donna' is an Italian word; here it is looked up in the aligned Spanish embeddings
it_word = 'donna'
nearest_neighbors = get_nn(it_word, aligned_es_embeddings, it_embeddings, 5)
print(nearest_neighbors)

['regina', 'perfida', 'bionda', 'Callas', 'femme']


In [45]:
# Calculating accuracy
first_es_it_1000_pairs = es_it_pairs[:1000]

accuracy = calculate_accuracy(aligned_es_embeddings, it_embeddings, first_es_it_1000_pairs)
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 88.80%


In [59]:
# Analogy test
vec_a = norm(aligned_es_embeddings["hombre"])
vec_b = norm(aligned_es_embeddings["mujer"])
vec_c = norm(aligned_es_embeddings["rey"])

nearest_neighbors = get_target_words(aligned_es_embeddings, vec_a, vec_b, vec_c, 5)
print(nearest_neighbors)

['reina', 'rey', 'princesa', 'emperatriz', 'Reina']


In [60]:
# Analogy test
vec_a = norm(aligned_es_embeddings["hombre"])
vec_b = norm(aligned_es_embeddings["mujer"])
vec_c = norm(aligned_es_embeddings["re"])  # 'king' in Italian, looked up in the aligned Spanish embeddings

nearest_neighbors = get_target_words(aligned_es_embeddings, vec_a, vec_b, vec_c, 5)
print(nearest_neighbors)

['re', 'ta', 'so', 'ña', 'We']
