<a href="https://colab.research.google.com/github/Saputoa21/ADS_2024_Saputoa/blob/master/Bonus_Exercise_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bonus Exercises 1: Aligning Multilingual Embedding Spaces**



This notebook is the first bonus exercise for the lecture Multilingual and Crosslingual Methods and Language Resources (2024W 340168-1). Each successfully completed bonus exercise earns up to three points, which are added to the points of the final exam. The tasks to be completed in this notebook are marked with 👋 ⚒.



In this notebook, you will perform and evaluate a supervised method for aligning the embedding spaces of two languages. The examples in the notebook use the language pair English-German; however, feel free to switch to any other pair for which embeddings and dictionaries are available (see below).

-----------
## **Preparing the Embeddings and Data**

In this notebook, we will be using fastText embeddings, a character-based version of the word2vec skip-gram method. Details on the method can be found in the [original publication](https://aclanthology.org/Q17-1010.pdf) and on [this website](https://fasttext.cc/).

Pretrained fastText embeddings are available in [157 languages](https://fasttext.cc/docs/en/crawl-vectors.html). The following code cell loads the fastText embeddings for English and German.

👋 ⚒ Please change the following download command if you wish to align other languages than English and German.

Before you decide on a final language pair, please make sure that:
1.   There are pretrained embeddings for this language (see [here](https://fasttext.cc/docs/en/crawl-vectors.html))
2.   There is a bilingual word list available (see the [MUSE GitHub](https://github.com/facebookresearch/MUSE/tree/main) section "Ground-truth bilingual dictionaries")

If the embeddings are available, change the two-letter ISO code in `cc.en.300.vec.gz` and `cc.de.300.vec.gz` to the language(s) of your choice.

In [1]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz  # English
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.de.300.vec.gz  # German

--2025-01-06 21:30:42--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 3.163.189.108, 3.163.189.51, 3.163.189.96, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|3.163.189.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1325960915 (1.2G) [binary/octet-stream]
Saving to: ‘cc.en.300.vec.gz’


2025-01-06 21:30:53 (118 MB/s) - ‘cc.en.300.vec.gz’ saved [1325960915/1325960915]

--2025-01-06 21:30:53--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.de.300.vec.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 3.163.189.108, 3.163.189.51, 3.163.189.96, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|3.163.189.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1278030050 (1.2G) [binary/octet-stream]
Saving to: ‘cc.de.300.vec.gz’


2025-01-06 21:31:03 (123 MB/s) - ‘cc.de.300.vec.gz’ saved [127803

### Loading the Embeddings

As a next step, we will unzip and load the embeddings. For this alignment task, we will only use the 100,000 most frequent words of each language to speed up processing. This cutoff also depends on the length of the available bilingual word lists.

In [2]:
import gzip
import numpy as np

def load_fasttext_embeddings(file_path, top_n):
    embeddings = {}
    with gzip.open(file_path, 'rb') as f:
        for i, line in enumerate(f):
            # Line 0 is a header line
            if i > 0 and i <= top_n:
              tokens = line.decode('utf-8').strip().split(' ')
              word = tokens[0]
              vector = np.array(tokens[1:], dtype=np.float32)
              vector = vector / np.linalg.norm(vector)
              embeddings[word] = vector
    return embeddings

# Load the English and German embeddings for the 100,000 most frequent words
# fastText orders the embeddings by decreasing word frequency
en_embeddings = load_fasttext_embeddings('cc.en.300.vec.gz', 100000)
de_embeddings = load_fasttext_embeddings('cc.de.300.vec.gz', 100000)

print(f"Loaded {len(en_embeddings)} English embeddings")
print(f"Loaded {len(de_embeddings)} German embeddings")


Loaded 100000 English embeddings
Loaded 100000 German embeddings
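To make the file layout behind the loader more concrete, the following minimal sketch writes a tiny file in the `.vec` text format and reads it back. The toy words, values, and the name `toy.vec.gz` are made up for illustration; the only assumption about the real files is that the first line holds the vocabulary size and the dimensionality, followed by one word and its vector per line.

```python
import gzip
import os
import tempfile

# Hypothetical toy content in the fastText .vec text format:
# line 0 is the header "<vocabulary size> <dimension>",
# every following line is "<word> <v1> ... <vd>"
toy_lines = ["2 3", "the 0.1 0.2 0.3", "and 0.4 0.5 0.6"]
toy_path = os.path.join(tempfile.mkdtemp(), "toy.vec.gz")

with gzip.open(toy_path, "wt", encoding="utf-8") as f:
    f.write("\n".join(toy_lines) + "\n")

with gzip.open(toy_path, "rt", encoding="utf-8") as f:
    # The header tells us how many words and dimensions to expect
    vocab_size, dim = map(int, f.readline().split())
    # The first data line belongs to the most frequent word
    first_word, *values = f.readline().rstrip().split(" ")

print(vocab_size, dim, first_word, len(values))
```

Opening the file in text mode (`"rt"`) avoids the manual `decode('utf-8')` step; the loader above achieves the same result by decoding each line itself.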


Let us explore the format of the downloaded and loaded embeddings.

In [3]:
print(f'The loaded embeddings represent a {type(en_embeddings)} datatype.\n')
print(f'Each entry represents the word and the related embedding.\n')
print(f'We can query the word as a key and obtain the embedding, e.g. for good the embedding is {en_embeddings["good"]}.\n')
print(f'The dimensionality of these embeddings corresponds to {len(en_embeddings["good"])}.')

The loaded embeddings represent a <class 'dict'> datatype.

Each entry represents the word and the related embedding.

We can query the word as a key and obtain the embedding, e.g. for good the embedding is [-0.08404064 -0.05785208  0.00155124  0.1233691  -0.05985956  0.00565746
  0.11506542 -0.01505614  0.01587738 -0.00118624 -0.0886031   0.02126109
  0.00912493  0.00419747  0.01450865  0.0062962   0.07829193 -0.01815862
 -0.0549321  -0.02126109  0.01076742  0.07500695  0.01359615  0.00821244
  0.00638745 -0.05867332  0.03056853 -0.01916236  0.0617758   0.0275573
  0.06569952 -0.05192087 -0.03987596  0.00583996  0.04005846  0.05520585
 -0.00556621 -0.11187168 -0.03221102 -0.02463732 -0.01879736  0.0068437
 -0.0062962   0.03312351 -0.03020353  0.05292461  0.00757369 -0.05785208
 -0.05274212  0.00994618 -0.08440564  0.0142349  -0.03722973  0.00611371
 -0.05812582  0.05365461  0.06579077 -0.04918339 -0.13377152 -0.03695598
 -0.02290358 -0.04516842 -0.04763215 -0.06250579  0.04261344  0.0

### Downloading and Loading the Bilingual Word List

To perform this alignment, we will use a bilingual word list that is provided by the Multilingual Unsupervised and Supervised Embeddings (MUSE) project (see [here](https://github.com/facebookresearch/MUSE/tree/main) for all languages).

👋 ⚒ Please change the following downloading command to the language pair of your choice (as long as available on MUSE).


In [4]:
!wget https://dl.fbaipublicfiles.com/arrival/dictionaries/en-de.txt

--2025-01-06 21:32:21--  https://dl.fbaipublicfiles.com/arrival/dictionaries/en-de.txt
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 3.163.189.96, 3.163.189.51, 3.163.189.14, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|3.163.189.96|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1742131 (1.7M) [text/x-c++]
Saving to: ‘en-de.txt’


2025-01-06 21:32:21 (34.1 MB/s) - ‘en-de.txt’ saved [1742131/1742131]



### Creating a bilingual word list

As a next step, we will create a bilingual word list from the downloaded text file.

👋 ⚒ Create a list of tuples `[(en_word1, de_word1), (en_word2, de_word2),...]` from the downloaded text file in the following code cell. To complete this task, please complete the provided function `load_bilingual_word_list` where it says `Your code here`.

For English-German, the first ten tuples of the list look like this:

```
[('the', 'die'), ('the', 'der'), ('the', 'dem'), ('the', 'den'), ('the', 'das'), ('and', 'sowie'), ('and', 'und'), ('was', 'war'), ('was', 'wurde'), ('for', 'für')]
```



In [15]:
'''
Create a list of tuples that contain word translations

Parameters:
Text file with one bilingual word pair per line

Returns:
A list of tuples that each contains one bilingual word pair
'''
def load_bilingual_word_list(file_path):
    bilingual_dict = []
    with open(file_path, 'r', encoding='utf-8') as f:
      # Your code here
      for line in f:
          tokens = line.strip().split()  # split on any whitespace (space or tab)
          en_word = tokens[0]
          de_word = tokens[1]
          word_pair = (en_word, de_word)
          bilingual_dict.append(word_pair)
    return bilingual_dict

# Load English-German word pairs
en_de_pairs = load_bilingual_word_list('en-de.txt')

print(en_de_pairs[:10])

[('the', 'die'), ('the', 'der'), ('the', 'dem'), ('the', 'den'), ('the', 'das'), ('and', 'sowie'), ('and', 'und'), ('was', 'war'), ('was', 'wurde'), ('for', 'für')]


### Getting the Embeddings for our Word List

As a next step, we need to see which words from the word list have a vector representation in the embedding space for both languages and create a list of corresponding embeddings for both languages.


In [28]:
import numpy as np

'''
Function to create a list of word embeddings that is parallel to a bilingual list of words

Parameters:
Bilingual list of words, embeddings in the first language, embeddings in the second language

Returns:
Two numpy arrays of embeddings that correspond to the bilingual word list
'''
def extract_word_embeddings(bilingual_pairs, en_embeddings, de_embeddings):
    en_vecs = []
    de_vecs = []

    for en_word, de_word in bilingual_pairs:
        if en_word in en_embeddings and de_word in de_embeddings:
            en_vecs.append(en_embeddings[en_word])
            de_vecs.append(de_embeddings[de_word])

    # Convert lists to numpy arrays
    en_vecs = np.array(en_vecs)
    de_vecs = np.array(de_vecs)

    return en_vecs, de_vecs

# Extract English and German embeddings for the bilingual lexicon
en_vecs, de_vecs = extract_word_embeddings(en_de_pairs, en_embeddings, de_embeddings)

print(f"Extracted {en_vecs.shape[0]} aligned word vectors in English.")
print(f"Extracted {de_vecs.shape[0]} aligned word vectors in German.\n")

print(de_vecs[0],"\n")
print(len(de_vecs[0]))

Extracted 22546 aligned word vectors in English.
Extracted 22546 aligned word vectors in German.

[-5.72045334e-03  1.09642027e-02  9.52727944e-02 -2.60144435e-02
 -2.86022667e-03  1.80534795e-01  2.54015401e-02  3.55485342e-02
  6.81006350e-03 -9.54089984e-02 -1.16452100e-02 -5.85665507e-03
 -6.66705295e-02 -2.73083560e-02  8.17207620e-03 -1.15771091e-03
 -3.94983683e-03 -9.94269364e-03  1.40968319e-02 -6.12905715e-04
  5.92475571e-03 -3.13262939e-02  6.81006408e-04  7.20504746e-02
 -4.89643589e-02  1.15771091e-03  8.64878111e-03 -3.07814889e-02
 -2.79212627e-03 -8.80541205e-02  4.79428507e-02 -2.42438260e-02
 -3.42546217e-02 -1.54111743e-01  7.49107043e-04  8.98928382e-03
 -4.69894381e-03  7.30038807e-02  1.37699485e-01 -1.71613600e-02
 -6.12905715e-04 -1.25305178e-02 -3.51399295e-02 -4.28353027e-02
 -4.42654174e-03  6.50361106e-02  8.85308348e-03  2.60144435e-02
  1.28710205e-02 -2.56807506e-01  4.76704445e-04 -1.54588455e-02
 -5.24374889e-03 -2.99642817e-03 -3.38460170e-02  4.33801

-----------
## **Embedding Alignment**

We will now use the dictionary and embeddings to align the two vector spaces. The English vector space will be aligned to the German vector space using the [Procrustes](https://en.wikipedia.org/wiki/Orthogonal_Procrustes_problem) alignment method.

Given two matrices, Procrustes finds an orthogonal matrix which most closely maps one input matrix to the other. As a first step, we need to compute this orthogonal transformation matrix.  
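In formula terms, the method looks for the orthogonal matrix $W$ that minimizes $\lVert XW - Y \rVert_F$, and a minimizer is $W = UV^\top$, where $U \Sigma V^\top$ is the singular value decomposition of $X^\top Y$. The following minimal sketch checks this on synthetic data. All names (`X_toy`, `Q_true`, `W_hat`) and the random data are made up for illustration, and the row normalization used in the notebook's function is left out to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy data: 50 "source" vectors in 4 dimensions
X_toy = rng.normal(size=(50, 4))

# A random orthogonal matrix (from a QR decomposition) plays the hidden mapping
Q_true, _ = np.linalg.qr(rng.normal(size=(4, 4)))
Y_toy = X_toy @ Q_true

# Orthogonal Procrustes solution: SVD of X^T Y, then W = U V^T
U, _, Vt = np.linalg.svd(X_toy.T @ Y_toy)
W_hat = U @ Vt

is_orthogonal = np.allclose(W_hat.T @ W_hat, np.eye(4))
recovers_mapping = np.allclose(W_hat, Q_true)
print(is_orthogonal, recovers_mapping)
```

On real embeddings the two spaces are not exact rotations of each other, so the learned matrix only approximates the mapping between the dictionary pairs.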



In [18]:
"""
Function to perform orthogonal Procrustes alignment to learn a mapping from X to Y.

Parameters:
X (numpy array): Source language word embeddings (English)
Y (numpy array): Target language word embeddings (German)

Returns:
W (numpy array): Orthogonal transformation matrix
"""
def orthogonal_procrustes(X, Y):
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)

    # Compute matrix product of X^T and Y
    M = np.dot(X.T, Y)

    # Perform Singular Value Decomposition (SVD) on the matrix M
    U, _, Vt = np.linalg.svd(M)

    # Compute the orthogonal transformation matrix W
    W = np.dot(U, Vt)

    return W

W = orthogonal_procrustes(en_vecs, de_vecs)

print("Orthogonal mapping matrix learned.")


Orthogonal mapping matrix learned.


In a second step, the learned orthogonal matrix is applied to the English vector space to map it into the German vector space. Here we can transform the entire vector space of 100,000 embeddings.

In [None]:
"""
Apply the learned orthogonal mapping to the source language embeddings.

Parameters:
embeddings (dict): Source language embeddings (English)
W (numpy array): Orthogonal transformation matrix

Returns:
mapped_embeddings (dict): Transformed embeddings
"""
def apply_mapping(embeddings, W):
    mapped_embeddings = {}
    for word, vec in embeddings.items():
        mapped_vec = np.dot(vec, W)
        # Normalize the mapped vector
        mapped_vec = mapped_vec / np.linalg.norm(mapped_vec)
        mapped_embeddings[word] = mapped_vec
    return mapped_embeddings

aligned_en_embeddings = apply_mapping(en_embeddings, W)

print(f"Aligned {len(aligned_en_embeddings)} English embeddings into the German space.")


Aligned 100000 English embeddings into the German space.
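Because the mapping is orthogonal, it should leave vector lengths and cosine similarities within the English space unchanged, which means the re-normalization in `apply_mapping` only corrects floating-point drift. The sketch below illustrates this on random toy data; `E` and `W_toy` are hypothetical stand-ins for the embedding matrix and the learned mapping.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 5 unit-length "embeddings" and a random orthogonal matrix
E = rng.normal(size=(5, 4))
E = E / np.linalg.norm(E, axis=1, keepdims=True)
W_toy, _ = np.linalg.qr(rng.normal(size=(4, 4)))

# Map all rows at once; equivalent to np.dot(vec, W) applied row by row
mapped = E @ W_toy

norms_preserved = np.allclose(np.linalg.norm(mapped, axis=1), 1.0)
cosines_preserved = np.allclose(mapped @ mapped.T, E @ E.T)
print(norms_preserved, cosines_preserved)
```

The same matrix product could be used to map all 100,000 embeddings in a single call instead of looping over the dictionary.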


-----------
## **Evaluation**

In this part, you will explore two different tasks for evaluating the final vector space:


1.   Word Translation
2.   Cross-Lingual Analogy Completion



### Word Translation

We will now use the bilingual word list downloaded from MUSE to evaluate the ability of our newly created aligned embedding space to translate words from English to German.

A function that takes an English word as input and outputs its nearest neighbors in the German vector space is already provided for your convenience.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def get_nn(word, aligned_en_embeddings, de_embeddings, top_k):
    print("Nearest neighbors of \"%s\":" % word)
    en_vec = aligned_en_embeddings[word]
    de_words = list(de_embeddings.keys())
    de_vecs = np.array(list(de_embeddings.values()))

    # Compute cosine similarity between the English word vector and all German word vectors
    en_vec = en_vec / np.linalg.norm(en_vec)
    de_vecs_norm = de_vecs / np.linalg.norm(de_vecs, axis=1, keepdims=True)
    similarities = cosine_similarity([en_vec], de_vecs_norm).flatten()

    # Get top_k most similar German words
    nearest_idxs = similarities.argsort()[-top_k:][::-1]
    nearest_words = [de_words[i] for i in nearest_idxs]

    return nearest_words

en_word = 'the'
nearest_neighbors = get_nn(en_word, aligned_en_embeddings, de_embeddings, 5)
print(nearest_neighbors)

👋 ⚒ Use the already downloaded bilingual word list to evaluate how well our aligned vector space translates from English to German. The output of this task should be the **accuracy** calculated on **1000 words** from the word list, i.e., for how many of the first 1000 English words the five nearest German neighbors contain a German translation from the MUSE word list.

Use the provided function `get_nn` to obtain the *k* nearest words in the vector space in German, given an English input word.


In [None]:
# Your code here
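One possible way to approach this evaluation is sketched below on toy data. It is not the reference solution: the function name `precision_at_k`, its parameters, and the toy embeddings are assumptions made for illustration. The sketch counts a source word as correct if any of its gold translations appears among the `k` nearest target words.

```python
from collections import defaultdict
import numpy as np

def precision_at_k(pairs, src_emb, tgt_emb, k=5, n_words=1000):
    """Share of source words whose k nearest target words contain a gold translation."""
    # Collect all gold translations per source word (a word can have several)
    gold = defaultdict(set)
    for src, tgt in pairs:
        if src in src_emb:
            gold[src].add(tgt)
    test_words = list(gold)[:n_words]  # first n_words distinct source words

    tgt_words = list(tgt_emb)
    tgt_matrix = np.array([tgt_emb[w] for w in tgt_words])
    tgt_matrix = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)

    src_matrix = np.array([src_emb[w] for w in test_words])
    src_matrix = src_matrix / np.linalg.norm(src_matrix, axis=1, keepdims=True)

    # Cosine similarities of every test word against every target word
    sims = src_matrix @ tgt_matrix.T
    top_idx = np.argsort(-sims, axis=1)[:, :k]

    hits = sum(
        any(tgt_words[j] in gold[w] for j in row)
        for w, row in zip(test_words, top_idx)
    )
    return hits / len(test_words)

# Hypothetical toy data: "dog" has the correct gold entry, "cat" a wrong one
toy_src = {"dog": np.array([1.0, 0.0]), "cat": np.array([0.0, 1.0])}
toy_tgt = {
    "Hund": np.array([0.9, 0.1]),
    "Katze": np.array([0.1, 0.9]),
    "Haus": np.array([-1.0, 0.0]),
}
toy_pairs = [("dog", "Hund"), ("cat", "Haus")]

score = precision_at_k(toy_pairs, toy_src, toy_tgt, k=1)
print(score)
```

With the full data, the similarity matrix has 1000 x 100,000 entries, so processing the test words in smaller batches may be necessary if memory is tight.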

### Cross-Lingual Analogy Completion

An analogy compares two related pairs of words, e.g. *man is to woman as king is to queen*. This task can be extended to use analogies for translation, e.g. *man is to woman as Mann is to Frau*.


👋 ⚒ Create **twenty** examples of crosslingual analogies and see whether the aligned vector space is able to correctly complete analogies across languages, e.g. positive=(queen, König), negative=(king). You can use examples from the analogy text file in GitHub for this purpose.

Hints:


*   Multilingual Analogies: To create the examples, all you need is a translation of an existing analogy. You can use the already loaded bilingual word list to obtain the translations and the existing analogy list (anlogies.txt on GitHub) to obtain analogies.
*   Implementation: In the code below, you only need to change the embeddings to the German embeddings for `c` and provide the function with the German embeddings.
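The idea behind a cross-lingual analogy can be sketched on toy data as follows. This is a simplified variant that computes `b - a + c` directly and excludes the input word from the candidates, so it is not identical to the provided `get_target_words`; the function `complete_analogy` and the three-dimensional toy vectors are hypothetical.

```python
import numpy as np

def unit(vec):
    """Scale a vector to unit length."""
    return vec / np.linalg.norm(vec)

def complete_analogy(tgt_emb, vec_a, vec_b, vec_c, exclude=(), top_k=1):
    """Return the top_k target-language words closest to b - a + c."""
    target = unit(unit(vec_b) - unit(vec_a) + unit(vec_c))
    words = [w for w in tgt_emb if w not in exclude]
    matrix = np.array([unit(tgt_emb[w]) for w in words])
    sims = matrix @ target  # cosine similarity, since all vectors have unit length
    best = np.argsort(-sims)[:top_k]
    return [words[i] for i in best]

# Hypothetical toy spaces with the dimensions (royal, female, constant)
toy_en = {"king": np.array([1.0, 0.0, 1.0]), "queen": np.array([1.0, 1.0, 1.0])}
toy_de = {
    "König": np.array([1.0, 0.0, 1.0]),
    "Königin": np.array([1.0, 1.0, 1.0]),
    "Mann": np.array([0.0, 0.0, 1.0]),
    "Frau": np.array([0.0, 1.0, 1.0]),
}

# king is to queen as König is to ?
result = complete_analogy(
    toy_de, toy_en["king"], toy_en["queen"], toy_de["König"], exclude={"König"}
)
print(result)
```

In the notebook, `a` and `b` would come from `aligned_en_embeddings` and `c` from `de_embeddings`, with the German embeddings as the search space.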

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def norm(vec):
  return vec / np.linalg.norm(vec)

def get_target_words(embeddings, vec_a, vec_b, vec_c, top_k):
    words = list(embeddings.keys())
    vecs = np.array(list(embeddings.values()))

    # Compute the analogy vector (woman + king - man): b + c is normalized first, then a is subtracted
    positive = norm(vec_b+vec_c)
    target_vec = norm(positive - vec_a)
    vecs_norm = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    similarities = cosine_similarity([target_vec], vecs_norm).flatten()

    # Get top_k most similar words for the retrieved result vector d
    nearest_idxs = similarities.argsort()[-top_k:][::-1]
    nearest_words = [words[i] for i in nearest_idxs]

    return nearest_words

vec_a = norm(aligned_en_embeddings["man"])
vec_b = norm(aligned_en_embeddings["woman"])
vec_c = norm(aligned_en_embeddings["king"])

nearest_neighbors = get_target_words(aligned_en_embeddings, vec_a, vec_b, vec_c, 5)
print(nearest_neighbors)

## **Another Language Pair**

I have chosen English and Italian, since a bilingual dictionary for this pair is provided by the MUSE project.

I decided to try this with another of my foreign languages, although I am not very fluent in it.

In [16]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz  # English
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.it.300.vec.gz  # Italian

--2025-01-06 21:42:57--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 3.163.189.51, 3.163.189.96, 3.163.189.14, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|3.163.189.51|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1325960915 (1.2G) [binary/octet-stream]
Saving to: ‘cc.en.300.vec.gz.1’


2025-01-06 21:43:06 (138 MB/s) - ‘cc.en.300.vec.gz.1’ saved [1325960915/1325960915]

--2025-01-06 21:43:06--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ru.300.vec.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 3.163.189.51, 3.163.189.96, 3.163.189.14, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|3.163.189.51|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1306357571 (1.2G) [binary/octet-stream]
Saving to: ‘cc.ru.300.vec.gz’


2025-01-06 21:43:36 (42.1 MB/s) - ‘cc.ru.300.vec.gz’ saved [13063

In [None]:
import gzip
import numpy as np

def load_fasttext_embeddings(file_path, top_n):
    embeddings = {}
    with gzip.open(file_path, 'rb') as f:
        for i, line in enumerate(f):
            # Line 0 is a header line
            if i > 0 and i <= top_n:
              tokens = line.decode('utf-8').strip().split(' ')
              word = tokens[0]
              vector = np.array(tokens[1:], dtype=np.float32)
              vector = vector / np.linalg.norm(vector)
              embeddings[word] = vector
    return embeddings

en_embeddings = load_fasttext_embeddings('cc.en.300.vec.gz', 100000)
it_embeddings = load_fasttext_embeddings('cc.it.300.vec.gz', 100000)

print(f"Loaded {len(en_embeddings)} English embeddings")
print(f"Loaded {len(it_embeddings)} Italian embeddings")

In [None]:
!wget https://dl.fbaipublicfiles.com/arrival/dictionaries/en-it.txt

In [None]:
def load_bilingual_word_list(file_path):
    bilingual_dict = []
    with open(file_path, 'r', encoding='utf-8') as f:
      for line in f:
          tokens = line.strip().split()  # split on any whitespace (space or tab)
          en_word = tokens[0]
          it_word = tokens[1]
          word_pair = (en_word, it_word)
          bilingual_dict.append(word_pair)
    return bilingual_dict

en_it_pairs = load_bilingual_word_list('en-it.txt')

print(en_it_pairs[:10])