# Cross-Lingual Word Embedding Alignment

## Step 01: Data Preparation

## 1-a: Pre-trained FastText embeddings

### Download pre-trained FastText embeddings

Pre-trained monolingual FastText word embedding models for English (`cc.en.300.bin.gz`) and Hindi (`cc.hi.300.bin.gz`) have been downloaded from [FastText](https://fasttext.cc/docs/en/crawl-vectors.html). These models, provided by Facebook AI, contain 300-dimensional word vectors trained on Common Crawl data. After downloading, the `.gz` files are decompressed using `gunzip` to obtain the binary `.bin` files for further use.

In [None]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz

In [None]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.hi.300.bin.gz

--2025-04-04 12:27:28--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.hi.300.bin.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 108.158.20.120, 108.158.20.43, 108.158.20.111, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|108.158.20.120|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4371554972 (4.1G) [application/octet-stream]
Saving to: ‘cc.hi.300.bin.gz’


2025-04-04 12:30:43 (21.5 MB/s) - ‘cc.hi.300.bin.gz’ saved [4371554972/4371554972]



In [None]:
!gunzip cc.en.300.bin.gz
!gunzip cc.hi.300.bin.gz
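Where `gunzip` is not available (for example on a Windows machine), the same decompression can be done from Python. This is a minimal sketch using only the standard library; `gunzip_file` is a hypothetical helper name, not part of the notebook.

```python
import gzip
import shutil

def gunzip_file(gz_path, out_path, chunk_size=1024 * 1024):
    """Decompress gz_path into out_path, streaming in chunks (hypothetical helper)."""
    # Streaming keeps memory use flat even for multi-gigabyte model files
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst, length=chunk_size)

# Example call (paths as used in this notebook):
# gunzip_file("cc.en.300.bin.gz", "cc.en.300.bin")
```

Unlike the `gunzip` command, this sketch leaves the original `.gz` file in place, so it needs enough disk space for both copies.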

### Install Required Libraries

`fasttext`: For loading and working with pre-trained FastText word embedding models.

`torch`, `torchvision`, and `torchaudio`: Libraries from the PyTorch ecosystem, used here for GPU-accelerated tensor computation. The installation uses the CUDA 11.8-compatible wheel index.

`scikit-learn`: Provides tools for evaluation metrics and nearest neighbor search.

`scipy`: Offers efficient linear algebra utilities.

`matplotlib`: Used for visualizing aligned embeddings and similarity graphs.

In [None]:
!pip install fasttext
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Collecting fasttext
  Downloading fasttext-0.9.3.tar.gz (73 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.13.6-py3-none-any.whl.metadata (9.5 kB)
Using cached pybind11-2.13.6-py3-none-any.whl (243 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (pyproject.toml) ... done
  Created wheel for fasttext: filename=fasttext-0.9.3-cp310-cp310-linux_x86_64.whl size=4296259 sha256=e2d3767aaf115e28f77b9e935910af8badc36d92b557a6b8f3acc2eaac09eedb
  Stored in directory: /root/.cache/pip/wheels/0d/a2/00/81db54d3e6a8199b829d58e02cec2ddb20ce3e59fad8d3c92a
Successfully built fasttext
Installing collected packages: pybind11, fasttext
Successfully installed fasttext-0.9.3 pybind11-2.13.6
Looking in indexes: https://download.pytorch.org/whl/cu118

In [None]:
!pip3 install -U scikit-learn scipy matplotlib

Collecting scikit-learn
  Using cached scikit_learn-1.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting scipy
  Using cached scipy-1.15.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting matplotlib
  Using cached matplotlib-3.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Using cached joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Using cached threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Collecting contourpy>=1.0.1 (from matplotlib)
  Using cached contourpy-1.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.4 kB)
Collecting cycler>=0.10 (from matplotlib)
  Using cached cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)
Collecting fonttools>=4.22.0 (from matplotlib)
  Downloading fonttools-4.57.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014

### Load and Use Pre-trained FastText Model

Now we load the pre-trained FastText English word embedding model (`cc.en.300.bin`) and retrieve the 300-dimensional vector for the word "king". FastText can generate embeddings even for out-of-vocabulary (OOV) words using subword information. The execution time is also measured. Because the timer spans both the model load and the vector lookup, the reported 3.9783 seconds is dominated by loading the model from disk.

In [None]:
import fasttext
import time

# Start timer
start_time = time.time()

# Load pre-trained binary model
model_en = fasttext.load_model("cc.en.300.bin")

# Get vector for a word (works even for OOV words!)
vec = model_en.get_word_vector("king")

# End timer
end_time = time.time()

# Print results
print(vec)
print("Vector length:", len(vec))
print(f"Execution time: {end_time - start_time:.4f} seconds")

[-2.63642855e-02 -4.38338369e-02 -5.22461310e-02  2.49765869e-02
  1.59946546e-01  4.98980191e-03  2.51637166e-03 -1.62712112e-02
 -6.62135556e-02 -1.67888845e-03 -1.39499649e-01 -5.72493225e-02
 -1.45975351e-01 -1.56568401e-02  3.75731173e-03  8.14326331e-02
  9.02080238e-02 -6.22668210e-03 -1.21208653e-01  8.42568502e-02
  6.83858395e-02  1.01658493e-01 -5.07243127e-02  9.16049480e-02
  5.08386921e-03  6.28780201e-02  5.67676872e-02  1.91132650e-01
  4.35085818e-02  1.80901110e-01 -1.74744725e-02  7.06654340e-02
 -6.06337450e-02  3.89074199e-02  1.44602428e-03 -1.25214964e-01
  8.63592885e-03 -7.98915625e-02 -1.00960366e-01  4.66771051e-02
  5.39167747e-02  4.82006092e-03 -2.03307956e-01 -1.17739499e-01
 -1.37199834e-01 -4.92817685e-02 -1.87217459e-01 -7.17959851e-02
 -1.86646730e-02 -9.93231237e-02 -5.15213236e-02 -1.93316743e-01
 -8.94939303e-02 -1.71539113e-01 -1.03669807e-01 -7.04649240e-02
  1.29511207e-01  5.56146055e-02 -4.56965044e-02 -7.34248012e-03
  6.97860867e-02  1.69947
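To make the OOV behaviour more concrete, the sketch below lists the character n-grams of a word after wrapping it in `<` and `>` boundary markers. It is a simplified illustration, not FastText's implementation: the real library also hashes the n-grams into buckets and combines their vectors. The helper name `char_ngrams` is hypothetical, and the `minn`/`maxn` values are illustrative rather than the settings of the downloaded models.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word`, including boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        # Slide a window of width n over the wrapped word
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

print(char_ngrams("king", minn=3, maxn=3))
```

An unseen word such as "kingly" shares several of these n-grams with "king", which is why a subword model can still place it near related words.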

## 1-b: Limiting vocabulary

### Limit vocabulary to the top 100,000 most frequent words in each language: Prepare FastText Embedding Matrix

Here, we extract the top 100,000 most frequent English words from the FastText model and retrieve their corresponding 300-dimensional word vectors. These vectors are then converted into a PyTorch tensor (`embedding_matrix_en`) and moved to the GPU for efficient computation.

In [None]:
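import numpy as np
import torch

# get_words() returns the vocabulary ordered by frequency, most frequent first
top_100k_words_en = model_en.get_words()[:100000]
vectors_en = [model_en.get_word_vector(w) for w in top_100k_words_en]
# Stack into a single NumPy array first; building a tensor from a list of arrays is slow
embedding_matrix_en = torch.tensor(np.array(vectors_en), dtype=torch.float32).cuda()  # Move to GPU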



Next, we retrieve the 300-dimensional FastText vector for the word `"king"`, convert it into a PyTorch tensor, and move it to the GPU. With the model already in memory, the measured execution time is 0.0028 seconds, compared with 3.9783 seconds earlier, when the timer also included loading the model.

In [None]:
import time

# Start timer
start_time = time.time()

query_word = "king"
query_vec = torch.tensor(model_en.get_word_vector(query_word), dtype=torch.float32).cuda()
query_vec

end_time = time.time()

# Print results
print(query_vec)
print("Vector length:", len(query_vec))
print(f"Execution time: {end_time - start_time:.4f} seconds")

tensor([-2.6364e-02, -4.3834e-02, -5.2246e-02,  2.4977e-02,  1.5995e-01,
         4.9898e-03,  2.5164e-03, -1.6271e-02, -6.6214e-02, -1.6789e-03,
        -1.3950e-01, -5.7249e-02, -1.4598e-01, -1.5657e-02,  3.7573e-03,
         8.1433e-02,  9.0208e-02, -6.2267e-03, -1.2121e-01,  8.4257e-02,
         6.8386e-02,  1.0166e-01, -5.0724e-02,  9.1605e-02,  5.0839e-03,
         6.2878e-02,  5.6768e-02,  1.9113e-01,  4.3509e-02,  1.8090e-01,
        -1.7474e-02,  7.0665e-02, -6.0634e-02,  3.8907e-02,  1.4460e-03,
        -1.2521e-01,  8.6359e-03, -7.9892e-02, -1.0096e-01,  4.6677e-02,
         5.3917e-02,  4.8201e-03, -2.0331e-01, -1.1774e-01, -1.3720e-01,
        -4.9282e-02, -1.8722e-01, -7.1796e-02, -1.8665e-02, -9.9323e-02,
        -5.1521e-02, -1.9332e-01, -8.9494e-02, -1.7154e-01, -1.0367e-01,
        -7.0465e-02,  1.2951e-01,  5.5615e-02, -4.5697e-02, -7.3425e-03,
         6.9786e-02,  1.6995e-01,  3.1033e-02,  5.9152e-02, -8.9157e-02,
         1.0405e-01, -5.0823e-02,  1.5698e-01,  1.3

Similarly, we extract the top 100,000 most frequent Hindi words from the FastText model and retrieve their corresponding 300-dimensional word vectors. These vectors are then converted into a PyTorch tensor (`embedding_matrix_hi`) and moved to the GPU for efficient computation. We then retrieve the 300-dimensional FastText vector for the word `"राजा"`, convert it into a PyTorch tensor, and move it to the GPU. The measured execution time is 0.0014 seconds.

In [None]:
import numpy as np

model_hi = fasttext.load_model("cc.hi.300.bin")
top_100k_words_hi = model_hi.get_words()[:100000]

vectors_hi = [model_hi.get_word_vector(w) for w in top_100k_words_hi]
# Stack into a single NumPy array first; building a tensor from a list of arrays is slow
embedding_matrix_hi = torch.tensor(np.array(vectors_hi), dtype=torch.float32).cuda()  # Move to GPU

In [None]:
import time

# Start timer
start_time = time.time()

query_word = "राजा"
query_vec = torch.tensor(model_hi.get_word_vector(query_word), dtype=torch.float32).cuda()
query_vec

end_time = time.time()

# Print results
print(query_vec)
print("Vector length:", len(query_vec))
print(f"Execution time: {end_time - start_time:.4f} seconds")

tensor([-0.0381,  0.0658, -0.0262, -0.0013,  0.0302,  0.0290, -0.0197, -0.0188,
         0.0217, -0.0263,  0.0129, -0.0857,  0.0192,  0.0464,  0.0132, -0.0430,
         0.0247, -0.0350,  0.0430,  0.0828,  0.0459,  0.0189,  0.0074, -0.0555,
        -0.0606,  0.0246,  0.0334, -0.0684, -0.0458, -0.0366,  0.0253,  0.0053,
         0.0383,  0.0292, -0.0343,  0.0736,  0.0115, -0.0532,  0.0301,  0.0354,
         0.0369,  0.0095,  0.0501, -0.0206, -0.0202,  0.1286, -0.0876,  0.0126,
         0.0399, -0.0640, -0.0162, -0.0055,  0.0123,  0.0489, -0.0661, -0.0215,
         0.0391,  0.0033, -0.0087,  0.0350, -0.0118, -0.1038, -0.0528,  0.0834,
        -0.0562, -0.0765, -0.0211,  0.0343,  0.0010, -0.0471,  0.0162,  0.0221,
         0.0354,  0.0382,  0.1104,  0.0478, -0.0202, -0.0301,  0.0247, -0.0402,
         0.0616, -0.0026, -0.0198, -0.0566, -0.0233, -0.0236, -0.0028, -0.0192,
         0.0483, -0.1813, -0.0385,  0.0091,  0.0518,  0.1070, -0.0480, -0.0135,
         0.0309,  0.0436, -0.0243, -0.04

### Find Top-N Similar Words Using Cosine Similarity on GPU
The function `find_similar_gpu` retrieves the top-N (here, N=5) most similar words to a given query word (e.g., `"king"`, `"राजा"`) using cosine similarity. It works as follows:

1. Checks if the query word exists in the top-100k vocabulary.
2. Computes the vector for the query word and moves it to the GPU.
3. Calculates cosine similarity between the query vector and all vectors in the embedding matrix (on GPU).
4. Returns the top-N most similar words, excluding the query word itself.

Because the similarity against all 100,000 vectors is computed in a single batched GPU operation, retrieval is fast.
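The same ranking logic can be tried on a toy vocabulary without a GPU. The following is a CPU-only sketch in NumPy, with made-up 2-dimensional vectors and a hypothetical helper name `top_n_cosine`. It is meant only to show the steps, not to replace the GPU function.

```python
import numpy as np

def top_n_cosine(query_vec, matrix, words, query_word, top_n=2):
    """Rank `words` by cosine similarity to `query_vec`, skipping `query_word`."""
    # Normalise so that a dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = m @ q
    order = np.argsort(-sims)  # indices sorted by descending similarity
    ranked = [(words[i], float(sims[i])) for i in order if words[i] != query_word]
    return ranked[:top_n]

# Toy data (illustrative values, not real embeddings)
words = ["king", "queen", "apple", "prince"]
matrix = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.3]])
print(top_n_cosine(matrix[0], matrix, words, "king"))
```

With these toy vectors, "queen" and "prince" rank above "apple" because their directions are closest to that of "king".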

In [None]:
def find_similar_gpu(query_word, top_words, embedding_matrix, model, top_n=5):
    if query_word not in top_words:
        print(f"{query_word} not in top-k vocab.")
        return []

    # Look up the query vector on the CPU (small cost), then move it to the GPU
    query_vec = torch.tensor(model.get_word_vector(query_word), dtype=torch.float32).cuda()

    # Compute cosine similarity against every row of the embedding matrix (GPU)
    sim = torch.nn.functional.cosine_similarity(query_vec.unsqueeze(0), embedding_matrix, dim=1)

    # Take top_n + 1 candidates so the query word itself can be dropped
    topk = sim.topk(top_n + 1)
    similar = [(top_words[i], sim[i].item()) for i in topk.indices.tolist() if top_words[i] != query_word]

    return similar[:top_n]

print(find_similar_gpu("king", top_100k_words_en, embedding_matrix_en, model_en, top_n=5))

[('kings', 0.7550358176231384), ('queen', 0.7068519592285156), ('King', 0.6591265201568604), ('prince', 0.6495252847671509), ('monarch', 0.6183921098709106)]


In [None]:
def find_similar_gpu(query_word, top_words, embedding_matrix, model, top_n=5):
    if query_word not in top_words:
        print(f"{query_word} not in top-k vocab.")
        return []

    # Look up the query vector on the CPU (small cost), then move it to the GPU
    query_vec = torch.tensor(model.get_word_vector(query_word), dtype=torch.float32).cuda()

    # Compute cosine similarity against every row of the embedding matrix (GPU)
    sim = torch.nn.functional.cosine_similarity(query_vec.unsqueeze(0), embedding_matrix, dim=1)

    # Take top_n + 1 candidates so the query word itself can be dropped
    topk = sim.topk(top_n + 1)
    similar = [(top_words[i], sim[i].item()) for i in topk.indices.tolist() if top_words[i] != query_word]

    return similar[:top_n]

print(find_similar_gpu("राजा", top_100k_words_hi, embedding_matrix_hi, model_hi, top_n=5))

[('प्रजा', 0.5774529576301575), ('राजाओं', 0.57224041223526), ('महाराजा', 0.5474495887756348), ('महाराजाओं', 0.5429219603538513), ('रानी', 0.5357876420021057)]


## 1-c: English-Hindi Bilingual Lexicon

### Load English-Hindi Bilingual Lexicon

Now we download a pre-compiled bilingual dictionary (`en-hi.txt`) containing 38,221 English-Hindi word pairs from the [MUSE project](https://github.com/facebookresearch/MUSE). The `load_bilingual_lexicon` function reads the file line by line and stores each English-Hindi word pair as a tuple in a list.

In [None]:
!wget https://dl.fbaipublicfiles.com/arrival/dictionaries/en-hi.txt

--2025-04-04 12:38:34--  https://dl.fbaipublicfiles.com/arrival/dictionaries/en-hi.txt
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 108.158.20.21, 108.158.20.43, 108.158.20.120, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|108.158.20.21|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 930856 (909K) [text/x-c++]
Saving to: ‘en-hi.txt’


2025-04-04 12:38:35 (1.28 MB/s) - ‘en-hi.txt’ saved [930856/930856]



In [None]:
def load_bilingual_lexicon(file_path):
    lexicon = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            en_word, hi_word = line.strip().split()
            lexicon.append((en_word, hi_word))
    return lexicon

bilingual_lexicon = load_bilingual_lexicon("en-hi.txt")
bilingual_lexicon

[('and', 'और'),
 ('was', 'था'),
 ('was', 'थी'),
 ('for', 'लिये'),
 ('that', 'उस'),
 ('that', 'कि'),
 ('with', 'साथ'),
 ('from', 'से'),
 ('from', 'इससे'),
 ('this', 'ये'),
 ('this', 'यह'),
 ('this', 'इस'),
 ('utc', 'यूटीसी'),
 ('utc', 'utc'),
 ('his', 'उसकी'),
 ('his', 'उसका'),
 ('his', 'उसके'),
 ('not', 'नही'),
 ('not', 'नहीं'),
 ('are', 'हैं'),
 ('talk', 'बात'),
 ('which', 'जिससे'),
 ('also', 'भी'),
 ('has', 'रै'),
 ('were', 'यहूद'),
 ('but', 'परन्तु'),
 ('but', 'लेकिन'),
 ('but', 'लेकीन'),
 ('but', 'मगर'),
 ('but', 'लकिन'),
 ('one', 'एक'),
 ('new', 'नया'),
 ('new', 'नई'),
 ('first', 'प्रथम'),
 ('first', 'पहली'),
 ('first', 'पहले'),
 ('first', 'पहला'),
 ('page', 'पृष्ठ'),
 ('page', 'पेज'),
 ('you', 'आपको'),
 ('you', 'आप'),
 ('you', 'तुम'),
 ('they', 'उन्होंने'),
 ('they', 'वे'),
 ('had', 'था'),
 ('article', 'लेख'),
 ('article', 'आलेख'),
 ('who', 'जिसने'),
 ('who', 'कौन'),
 ('who', 'जो'),
 ('all', 'सभी'),
 ('all', 'सब'),
 ('their', 'उनकी'),
 ('their', 'इनकी'),
 ('their', 'उनका'),
 ('th
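As the output shows, one English word can appear with several Hindi translations (for example `was` → `था`, `थी`). When inspecting the lexicon it can help to group the pairs by source word. The sketch below does this with a hypothetical helper `group_translations` on a small hand-copied sample of the pairs above.

```python
from collections import defaultdict

def group_translations(pairs):
    """Map each source word to the list of its translations, in file order."""
    grouped = defaultdict(list)
    for en_word, hi_word in pairs:
        # Skip exact duplicate pairs while preserving order
        if hi_word not in grouped[en_word]:
            grouped[en_word].append(hi_word)
    return dict(grouped)

sample_pairs = [("was", "था"), ("was", "थी"), ("and", "और"), ("had", "था")]
grouped = group_translations(sample_pairs)
print(grouped["was"])
print(len(grouped), "distinct English words")
```

The number of distinct source words is smaller than the number of pairs, which matters later when interpreting per-pair accuracy figures.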

## Step 02: Embedding Alignment

### 2-a: Extract Aligned Embedding Pairs from Bilingual Lexicon

The `get_aligned_embeddings` function takes the bilingual lexicon along with the source and target FastText models (English and Hindi in this case) and does the following:

- Retrieves the corresponding word vectors for each English-Hindi word pair.
- Skips any pair for which a lookup raises `KeyError`. FastText `.bin` models build vectors for unseen words from subword information, so in practice this guard is not expected to trigger here.
- Returns two NumPy arrays: one for source (English) embeddings and one for target (Hindi) embeddings.

These aligned embeddings (`X_src` and `Y_tgt`) are typically used for supervised cross-lingual mapping or evaluation tasks.

In [None]:
import numpy as np

def get_aligned_embeddings(lexicon, model_src, model_tgt):
    src_vecs = []
    tgt_vecs = []
    for src_word, tgt_word in lexicon:
        try:
            src_vec = model_src.get_word_vector(src_word)
            tgt_vec = model_tgt.get_word_vector(tgt_word)
            src_vecs.append(src_vec)
            tgt_vecs.append(tgt_vec)
        except KeyError:
            # FastText builds OOV vectors from subwords, so this guard only
            # matters for models that raise on missing words
            continue
    return np.array(src_vecs), np.array(tgt_vecs)

X_src, Y_tgt = get_aligned_embeddings(bilingual_lexicon, model_en, model_hi)

In [None]:
X_src, Y_tgt

(array([[ 0.00823911, -0.08990277,  0.02652529, ..., -0.01159137,
         -0.04112864,  0.03625222],
        [-0.00102113,  0.04602651,  0.00604868, ..., -0.08220465,
          0.01135671,  0.00796232],
        [-0.00102113,  0.04602651,  0.00604868, ..., -0.08220465,
          0.01135671,  0.00796232],
        ...,
        [-0.02027114, -0.02240974, -0.01055593, ...,  0.04405471,
         -0.02558288,  0.00770057],
        [-0.03047501, -0.06205894, -0.03088871, ...,  0.04897757,
          0.01192696, -0.05418908],
        [ 0.04086032, -0.03539041, -0.00380384, ...,  0.02875789,
         -0.02330048, -0.04979837]], dtype=float32),
 array([[ 0.01132702, -0.07065436,  0.01902812, ...,  0.03552558,
         -0.00812045, -0.02838861],
        [-0.0096623 , -0.01804674,  0.06464785, ..., -0.11392715,
         -0.09173848, -0.01580131],
        [-0.04787989, -0.08300597, -0.04615875, ..., -0.00028435,
          0.15839534,  0.0260338 ],
        ...,
        [-0.03423707,  0.03619131, -0.0

### 2-b: Procrustes Alignment (Orthogonal Mapping):

Now, we implement the **Procrustes alignment** technique to learn a linear mapping from the source (English) embedding space to the target (Hindi) space using the aligned word pairs from the bilingual lexicon.

Given two aligned embedding matrices:

- $X \in \mathbb{R}^{n \times d}$: Source embeddings (English)  
- $Y \in \mathbb{R}^{n \times d}$: Target embeddings (Hindi)

We aim to find an **orthogonal matrix** $W \in \mathbb{R}^{d \times d}$ that minimizes the Frobenius norm:

$$ \min_W \| XW - Y \|_F \quad \text{subject to} \quad W^\top W = I $$

This ensures that distances and angles between vectors are preserved during the transformation.

In [None]:
def procrustes_alignment(X, Y):
    """
    Solves for the orthogonal matrix W that best maps X to Y (minimizing ||XW - Y||).
    """
    # Centering is optional; FastText embeddings are usually zero-centered enough
    # X -= X.mean(axis=0)
    # Y -= Y.mean(axis=0)

    # Compute matrix product
    M = X.T @ Y

    # SVD
    U, _, Vt = np.linalg.svd(M)

    # Orthogonal mapping
    W = U @ Vt
    return W

#### Implementation Details

- The matrix $M = X^\top Y$ is computed.
- We perform Singular Value Decomposition (SVD) on $M$:  
  $M = U \Sigma V^\top$
- The optimal orthogonal mapping is then:
  $W = UV^\top$
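A quick way to gain confidence in the SVD solution is to test it on synthetic data where the answer is known. The sketch below (toy dimensions, random data, and a locally defined `procrustes` helper, all assumptions for illustration) generates targets by rotating the sources with a known orthogonal matrix and checks that the same matrix is recovered.

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal W minimising ||XW - Y||_F, via the SVD of X^T Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
d = 5
# A random orthogonal matrix from a QR decomposition (the "true" map in this toy setup)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
X = rng.normal(size=(200, d))
Y = X @ Q  # noise-free targets

W = procrustes(X, Y)
print("recovered true map:", np.allclose(W, Q, atol=1e-8))
print("W is orthogonal:   ", np.allclose(W.T @ W, np.eye(d)))
print("norms preserved:   ", np.allclose(np.linalg.norm(X @ W, axis=1),
                                         np.linalg.norm(X, axis=1)))
```

With real embeddings the targets are not an exact rotation of the sources, so the residual $\| XW - Y \|_F$ stays above zero. The orthogonality and norm-preservation checks should still hold.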

The learned matrix $W$ is used to align all top-100k English word embeddings by:
$\text{Aligned\_English\_Embeddings} = \text{Original\_English\_Embeddings} \cdot W$

The resulting aligned embedding matrix is moved back to GPU as a PyTorch tensor for further similarity computations.

In [None]:
W = procrustes_alignment(X_src, Y_tgt)

# Apply to top-100k English embeddings
aligned_embedding_matrix_en = embedding_matrix_en.cpu().numpy() @ W
aligned_embedding_matrix_en = torch.tensor(aligned_embedding_matrix_en, dtype=torch.float32).cuda()

In [None]:
aligned_embedding_matrix_en

tensor([[ 0.1328, -0.0356,  0.1568,  ...,  0.0843, -0.0254, -0.1301],
        [ 0.0111, -0.0394,  0.0522,  ..., -0.0038,  0.0824, -0.2147],
        [ 0.1528,  0.0331,  0.3174,  ...,  0.0601, -0.2201, -0.1145],
        ...,
        [-0.0345,  0.0214, -0.0257,  ..., -0.0096,  0.0347,  0.0053],
        [-0.0402,  0.0102,  0.0621,  ...,  0.0196,  0.0892, -0.0470],
        [ 0.0410, -0.0254, -0.0205,  ...,  0.0864,  0.0278, -0.0296]],
       device='cuda:0')

## Step 03: Evaluation

## 3-a,b: Perform word translation from English to Hindi using the aligned embeddings and Evaluate on MUSE Test data

### Load MUSE Test dictionary:
Now we download the **English-Hindi test dictionary** (`en-hi.5000-6500.txt`) from the [MUSE](https://github.com/facebookresearch/MUSE) dataset, which contains 2032 bilingual word pairs reserved for evaluation.

The `load_muse_dictionary` function reads each line and stores the aligned word pairs as a list of tuples. This dictionary is commonly used to benchmark the quality of learned cross-lingual embedding mappings.

In [None]:
!wget https://dl.fbaipublicfiles.com/arrival/dictionaries/en-hi.5000-6500.txt

--2025-04-04 12:55:49--  https://dl.fbaipublicfiles.com/arrival/dictionaries/en-hi.5000-6500.txt
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 108.158.20.43, 108.158.20.111, 108.158.20.21, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|108.158.20.43|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 52464 (51K) [text/plain]
Saving to: ‘en-hi.5000-6500.txt’


2025-04-04 12:55:50 (362 KB/s) - ‘en-hi.5000-6500.txt’ saved [52464/52464]



In [None]:
def load_muse_dictionary(file_path):
    word_pairs = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            src, tgt = line.strip().split()
            word_pairs.append((src, tgt))
    return word_pairs

muse_test_dict = load_muse_dictionary("en-hi.5000-6500.txt")

### Normalize Embeddings and Prepare Lookup Dictionaries

To evaluate translation quality using cosine similarity, this step performs the following:

- **Normalization**:  
  Both the aligned English embeddings and original Hindi embeddings are L2-normalized to ensure cosine similarity is equivalent to dot product.

- **Index Mappings**:  
  - `word2idx_en`: Maps English words to their index in the top-100k vocabulary.  
  - `word2idx_hi`: Maps Hindi words to their index.  
  - `idx2word_hi`: Reverse lookup for Hindi word indices, useful for retrieving predicted translations.

These pre-processed structures are essential for fast and accurate nearest neighbor search during evaluation.
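The equivalence between the dot product of L2-normalized vectors and cosine similarity can be checked numerically. This is a small NumPy sketch with made-up vectors; `l2_normalize` is a hypothetical stand-in for the notebook's `normalize_embeddings`.

```python
import numpy as np

def l2_normalize(matrix):
    """Scale every row to unit length."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / norms

a = np.array([[3.0, 4.0]])
b = np.array([[4.0, 3.0], [-3.0, -4.0]])

# Dot product after normalisation
dot_of_normalized = l2_normalize(a) @ l2_normalize(b).T
# Cosine similarity computed directly from the definition
cosine = (a @ b.T) / (np.linalg.norm(a, axis=1, keepdims=True) * np.linalg.norm(b, axis=1))

print(dot_of_normalized)
print(np.allclose(dot_of_normalized, cosine))
```

One caveat: a row with zero norm would cause a division by zero, so an all-zero embedding row would need special handling.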

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Normalize both embeddings for cosine similarity
def normalize_embeddings(embeddings):
    norms = embeddings.norm(dim=1, keepdim=True)
    return embeddings / norms

normalized_en = normalize_embeddings(aligned_embedding_matrix_en)
normalized_hi = normalize_embeddings(embedding_matrix_hi)

In [None]:
word2idx_en = {word: idx for idx, word in enumerate(top_100k_words_en)}
word2idx_hi = {word: idx for idx, word in enumerate(top_100k_words_hi)}
idx2word_hi = {idx: word for word, idx in word2idx_hi.items()}

## 3-c: Evaluate Cross-Lingual Translation Quality

This function evaluates the **word translation accuracy** using the aligned English embeddings and the original Hindi embeddings on the MUSE test set.

#### Method: `evaluate_translation(..)` function

- For each word pair `(en_word, hi_word)` in the test dictionary:
  - Retrieve the L2-normalized English vector and compute cosine similarity with all Hindi embeddings using a dot product.
  - Retrieve the **Top-K most similar Hindi words**.
  - Count how often the correct Hindi translation is among the top-K predictions.

#### Metrics Computed:
- **Precision@1**: Fraction of test pairs where the correct translation is ranked **first**.
- **Precision@5**: Fraction where it appears in the **top 5** predictions.
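Because the test dictionary can list several Hindi translations for one English word, counting every pair separately means such a word can score at most one hit at k=1. A common variant scores each source word once and counts it as correct if any of its listed translations appears in the top k. The sketch below illustrates that variant on toy data. It is not the function used in this notebook, and the name `precision_at_k_by_word` and the prediction dictionary are assumptions.

```python
from collections import defaultdict

def precision_at_k_by_word(test_pairs, ranked_predictions, k):
    """Fraction of source words with at least one gold translation in the top k."""
    gold = defaultdict(set)
    for src, tgt in test_pairs:
        gold[src].add(tgt)
    # Only score source words for which predictions exist
    scored = [src for src in gold if src in ranked_predictions]
    if not scored:
        return 0.0
    hits = sum(1 for src in scored if gold[src] & set(ranked_predictions[src][:k]))
    return hits / len(scored)

test_pairs = [("first", "प्रथम"), ("first", "पहला"), ("page", "पृष्ठ")]
ranked_predictions = {"first": ["पहला", "पहली"], "page": ["पेज", "पृष्ठ"]}
print(precision_at_k_by_word(test_pairs, ranked_predictions, k=1))
print(precision_at_k_by_word(test_pairs, ranked_predictions, k=2))
```

Scores from the two counting schemes are not directly comparable, so it is worth stating which one is used when reporting results.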


In [None]:
def evaluate_translation(test_dict, top_k=[1, 5]):
    total = 0
    correct_at_k = {k: 0 for k in top_k}
    sim_scores = []

    for en_word, hi_word in test_dict:
        if en_word not in word2idx_en or hi_word not in word2idx_hi:
            continue

        total += 1
        en_idx = word2idx_en[en_word]
        en_vec = normalized_en[en_idx].unsqueeze(0)  # [1, 300]
        sims = torch.matmul(en_vec, normalized_hi.T).squeeze(0)  # [100k]
        topk_indices = torch.topk(sims, max(top_k)).indices.cpu().tolist()
        predictions = [idx2word_hi[i] for i in topk_indices]

        for k in top_k:
            if hi_word in predictions[:k]:
                correct_at_k[k] += 1

        sim_scores.append((en_word, hi_word, sims[word2idx_hi[hi_word]].item()))

    precision_at_k = {f'Precision@{k}': correct_at_k[k] / total for k in top_k}
    return precision_at_k, sim_scores

In [None]:
precision_metrics, similarity_scores = evaluate_translation(muse_test_dict)

print("Translation Precision Metrics:")
for k, v in precision_metrics.items():
    print(f"{k}: {v:.4f}")

Translation Precision Metrics:
Precision@1: 0.3337
Precision@5: 0.5938


## 3-d: Inspect the Translation Pairs and calculate cosine similarity

Now we sort the English-Hindi word pairs from the MUSE test dictionary by their **cosine similarity score**, in descending order.

#### Output:
- Displays the **Top 10 most confidently aligned word pairs** according to the model.
- Helps qualitatively assess the success of the Procrustes alignment by observing semantically strong matches.

In [None]:
similarity_scores.sort(key=lambda x: -x[2])  # sort by descending similarity

# Top 10 most similar pairs
print("\nTop 10 Most Similar Pairs (Cosine Score):")
for en, hi, sim in similarity_scores[:10]:
    print(f"{en} → {hi} | Cosine Similarity: {sim:.4f}")


Top 10 Most Similar Pairs (Cosine Score):
pollution → प्रदूषण | Cosine Similarity: 0.7005
clothes → कपड़े | Cosine Similarity: 0.6994
healthy → स्वस्थ | Cosine Similarity: 0.6879
kilometers → किलोमीटर | Cosine Similarity: 0.6792
visa → वीजा | Cosine Similarity: 0.6735
transparent → पारदर्शी | Cosine Similarity: 0.6733
ideology → विचारधारा | Cosine Similarity: 0.6715
mature → परिपक्व | Cosine Similarity: 0.6682
investments → निवेश | Cosine Similarity: 0.6666
bag → बैग | Cosine Similarity: 0.6655


## 3-e: Ablation Study: Effect of Lexicon Size

We evaluate how the size of the bilingual lexicon impacts alignment quality. Procrustes alignment is trained using subsets of size **5,000**, **10,000**, **20,000**, **25,000**, **30,000**, and **35,000** word pairs from the English-Hindi dictionary.

For each size, we compute **Precision@1** and **Precision@5** on the fixed MUSE test set.
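Whether `en-hi.txt` shares entries with the test file is not checked in this notebook. If it does, the scores would be optimistic, because the mapping would have been fitted on pairs it is later evaluated on. A minimal sketch of a guard, with the hypothetical helper name `remove_test_overlap` and toy pairs:

```python
def remove_test_overlap(train_pairs, test_pairs):
    """Drop training pairs whose source word also occurs in the test set."""
    test_sources = {src for src, _ in test_pairs}
    return [(src, tgt) for src, tgt in train_pairs if src not in test_sources]

train_pairs = [("and", "और"), ("visa", "वीजा"), ("bag", "बैग")]
test_pairs = [("visa", "वीजा")]
print(remove_test_overlap(train_pairs, test_pairs))
```

Filtering on the source word rather than on the exact pair is the stricter choice: it also removes alternative translations of a test word from the training data.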

In [None]:
def evaluate_translation(
    test_dict,
    top_k,
    normalized_en,
    normalized_hi,
    word2idx_en,
    word2idx_hi,
    idx2word_hi
):
    total = 0
    correct_at_k = {k: 0 for k in top_k}
    sim_scores = []

    for en_word, hi_word in test_dict:
        if en_word not in word2idx_en or hi_word not in word2idx_hi:
            continue

        total += 1
        en_idx = word2idx_en[en_word]
        en_vec = normalized_en[en_idx].unsqueeze(0)  # [1, 300]
        sims = torch.matmul(en_vec, normalized_hi.T).squeeze(0)  # [Vocab_size]
        topk_indices = torch.topk(sims, max(top_k)).indices.cpu().tolist()
        predictions = [idx2word_hi[i] for i in topk_indices]

        for k in top_k:
            if hi_word in predictions[:k]:
                correct_at_k[k] += 1

        sim_scores.append((en_word, hi_word, sims[word2idx_hi[hi_word]].item()))

    precision_at_k = {f'Precision@{k}': correct_at_k[k] / total for k in top_k}
    return precision_at_k, sim_scores


def run_ablation_experiment(lexicon_path, muse_test_path, sizes=[5000, 10000, 20000]):
    results = []

    # Precompute fixed values
    muse_dict = load_muse_dictionary(muse_test_path)
    word2idx_en = {word: idx for idx, word in enumerate(top_100k_words_en)}
    word2idx_hi = {word: idx for idx, word in enumerate(top_100k_words_hi)}
    idx2word_hi = {idx: word for word, idx in word2idx_hi.items()}
    normalized_hi = normalize_embeddings(embedding_matrix_hi)

    for size in sizes:
        # Step 1: Load lexicon subset (first `size` pairs; the loader defined above takes only a path)
        bilingual_lexicon = load_bilingual_lexicon(lexicon_path)[:size]

        # Step 2: Extract aligned embeddings
        X_src, Y_tgt = get_aligned_embeddings(bilingual_lexicon, model_en, model_hi)

        # Step 3: Learn Procrustes mapping
        W = procrustes_alignment(X_src, Y_tgt)

        # Step 4: Align and normalize English embeddings
        aligned_en = embedding_matrix_en.cpu().numpy() @ W
        normalized_en = normalize_embeddings(torch.tensor(aligned_en, dtype=torch.float32).cuda())

        # Step 5: Evaluate
        precision, _ = evaluate_translation(
            muse_dict,
            top_k=[1, 5],
            normalized_en=normalized_en,
            normalized_hi=normalized_hi,
            word2idx_en=word2idx_en,
            word2idx_hi=word2idx_hi,
            idx2word_hi=idx2word_hi
        )

        results.append((size, precision))

    return results


ablation_results = run_ablation_experiment(
    lexicon_path="en-hi.txt",
    muse_test_path="en-hi.5000-6500.txt",
    sizes=[5000, 10000, 20000, 25000, 30000, 35000]
)

for size, metrics in ablation_results:
    print(f"\n--- Running experiment with {size} bilingual pairs ---")
    for k, v in metrics.items():
        print(f"{k}: {v:.4f}")


--- Running experiment with 5000 bilingual pairs ---
Precision@1: 0.2256
Precision@5: 0.4550

--- Running experiment with 10000 bilingual pairs ---
Precision@1: 0.3456
Precision@5: 0.6019

--- Running experiment with 20000 bilingual pairs ---
Precision@1: 0.3538
Precision@5: 0.6162

--- Running experiment with 25000 bilingual pairs ---
Precision@1: 0.3394
Precision@5: 0.6038

--- Running experiment with 30000 bilingual pairs ---
Precision@1: 0.3400
Precision@5: 0.6012

--- Running experiment with 35000 bilingual pairs ---
Precision@1: 0.3319
Precision@5: 0.5994


## Optional Extra Credit Task: unsupervised alignment method such as Cross-Domain Similarity Local Scaling (CSLS) combined with adversarial training

### Convert `.bin` to `.vec` Format for MUSE

The FastText binary models (`.bin`) for English and Hindi are converted to the text-based `.vec` format using `fasttext.load_model()` and saved manually. This format is required for compatibility with the [MUSE](https://github.com/facebookresearch/MUSE) toolkit.

The conversion enables efficient use of **unsupervised alignment methods** such as **adversarial training** combined with **Cross-domain Similarity Local Scaling (CSLS)**.

In [None]:
import fasttext

def convert_bin_to_vec(bin_path, vec_path):
    model = fasttext.load_model(bin_path)
    with open(vec_path, 'w', encoding='utf-8') as f:
        words = model.get_words()
        dim = len(model.get_word_vector(words[0]))
        f.write(f"{len(words)} {dim}\n")
        for w in words:
            vec = model.get_word_vector(w)
            vec_str = " ".join([str(x) for x in vec])
            f.write(f"{w} {vec_str}\n")

convert_bin_to_vec("cc.en.300.bin", "cc.en.300.vec")
convert_bin_to_vec("cc.hi.300.bin", "cc.hi.300.vec")
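To check that a converted file has the expected layout (a header line with the word count and dimension, then one word and its values per line), a small reader can be used. The sketch below parses an in-memory sample rather than the multi-gigabyte files. `read_vec` and its `max_vocab` parameter are hypothetical names for illustration.

```python
import io

def read_vec(handle, max_vocab=None):
    """Parse text-format embeddings: header 'n_words dim', then 'word v1 ... vd' lines."""
    n_words, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        if max_vocab is not None and len(vectors) >= max_vocab:
            break
        parts = line.rstrip("\n").split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    return n_words, dim, vectors

sample = io.StringIO("3 2\nking 0.1 0.2\nqueen 0.3 0.4\nराजा 0.5 0.6\n")
n_words, dim, vectors = read_vec(sample, max_vocab=2)
print(n_words, dim, list(vectors))
```

For a real file, the handle would come from `open(path, encoding="utf-8")`. Reading only the first few thousand lines is usually enough for a sanity check.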

### Set Up Environment and Clone MUSE Repository

1. **Create a Python 3.10 virtual environment** for isolating dependencies.
2. **Activate the virtual environment** (note: each `!` command runs in its own subshell, so `source` has no lasting effect in Colab or Jupyter cells; for the remaining steps it is better to work in a local terminal).
3. **Clone the [MUSE](https://github.com/facebookresearch/MUSE) repository** from Facebook Research.
4. **Install FAISS with GPU support** for efficient nearest neighbor search during alignment.

In [None]:
!python3.10 -m venv muse-venv
!source /workspace/lipsync/LatentSync/expssssss/muse-venv/bin/activate
!git clone https://github.com/facebookresearch/MUSE.git
%cd MUSE
!pip install faiss-gpu

Fix for the `getargspec` issue in MUSE (file to edit: `MUSE/src/utils.py`):

Locate this line (around line 218 in `utils.py`):
`expected_args = inspect.getargspec(optim_fn.__init__)[0]`

Replace it with: `expected_args = list(inspect.signature(optim_fn.__init__).parameters)`
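The replacement can be tried in isolation. The sketch below uses a stand-in class (`DummyOptimizer`, hypothetical) whose constructor has a keyword-only parameter. My assumption is that signatures of this kind are what the older `getargspec` call cannot handle.

```python
import inspect

class DummyOptimizer:
    """Stand-in for an optimizer class (hypothetical, for illustration only)."""
    def __init__(self, params, lr=0.01, *, foreach=None):
        self.params, self.lr, self.foreach = params, lr, foreach

# Same expression as the suggested replacement line, applied to the stand-in class
expected_args = list(inspect.signature(DummyOptimizer.__init__).parameters)
print(expected_args)
```

The resulting list still starts with `self` and `params`, so code that inspects the leading argument names keeps working after the change.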

### Run Unsupervised Embedding Alignment with MUSE

Execute the MUSE `unsupervised.py` script to align English and Hindi word embeddings using adversarial training followed by **refinement iterations** and **Cross-Domain Similarity Local Scaling (CSLS)**.

**Breakdown:**
- `--src_lang`: Source language = English
- `--tgt_lang`: Target language = Hindi
- `--src_emb`: Path to the FastText `.vec` file for English
- `--tgt_emb`: Path to the FastText `.vec` file for Hindi
- `--n_refinement`: Perform 5 rounds of Procrustes refinement after initial adversarial mapping
- `--export "pth"`: Export the aligned embeddings in PyTorch (`.pth`) format after training
- `--dico_eval`: Path to the gold-standard bilingual dictionary used to evaluate the alignment
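
The Procrustes refinement mentioned for `--n_refinement` has a closed-form solution via SVD. The following is a minimal NumPy sketch on toy data, not MUSE's own code: `procrustes` is a hypothetical helper, and the rows of `X` and `Y` stand in for the vectors of source and target words paired by an induced dictionary.

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal W minimizing ||X @ W.T - Y||_F for row-aligned X and Y."""
    # SVD of the cross-covariance between target and source vectors.
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal "ground-truth" map
X = rng.normal(size=(50, 4))                  # toy source vectors
Y = X @ Q.T                                   # toy target vectors, exactly Q applied to X
W = procrustes(X, Y)                          # recovers Q up to floating-point error
```

In an iterative refinement loop, each round would rebuild the word-pair dictionary from the current mapping and then solve this problem again on the new pairs.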

This run uses **unsupervised adversarial training combined with CSLS** to align the two embedding spaces without requiring parallel data. Although training appeared to complete successfully, the evaluation results were unsatisfactory: both **Precision@1** and **Precision@5** were **0.0**.
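
For reference, the CSLS criterion and the Precision@k metric can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not MUSE's implementation: `csls_scores` and `precision_at_k` are hypothetical names, `src` is assumed to hold already-mapped source vectors, and `gold` is assumed to give one correct target index per source row.

```python
import numpy as np

def csls_scores(src, tgt, k=10):
    """CSLS similarity matrix between mapped source rows and target rows."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = src @ tgt.T  # (n_src, n_tgt) cosine similarities
    k_row = min(k, cos.shape[1])
    k_col = min(k, cos.shape[0])
    # Mean similarity of each source word to its k nearest target words.
    r_src = np.sort(cos, axis=1)[:, -k_row:].mean(axis=1, keepdims=True)
    # Mean similarity of each target word to its k nearest source words.
    r_tgt = np.sort(cos, axis=0)[-k_col:, :].mean(axis=0, keepdims=True)
    # Penalize "hub" words that are close to many neighbours.
    return 2 * cos - r_src - r_tgt

def precision_at_k(scores, gold, k=1):
    """Fraction of source rows whose gold target index is among the top-k scores."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = [gold[i] in topk[i] for i in range(len(gold))]
    return float(np.mean(hits))
```

A Precision@1 of 0.0 under this definition means that no source word in the evaluated pairs had its gold translation ranked first.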

To diagnose the issue, I experimented with different test dictionaries and also switched to **FastText embeddings trained on Wikipedia**. The results remained poor, which points to a deeper problem with alignment quality, with compatibility between the embeddings and the evaluation data, or with the environment setup. I believe further debugging and exploration will resolve the issue.
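
One inexpensive way to probe the compatibility hypothesis is to count how many evaluation pairs have both words present in the embedding vocabularies. The sketch below uses hypothetical helper names and toy data; it does not reproduce MUSE's own dictionary loader, and the word sets are assumed to have been collected from the `.vec` files beforehand.

```python
def dictionary_coverage(pairs, src_vocab, tgt_vocab):
    """Return (covered, total): pairs whose words both appear in the vocabularies."""
    covered = sum(1 for s, t in pairs if s in src_vocab and t in tgt_vocab)
    return covered, len(pairs)

# Toy data standing in for the evaluation dictionary and the loaded vocabularies.
pairs = [("cat", "बिल्ली"), ("dog", "कुत्ता"), ("sun", "सूरज")]
src_vocab = {"cat", "dog"}
tgt_vocab = {"बिल्ली", "सूरज"}

covered, total = dictionary_coverage(pairs, src_vocab, tgt_vocab)
print(f"{covered}/{total} dictionary pairs are covered")  # 1/3 dictionary pairs are covered
```

Very low coverage would suggest a vocabulary or tokenization mismatch rather than a failure of the alignment itself.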

In [None]:
!python unsupervised.py --src_lang en --tgt_lang hi --src_emb /data/cc.en.300.vec --tgt_emb /data/cc.hi.300.vec --n_refinement 5 --export "pth" --dico_eval /data/en-hi.5000-6500.txt

INFO - 04/04/25 14:52:20 - 0:00:00 - adversarial: True
                                     batch_size: 32
                                     cuda: True
                                     dico_build: S2T
                                     dico_eval: /workspace/lipsync/LatentSync/expssssss/en-hi.5000-6500.txt
                                     dico_max_rank: 15000
                                     dico_max_size: 0
                                     dico_method: csls_knn_10
                                     dico_min_size: 0
                                     dico_threshold: 0
                                     dis_clip_weights: 0
                                     dis_dropout: 0.0
                                     dis_hid_dim: 2048
                                     dis_input_dropout: 0.1
                                     dis_lambda: 1
                                     dis_layers: 2
                                     dis_most_frequent: 75000
           

In [None]:
!python unsupervised.py --src_lang en --tgt_lang hi --src_emb /data/wiki.en.vec --tgt_emb /data/wiki.hi.vec --n_refinement 5 --export "pth" --dico_eval /data/en-hi.5000-6500.txt

INFO - 04/04/25 15:24:12 - 0:00:00 - adversarial: True
                                     batch_size: 32
                                     cuda: True
                                     dico_build: S2T
                                     dico_eval: /workspace/lipsync/LatentSync/expssssss/en-hi.5000-6500.txt
                                     dico_max_rank: 15000
                                     dico_max_size: 0
                                     dico_method: csls_knn_10
                                     dico_min_size: 0
                                     dico_threshold: 0
                                     dis_clip_weights: 0
                                     dis_dropout: 0.0
                                     dis_hid_dim: 2048
                                     dis_input_dropout: 0.1
                                     dis_lambda: 1
                                     dis_layers: 2
                                     dis_most_frequent: 75000
           

In [None]:
!python unsupervised.py \
  --src_lang en --tgt_lang hi \
  --src_emb /data/wiki.en.vec --tgt_emb /data/wiki.hi.vec \
  --normalize_embeddings center \
  --map_id_init 0 \
  --n_refinement 10 \
  --dico_method csls_knn_10 \
  --dico_build "S2T|T2S" \
  --dico_eval /data/en-hi.5000-6500.txt \
  --dico_min_size 500 --dico_max_size 5000 \
  --dis_layers 3 --dis_hid_dim 1024 --dis_input_dropout 0.2 --dis_steps 10 \
  --export pth

INFO - 04/04/25 16:14:09 - 0:00:00 - adversarial: True
                                     batch_size: 32
                                     cuda: True
                                     dico_build: S2T|T2S
                                     dico_eval: /workspace/lipsync/LatentSync/expssssss/en-hi.5000-6500.txt
                                     dico_max_rank: 15000
                                     dico_max_size: 5000
                                     dico_method: csls_knn_10
                                     dico_min_size: 500
                                     dico_threshold: 0
                                     dis_clip_weights: 0
                                     dis_dropout: 0.0
                                     dis_hid_dim: 1024
                                     dis_input_dropout: 0.2
                                     dis_lambda: 1
                                     dis_layers: 3
                                     dis_most_frequent: 75000
  