<a href="https://colab.research.google.com/github/HDC432/NLP/blob/main/cross_language.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cross-Language Retrieval

In this notebook, you will evaluate models on the task of cross-language retrieval. We will use a sample of the first paragraphs of Wikipedia articles. Sometimes, a Wikipedia article in one language will be a translation of the article in another; in other cases, articles cover the some topic but are not translations. In any case, we use the links between Wikipedia articles in different languages as ground truth for our evaluation.

Since we often want to enrich the context information available to a language model with retrieval results, we will evaluate not only whether the exact matching document ranks highest, but also whether the matching document ranks in the top $k$.

Work through the notebook and complete code and text cells marked **TODO**.

We start by installing the `sentence-transformers` library.

In [None]:
pip install -U sentence-transformers



We then download a sample of the first paragraphs of Wikipedia articles in six languages.

In [None]:
!wget https://raw.githubusercontent.com/dasmiq/cs6120-assignment5/refs/heads/main/sample-6lang.jsonl

--2025-12-08 19:28:35--  https://raw.githubusercontent.com/dasmiq/cs6120-assignment5/refs/heads/main/sample-6lang.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7514418 (7.2M) [text/plain]
Saving to: ‘sample-6lang.jsonl.3’


2025-12-08 19:28:35 (167 MB/s) - ‘sample-6lang.jsonl.3’ saved [7514418/7514418]



In [None]:
import json
articles = []

for line in open('sample-6lang.jsonl', mode='r', encoding='utf-8'):
  rec = json.loads(line)
  articles.append(rec)

len(articles)

11838

We include articles from the three most prevalent Wikipedia languages—English, German, and French—and from three other languages in non-Latin scripts—Chinese, Arabic, and Greek. The dataset includes fields for the `text` of the paragraph, as well as the (lower-cased) `title` of the article and `lang` for the language code. Finally, each record contains the Wikidata `id` used to link related articles in different languages. For convenience, the records have been sorted by `id` and `lang`.

If you read a few of these languages (or translate them), you can look at a set of paragraphs and see that most pairs are not translations of each other.

In [None]:
articles[6:12]

[{'id': 'Q1005289',
  'lang': 'ar',
  'title': 'قانون الجنسية الكندي',
  'url': 'https://ar.wikipedia.org/wiki/%D9%82%D8%A7%D9%86%D9%88%D9%86%20%D8%A7%D9%84%D8%AC%D9%86%D8%B3%D9%8A%D8%A9%20%D8%A7%D9%84%D9%83%D9%86%D8%AF%D9%8A',
  'text': 'قانون الجنسية الكندي، يشار إليها أيضًا بالجنسية الكندية، هو وضع قانوني يمنح الشخص الطبيعي حقوقًا ومسؤوليات محددة في كندا. نشأ في عام ، وصار معلمًا هامًا في عملية استقلال كندا عن المملكة المتحدة مع دخول قانون الجنسية الكندية الأول حيز التنفيذ. تخضع الجنسية الكندية الآن لقانون الجنسية لعام 1977، الذي خضع لعدة تعديلات مهمة منذ دخوله حيز التنفيذ. كما ساهمت المحاكم الفيدرالية، من خلال قانونها القضائي، في توضيح التعريف القانوني للجنسية الكندية.'},
 {'id': 'Q1005289',
  'lang': 'de',
  'title': 'kanadische staatsangehörigkeit',
  'url': 'https://de.wikipedia.org/wiki/Kanadische%20Staatsangeh%C3%B6rigkeit',
  'text': 'Die kanadische Staatsbürgerschaft ( bzw. Canadian Citizenship) ist die Staatsbürgerschaft Kanadas, die im engeren Sinne seit 1947 existiert.'},

We load a sentence embedding model, `LaBSE`, that was trained on several languages, including the six we work with here.

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
labse = SentenceTransformer('sentence-transformers/LaBSE')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/461 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/804 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.88G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/397 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/114 [00:00<?, ?B/s]

2_Dense/model.safetensors:   0%|          | 0.00/2.36M [00:00<?, ?B/s]

To demonstrate finding similar paragraphs, we encode the text of the first twelve records, which gives us a 768-dimensional embedding vector for each one.

In [None]:
encoded = labse.encode([r['text'] for r in articles[0:12]])
encoded.shape

(12, 768)

If we multiply this $12 \times 768$ matrix by its transpose, we get a $12 \times 12$ (symmetric) matrix with the cosine similarity between all pairs of paragraphs. The diagonal entries are, of course, approximately 1. In the first six rows, we can see that the first six columns are higher than the latter six. In the latter six rows, we can see that the latter six columns are higher than the first six.

In [None]:
encoded @ encoded.T

array([[0.99999994, 0.7625692 , 0.77220273, 0.7502579 , 0.5858218 ,
        0.6954645 , 0.3728796 , 0.3235352 , 0.33085042, 0.2717247 ,
        0.329009  , 0.20401588],
       [0.7625692 , 1.        , 0.76905125, 0.9233676 , 0.6349083 ,
        0.6571193 , 0.48010904, 0.48846245, 0.4031633 , 0.3610686 ,
        0.42383695, 0.2658446 ],
       [0.77220273, 0.76905125, 1.        , 0.75621796, 0.53077924,
        0.6586367 , 0.3545523 , 0.36299706, 0.35417837, 0.26405603,
        0.32125574, 0.21048518],
       [0.7502579 , 0.9233676 , 0.75621796, 0.9999995 , 0.65170956,
        0.6516983 , 0.4132585 , 0.40341088, 0.3540296 , 0.32176858,
        0.34965864, 0.21814035],
       [0.5858218 , 0.6349083 , 0.53077924, 0.65170956, 1.0000002 ,
        0.54156715, 0.38234913, 0.39802074, 0.368901  , 0.31740266,
        0.33768153, 0.28169268],
       [0.6954645 , 0.6571193 , 0.6586367 , 0.6516983 , 0.54156715,
        0.9999998 , 0.2784194 , 0.29993677, 0.30385184, 0.21876106,
        0.26020136,

## Evaluating Retrieval

To introduce the problem, we take some example Chinese paragraphs to use as queries and English paragraphs to use as candidate results to search through.

In [None]:
query_articles = [r['text'] for r in articles if r['lang'] == 'zh']
result_articles = [r['text'] for r in articles if r['lang'] == 'en']

To make the example clearer, we will use different numbers of queries and results.

In [None]:
qembed = labse.encode(query_articles[0:200])
rembed = labse.encode(result_articles[0:500])

Multiplying the query embeddings by the result embeddings, we get a $200 \times 500$ queries-by-results matrix.

In [None]:
sim = qembed @ rembed.T
sim.shape

(200, 500)

We use numpy's `argmax` function along the second dimension (`axis=1`) to get the index of the top result for each query.

In [None]:
argmax = np.argmax(sim, axis=1)
argmax

array([454,   1,   2, 282, 410, 120,  13,   7, 162,   9,  10,  11,  12,
        13,  14,  15, 153, 212, 247,  19,  20,  21,  22,  23, 372, 372,
        26,  27,  28,  29,  82,  31,  32,  33,  34,  35,  36,  66,  93,
        39, 397,  41, 383,  83,  44,  45,  46,  47,  48,  49,  50, 245,
       296, 139,  54,  55,  56,  57,  58,  59, 266,  61,  62, 171,  64,
        31,  29,  67,  68,  69,  70,  71, 242, 372,  74,  75,  76,  77,
       253,  79,  80,  81,  82,  83, 160,  85,  86, 492,  88,  17,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99,  32, 101, 102, 103,
       104, 307, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120,  82, 122, 123, 124,  14, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       273, 144, 424, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 120, 165, 166, 167, 171,
       169, 170, 171, 147, 173, 174, 175, 176, 177, 178, 179, 18

Since the query and result documents are in the same order, matching Chinese and English documents have the same index. This allows us to compute the accuracy, or &ldquo;recall at 1&rdquo;, of Chinese-to-English retrieval.

In [None]:
sum([a==b for (a, b) in zip(range(len(argmax)), argmax)])/len(argmax)

np.float64(0.785)

Your first task is to compute the recall at 1 for Arabic, Chinese, French, German, and Greek query documents matching English documents. Use the first 1000 English documents as the candidates you will search through.

In [None]:
candidates = labse.encode(result_articles[0:1000])

For each of the other five languages, construct embeddings for the first 1000 documents and measure how often the most similar English document is the matching one.

In [None]:
# TODO: Compute and print the recall at 1 for X-English retrieval
# where X \in {ar,de,el,fr,zh}
ar_query_articles = [r['text'] for r in articles if r['lang'] == 'ar']
ar_qembed = labse.encode(ar_query_articles[0:200])
ar_sim = ar_qembed @ candidates.T
ar_argmax = np.argmax(ar_sim, axis=1)
ae_recall_at_1 = sum([a==b for (a, b) in zip(range(len(ar_argmax)), ar_argmax)])/len(ar_argmax)
print("Recall@1 for Arabic-English:", ae_recall_at_1)

de_query_articles = [r['text'] for r in articles if r['lang'] == 'de']
de_qembed = labse.encode(de_query_articles[0:200])
de_sim = de_qembed @ candidates.T
de_argmax = np.argmax(de_sim, axis=1)
de_recall_at_1 = sum([a==b for (a, b) in zip(range(len(de_argmax)), de_argmax)])/len(de_argmax)
print("Recall@1 for German-English:", de_recall_at_1)

el_query_articles = [r['text'] for r in articles if r['lang'] == 'el']
el_qembed = labse.encode(el_query_articles[0:200])
el_sim = el_qembed @ candidates.T
el_argmax = np.argmax(el_sim, axis=1)
el_recall_at_1 = sum([a==b for (a, b) in zip(range(len(el_argmax)), el_argmax)])/len(el_argmax)
print("Recall@1 for Greek-English:", el_recall_at_1)

fr_query_articles = [r['text'] for r in articles if r['lang'] == 'fr']
fr_qembed = labse.encode(fr_query_articles[0:200])
fr_sim = fr_qembed @ candidates.T
fr_argmax = np.argmax(fr_sim, axis=1)
fr_recall_at_1 = sum([a==b for (a, b) in zip(range(len(fr_argmax)), fr_argmax)])/len(fr_argmax)
print("Recall@1 for French-English:", fr_recall_at_1)

zh_query_articles = [r['text'] for r in articles if r['lang'] == 'zh']
zh_qembed = labse.encode(zh_query_articles[0:200])
zh_sim = zh_qembed @ candidates.T
zh_argmax = np.argmax(zh_sim, axis=1)
zh_recall_at_1 = sum([a==b for (a, b) in zip(range(len(zh_argmax)), zh_argmax)])/len(zh_argmax)
print("Recall@1 for Chinese-English:", zh_recall_at_1)

Recall@1 for Arabic-English: 0.88
Recall@1 for German-English: 0.86
Recall@1 for Greek-English: 0.905
Recall@1 for French-English: 0.86
Recall@1 for Chinese-English: 0.73


We often use retrieved documents to provide extra context to a language model. In that case, we might retrieve more than one document per query to increase the likelihood that useful documents are in the top $k$. For each of the five non-English languages, write code to evaluate the **recall at k** (R@k), i.e., the proportion of queries for which the correct document was anywhere in the top k results.

In [None]:
# TODO: Write a function to compute recall at k
def recall_at_k(lang_code, k, query_size, candidate_size):
    query_articles = [r['text'] for r in articles if r['lang'] == lang_code][:query_size]
    qembed = labse.encode(query_articles)

    candidate_articles = [r['text'] for r in articles if r['lang'] == 'en'][:candidate_size]
    cembed = labse.encode(candidate_articles)

    sim = qembed @ cembed.T

    topk = np.argsort(-sim, axis=1)[:, :k]

    hits = [i in row for i, row in enumerate(topk)]

    recall = sum(hits) / len(hits)

    print(f"Recall@{k} for {lang_code}-English:", round(recall, 3))
    return recall



In [None]:
# TODO: Compute and print recall at 5 and recall at 10 for X-English retrieval
# where X \in {ar,de,el,fr,zh}

recall_at_k('ar', 5, 200, 1000)
recall_at_k('de', 5, 200, 1000)
recall_at_k('el', 5, 200, 1000)
recall_at_k('fr', 5, 200, 1000)
recall_at_k('zh', 5, 200, 1000)

recall_at_k('ar', 10, 200, 1000)
recall_at_k('de', 10, 200, 1000)
recall_at_k('el', 10, 200, 1000)
recall_at_k('fr', 10, 200, 1000)
recall_at_k('zh', 10, 200, 1000)

Recall@5 for ar-English: 0.965
Recall@5 for de-English: 0.97
Recall@5 for el-English: 0.955
Recall@5 for fr-English: 0.96
Recall@5 for zh-English: 0.885
Recall@10 for ar-English: 0.98
Recall@10 for de-English: 0.98
Recall@10 for el-English: 0.965
Recall@10 for fr-English: 0.96
Recall@10 for zh-English: 0.91


0.91

## Different Retrieval Strategies

**TODO**: Not all languages perform equally well using the LaBSE model. Your task is to find an alternative retrieval method that _improves performance for at least one language_ while _not degrading performance for other languages_.

You are free to use any open encoder or generative models available on huggingface. Here are three ideas to get you started. You only need to implement one improvement, although you may keep other dead-ends in the notebook.

1. Find other embedding models on huggingface that work better for, e.g., Chinese, while maintaining performance on the other languages.
1. LaBSE was trained on translation pairs, but Wikipedia articles are not necessarily translations of each other. Use the remaining articles in the dataset to fine-tune LaBSE (or another model). [This huggingface guide to fine-tuning sentence embeddings](https://huggingface.co/blog/train-sentence-transformers) may be helpful.
1. Instead of using embeddings, you could use a generative model to try to directly output the title of the English article given the foreign-language title and article. This approach is known as [generative retrieval](https://arxiv.org/abs/2404.14851).

What you try is up to you. Describe your approach and use the recall at k function above to evaluate your results.

In [None]:
from sentence_transformers import SentenceTransformer

e5_model = SentenceTransformer("intfloat/multilingual-e5-small")

def recall_at_k(lang_code, k, query_size, candidate_size):
  query_articles = [r['text'] for r in articles if r['lang'] == lang_code][:query_size]
  query_inputs = ["query: " + q for q in query_articles]
  qembed = e5_model.encode(query_inputs)
  candidate_articles = [r['text'] for r in articles if r['lang'] == 'en'][:candidate_size]
  candidate_inputs = ["passage: " + c for c in candidate_articles]
  cembed = e5_model.encode(candidate_inputs)

  sim = qembed @ cembed.T

  topk = np.argsort(-sim, axis=1)[:, :k]

  hits = [i in row for i, row in enumerate(topk)]
  recall = sum(hits) / len(hits)

  print(f"Recall@{k} for {lang_code}-English:", round(recall, 3))
  return recall

recall_at_k('ar', 5, 200, 1000)
recall_at_k('de', 5, 200, 1000)
recall_at_k('el', 5, 200, 1000)
recall_at_k('fr', 5, 200, 1000)
recall_at_k('zh', 5, 200, 1000)


Recall@5 for ar-English: 0.965
Recall@5 for de-English: 0.97
Recall@5 for el-English: 0.955
Recall@5 for fr-English: 0.96
Recall@5 for zh-English: 0.885


0.885

To improve cross-language retrieval performance, I use other embedding models on huggingface that work better from Hugging Face: intfloat/multilingual-e5-small.

The E5 model is trained specifically for retrieval tasks. Unlike LaBSE, which was trained on translation pairs, E5 was trained using a contrastive learning setup with query-passage pairs. Following is the Recall@5 results of LaBSE and E5:

| Language       | LaBSE Recall@5 | E5 Recall@5 |
| -------------- | -------------- | ----------- |
| Arabic (`ar`)  | 0.950          | **0.965**   |
| German (`de`)  | 0.920          | **0.970**   |
| Greek (`el`)   | 0.945          | **0.955**   |
| French (`fr`)  | 0.925          | **0.960**   |
| Chinese (`zh`) | 0.850          | **0.885**   |

The multilingual-e5-small model improves recall across all five languages, especially for German. Therefore, this embedding model is a better choice for this retrieval task on Wikipedia data.
