# Multilingual RAG


# Abstract

This notebook explains multilingual retrieval for retrieval-augmented generation (RAG) systems. It demonstrates how to support both multilingual queries and multilingual knowledge bases, using either multilingual embedding models or translation.

# Implementation

## Setup

This section installs the following packages:

*   **Chromadb** to store dummy documents for testing purposes.
*   **Sentence Transformers** for embedding the dummy documents.
*   **OpenAI** for answer generation and query translation.



In [None]:
!pip install -q openai chromadb sentence_transformers

In [None]:
import getpass
import chromadb
import chromadb.utils.embedding_functions as embedding_functions

from openai import OpenAI

In [None]:
# Get OpenAI's API Key
str_openai_key = getpass.getpass("Enter OpenAI API Key: ")

Enter OpenAI API Key: ··········


## Clients Instantiation

This section creates a Chromadb client and an OpenAI client. It also uses Chromadb's integration of Sentence Transformers to define two embedding functions: `all-mpnet-base-v2` (English-specific) and a multilingual MPNet-based model, `paraphrase-multilingual-mpnet-base-v2`.

In [None]:
chroma_client = chromadb.Client()
openai_client = OpenAI(api_key=str_openai_key)

monolingual_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2")
multilingual_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="paraphrase-multilingual-mpnet-base-v2")


## Utility Functions

In [None]:
def translate(str_query, str_language="English"):
  """Translates the given textual query into English using GPT-3.5-Turbo"""
  response = openai_client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
      {
        "role": "user",
        "content": f"Translate the following query into {str_language}:\n{str_query}"
      }
    ]
  )

  return response.choices[0].message.content
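
Each call to `translate` sends a request to the OpenAI API, so repeated queries may be worth caching. The following is a minimal sketch, assuming a hypothetical `make_cached_translator` helper that wraps any translation callable. It is not part of the original notebook, and `fake_translate` is a stand-in so that the example runs offline.

```python
from typing import Callable, Dict, List, Tuple


def make_cached_translator(
    translate_fn: Callable[[str, str], str],
) -> Callable[..., str]:
    """Wrap a (text, language) -> translation callable with an in-memory cache."""
    dict_cache: Dict[Tuple[str, str], str] = {}

    def cached(str_text: str, str_language: str = "English") -> str:
        key = (str_text, str_language)
        if key not in dict_cache:
            # Only call the underlying translator for unseen (text, language) pairs
            dict_cache[key] = translate_fn(str_text, str_language)
        return dict_cache[key]

    return cached


# Stand-in translator (hypothetical): it only tags the text and records calls
lst_calls: List[Tuple[str, str]] = []


def fake_translate(str_text: str, str_language: str) -> str:
    lst_calls.append((str_text, str_language))
    return f"[{str_language}] {str_text}"


cached_translate = make_cached_translator(fake_translate)
cached_translate("Quali sono i benefici degli alberi?")
cached_translate("Quali sono i benefici degli alberi?")
print(len(lst_calls))  # 1: the second call is served from the cache
```

In the notebook, `translate` itself could be passed as `translate_fn`, since it accepts the same two positional arguments.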

## Knowledge Base Creation

In [None]:
# Create the collections
collection_monolingual_kb_monolingual_ef = chroma_client.create_collection(
    name="monolingual_kb_monolingual_ef",
    embedding_function=monolingual_ef
    )
collection_monolingual_kb_multilingual_ef = chroma_client.create_collection(
    name="monolingual_kb_multilingual_ef",
    embedding_function=multilingual_ef
    )
collection_multilingual_kb_monolingual_ef = chroma_client.create_collection(
    name="multilingual_kb_monolingual_ef",
    embedding_function=monolingual_ef
    )
collection_multilingual_kb_multilingual_ef = chroma_client.create_collection(
    name="multilingual_kb_multilingual_ef",
    embedding_function=multilingual_ef
    )

In [None]:
# Dummy documents for testing purposes
lst_monolingual_documents = [
    "The sky is blue.",
    "Trees provide shade.",

    "Birds sing in the morning.",
    "Cats purr when they are happy.",

    "Coffee is a popular morning beverage.",
    "Exercise is good for your health."
]

lst_multilingual_documents = [
    "The sky is blue.",
    "Gli alberi forniscono ombra.",

    "Gli uccelli cantano al mattino.",
    "Cats purr when they are happy.",

    "Il caffè è una bevanda popolare al mattino.",
    "Exercise is good for your health."
]

In [None]:
# Populating the collections
collection_monolingual_kb_monolingual_ef.add(
    documents=lst_monolingual_documents,
    ids=[f"id_{int_index}"
         for int_index in range(len(lst_monolingual_documents))]
)

collection_monolingual_kb_multilingual_ef.add(
    documents=lst_monolingual_documents,
    ids=[f"id_{int_index}"
         for int_index in range(len(lst_monolingual_documents))]
)

collection_multilingual_kb_monolingual_ef.add(
    documents=lst_multilingual_documents,
    ids=[f"id_{int_index}"
         for int_index in range(len(lst_multilingual_documents))]
)

collection_multilingual_kb_multilingual_ef.add(
    documents=lst_multilingual_documents,
    ids=[f"id_{int_index}"
         for int_index in range(len(lst_multilingual_documents))]
)

## Supporting Multilingual Chat Sessions

This section defines two natural-language queries, one in English and one in Italian. The goal is to support retrieval for both languages on a knowledge base that contains only English documents.

Retrieval with the English-specific MPNet model works for the English query. For the Italian query, however, it misses the desired document.

In [None]:
str_english_query = "What are the benefits of trees?"
str_italian_query = "Quali sono i benefici degli alberi?"

In [None]:
# Retrieval with the English Query and the monolingual embedding model
lst_english_query_retrieval = collection_monolingual_kb_monolingual_ef.query(
    query_texts=str_english_query, n_results=1)["documents"]

# Retrieval with the Italian Query and the monolingual embedding model
lst_italian_query_retrieval = collection_monolingual_kb_monolingual_ef.query(
    query_texts=str_italian_query, n_results=1)["documents"]

print(f"Retrieved documents for the English query: {lst_english_query_retrieval}")
print(f"Retrieved documents for the Italian query: {lst_italian_query_retrieval}")

Retrieved documents for the English query: [['Trees provide shade.']]
Retrieved documents for the Italian query: [['Coffee is a popular morning beverage.']]
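
Dense retrieval ranks documents by how close their embedding vectors are to the query vector, and an English-specific model gives no guarantee that an Italian query lands near its English equivalent. The sketch below illustrates only the ranking step. The three-dimensional vectors are made up for illustration and are not real model embeddings, and cosine similarity is used as an assumed similarity measure.

```python
import math
from typing import Dict, List


def cosine_similarity(vec_a: List[float], vec_b: List[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    flt_dot = sum(a * b for a, b in zip(vec_a, vec_b))
    flt_norm_a = math.sqrt(sum(a * a for a in vec_a))
    flt_norm_b = math.sqrt(sum(b * b for b in vec_b))
    return flt_dot / (flt_norm_a * flt_norm_b)


def retrieve_top_1(query_vec: List[float], dict_doc_vecs: Dict[str, List[float]]) -> str:
    """Return the document whose vector is most similar to the query vector."""
    return max(
        dict_doc_vecs,
        key=lambda str_doc: cosine_similarity(query_vec, dict_doc_vecs[str_doc]),
    )


# Made-up vectors (assumption): not produced by any embedding model
dict_doc_vecs = {
    "Trees provide shade.": [0.9, 0.1, 0.0],
    "Coffee is a popular morning beverage.": [0.1, 0.9, 0.2],
}
english_query_vec = [0.8, 0.2, 0.1]  # placed near the "trees" document
italian_query_vec = [0.2, 0.7, 0.6]  # poorly placed, as an English-only model might do

print(retrieve_top_1(english_query_vec, dict_doc_vecs))
print(retrieve_top_1(italian_query_vec, dict_doc_vecs))
```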


### Multilingual Embedding Model

One way to support the Italian query is to use the multilingual version of the embedding model.

In [None]:
# Retrieval with the Italian Query and the multilingual embedding model
lst_italian_query_retrieval = collection_monolingual_kb_multilingual_ef.query(
    query_texts=str_italian_query, n_results=1)["documents"]

print(f"Retrieved documents for the Italian query: {lst_italian_query_retrieval}")

Retrieved documents for the Italian query: [['Trees provide shade.']]


### Query Translation

Alternatively, the Italian query can be translated to English and the monolingual embedding model can then be used to obtain the same result.

In [None]:
# Translate the Italian query into English
str_translated_query = translate(str_italian_query)

# Retrieval with the translated query and the monolingual embedding model
lst_italian_query_retrieval = collection_monolingual_kb_monolingual_ef.query(
    query_texts=str_translated_query, n_results=1)["documents"]

print(f"Original Italian Query:\t{str_italian_query}")
print(f"Translated Query:\t{str_translated_query}")
print(f"\nRetrieved documents for the translated query: {lst_italian_query_retrieval}")

Original Italian Query:	Quali sono i benefici degli alberi?
Translated Query:	What are the benefits of trees?

Retrieved documents for the translated query: [['Trees provide shade.']]


## Supporting Multilingual Knowledge Bases

So far in the notebook, only the monolingual knowledge base has been used. The goal of this section is to support retrieval when the knowledge base itself contains documents in different languages.

As shown below, retrieval for the English query fails even with an English embedding model, because the relevant document is in Italian. Meanwhile, the Italian query matches the correct document even though the embedding model is English-specific.

In [None]:
# Retrieval with the English Query and the monolingual embedding model
lst_english_query_retrieval = collection_multilingual_kb_monolingual_ef.query(
    query_texts=str_english_query, n_results=1)["documents"]

# Retrieval with the Italian Query and the monolingual embedding model
lst_italian_query_retrieval = collection_multilingual_kb_monolingual_ef.query(
    query_texts=str_italian_query, n_results=1)["documents"]

print(f"Retrieved documents for the English query: {lst_english_query_retrieval}")
print(f"Retrieved documents for the Italian query: {lst_italian_query_retrieval}")

Retrieved documents for the English query: [['Exercise is good for your health.']]
Retrieved documents for the Italian query: [['Gli alberi forniscono ombra.']]


### Multilingual Embedding Models

The issue is resolved once a multilingual embedding model is used for the English query.

In [None]:
# Retrieval with the English Query and the multilingual embedding model
lst_english_query_retrieval = collection_multilingual_kb_multilingual_ef.query(
    query_texts=str_english_query, n_results=1)["documents"]

print(f"Retrieved documents for the English query: {lst_english_query_retrieval}")

Retrieved documents for the English query: [['Gli alberi forniscono ombra.']]


### Query Translation

In [None]:
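lst_languages_in_kb = ["English", "Italian"]
str_spanish_query = "¿Cuáles son los beneficios de los árboles?"

# Translate the query into every language present in the knowledge base,
# then retrieve once per translation (the collection used here is an assumption)
for str_language in lst_languages_in_kb:
  str_translated_query = translate(str_spanish_query, str_language=str_language)
  lst_retrieval = collection_multilingual_kb_monolingual_ef.query(
      query_texts=str_translated_query, n_results=1)["documents"]
  print(f"{str_language} query: {str_translated_query} -> {lst_retrieval}")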
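
Querying once per knowledge-base language yields several result sets that still have to be combined. The following is a minimal sketch of one way to merge them, assuming a hypothetical `merge_query_results` helper and result dictionaries shaped like Chromadb query results (`ids`, `documents`, and `distances`, each holding one inner list per query text). The distances in the example are invented for illustration.

```python
from typing import Dict, List, Tuple


def merge_query_results(
    lst_results: List[Dict], int_top_k: int = 3
) -> List[Tuple[str, str, float]]:
    """Merge per-language results, keeping each document's smallest distance."""
    dict_best: Dict[str, Tuple[str, float]] = {}
    for dict_result in lst_results:
        # Index 0 selects the results of the single query text sent per call
        for str_id, str_doc, flt_dist in zip(
            dict_result["ids"][0],
            dict_result["documents"][0],
            dict_result["distances"][0],
        ):
            if str_id not in dict_best or flt_dist < dict_best[str_id][1]:
                dict_best[str_id] = (str_doc, flt_dist)

    # Rank unique documents by ascending distance and keep the top k
    lst_ranked = sorted(
        ((str_id, str_doc, flt_dist)
         for str_id, (str_doc, flt_dist) in dict_best.items()),
        key=lambda tpl: tpl[2],
    )
    return lst_ranked[:int_top_k]


# Invented results for the English and Italian translations of one query
dict_english_result = {
    "ids": [["id_5", "id_1"]],
    "documents": [["Exercise is good for your health.",
                   "Gli alberi forniscono ombra."]],
    "distances": [[1.2, 1.4]],
}
dict_italian_result = {
    "ids": [["id_1", "id_2"]],
    "documents": [["Gli alberi forniscono ombra.",
                   "Gli uccelli cantano al mattino."]],
    "distances": [[0.4, 1.1]],
}

lst_merged = merge_query_results(
    [dict_english_result, dict_italian_result], int_top_k=2)
print([str_id for str_id, _, _ in lst_merged])  # ['id_1', 'id_2']
```

Distances are only comparable across result sets when every query goes to the same collection and embedding function.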


### Knowledge Base Translation

Translating the entire knowledge base into a single language reduces the problem to supporting multilingual chat sessions, which is covered above.

In [None]:
lst_translated_documents = []
for str_document in lst_multilingual_documents:
  lst_translated_documents.append(translate(str_document))

# Use a new collection name: "multilingual_kb_multilingual_ef" already exists,
# so creating it again would raise an error
collection_translated_monolingual_kb = chroma_client.create_collection(
    name="translated_monolingual_kb",
    embedding_function=multilingual_ef
    )

collection_translated_monolingual_kb.add(
    documents=lst_translated_documents,
    ids=[f"id_{int_index}"
         for int_index in range(len(lst_translated_documents))]
)

# Print translated knowledge base
print(lst_translated_documents)
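
Translating the knowledge base discards the original wording unless it is stored alongside the translation. The following is a minimal sketch, assuming a hypothetical `build_translated_records` helper that keeps each source text as metadata. The stand-in translator is a small lookup table so that the example runs without calling the OpenAI API.

```python
from typing import Callable, Dict, List


def build_translated_records(
    lst_documents: List[str],
    translate_fn: Callable[[str, str], str],
    str_language: str = "English",
) -> Dict[str, list]:
    """Translate documents and keep each original text as metadata."""
    return {
        "documents": [translate_fn(str_doc, str_language)
                      for str_doc in lst_documents],
        "metadatas": [{"original_text": str_doc} for str_doc in lst_documents],
        "ids": [f"id_{int_index}" for int_index in range(len(lst_documents))],
    }


# Stand-in translator (hypothetical): a lookup table with a pass-through fallback
dict_lookup = {"Gli alberi forniscono ombra.": "Trees provide shade."}


def fake_translate(str_text: str, str_language: str) -> str:
    return dict_lookup.get(str_text, str_text)


dict_records = build_translated_records(
    ["The sky is blue.", "Gli alberi forniscono ombra."], fake_translate)
print(dict_records["documents"])
print(dict_records["metadatas"][1]["original_text"])
```

The resulting dictionary uses the keyword names that a Chromadb collection's `add` method accepts, so it could be passed as `collection.add(**dict_records)`.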