In [1]:
#| output: false
#| code-fold: true
#| code-summary: "Install prerequisite libraries for colab"
! pip install -qq datasets
! pip install -Uqq weaviate-client


In [2]:
#| output: false
#| code-fold: true
#| code-summary: "Import required libraries"
from datasets import load_dataset
import weaviate
from weaviate.classes.init import Auth
import weaviate.classes as wvc
from google.colab import userdata
from tqdm.auto import tqdm
import google.generativeai as genai
genai.configure(api_key=userdata.get("GOOGLE_API_KEY"))

## How adept are current LLMs in low-resource languages?

One thing that bothers me about the current state of commercial LLMs is how much their performance lags for low-resource languages. **Bengali** is one such language, and it also happens to be my native language. While low-resource, it is certainly not lacking in speakers: Bengali has the fifth highest number of native speakers of any language in the world. Let's look at an example of Gemini translating English to Bengali:

In [3]:
prompt = "I may not be able to get this done by the deadline. Can I please get an extension?"

In [4]:
zero_shot_prompt = f"""
তুমি একজন সহায়ক সহকারী যে ইংরেজি থেকে বাংলা অনুবাদ করে।

নিম্নলিখিত প্রম্পটটি ইংরেজি থেকে বাংলায় অনুবাদ করো:

ইংরেজি:
{prompt}
বাংলা:
"""

In [5]:
model = genai.GenerativeModel("gemini-1.5-flash")
zs_response = model.generate_content(zero_shot_prompt)
print(zs_response.text)

সময়মতো কাজটি শেষ করতে পারবো না। কিছুটা সময় বাড়িয়ে দেওয়ার জন্য অনুরোধ করছি। 



For those of you unfamiliar with Bengali, the translation is *correct* in the sense that it technically conveys the important information. However, the *style* leaves much to be desired: it doesn't read like anything a native Bengali speaker would write.

In this article, we will look at whether few-shot prompting can improve translation quality. Hard-coding example translations would be time- and energy-intensive, and it would defeat the purpose for anyone who relies on machine translation precisely because they are unfamiliar with the language. Instead, we will retrieve the few-shot examples for a given prompt from a vector database of translation examples.

## Constructing vector database for example lookups

To construct the vector database, we will use the ~1,000 English-to-Bengali translation examples from Cohere's [`CohereForAI/aya_collection`](https://huggingface.co/datasets/CohereForAI/aya_collection) dataset. In the code cell below, we fetch the dataset, keep only the Bengali examples, and then pre-process them to strip the translation instructions so that only the sentence pairs remain.



In [None]:
#| output: false
# Load an English to Bengali translation dataset from Aya Collection
dataset = load_dataset("CohereForAI/aya_collection", "templated_indic_sentiment")['train']
dataset = dataset.filter(lambda example: example['language'] == 'ben')

dataset = dataset.map(lambda ex: {
    "from": ex["inputs"][ex["inputs"].find(": \"")+3:-1],
    "to": ex["targets"][1:-1],
})

dataset = dataset.select_columns(["from", "to"])

README.md:   0%|          | 0.00/72.3k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/3.00M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/11559 [00:00<?, ? examples/s]

Filter:   0%|          | 0/11559 [00:00<?, ? examples/s]

Map:   0%|          | 0/1156 [00:00<?, ? examples/s]
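The slicing inside the `map` call above is compact but easy to misread. Here is a minimal sketch of the same extraction written as named helpers. The sample strings are hypothetical: they only imitate the shape the slicing implies (an instruction, then `: "`, then the quoted sentence), and `extract_source` and `extract_target` are names introduced here for illustration. Unlike the one-liner, this version raises an error when the marker is missing instead of silently returning a wrong slice.

```python
def extract_source(inputs: str) -> str:
    """Return the text after the first ': "' marker, minus the closing quote."""
    marker = ': "'
    start = inputs.find(marker)
    if start == -1:
        # str.find returns -1 when the marker is absent; fail loudly in that case
        raise ValueError(f"marker {marker!r} not found in {inputs!r}")
    return inputs[start + len(marker):-1]


def extract_target(targets: str) -> str:
    """Drop the first and last character (the surrounding quotes)."""
    return targets[1:-1]


# Hypothetical example in the assumed template shape
sample_inputs = 'Translate this sentence to Bengali: "It is very expensive."'
sample_targets = '"এটা অনেক দামি।"'

source = extract_source(sample_inputs)
target = extract_target(sample_targets)
```

If the real template differs from this assumed shape, only the `marker` string should need to change.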

Now, we use Google's embedding API to generate embeddings for all the English source sentences. If you are following along, this code cell might take quite a while to finish, since it makes one API call per example. You could certainly use other embedding models, even local ones (the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) lists many strong options). I used Google's embeddings simply because they are free, have a simple API, and let me keep experimenting on a free CPU Colab instance for longer.

In [None]:
#| output: false
translation_objs = list()

for data in tqdm(dataset):
    google_vec = genai.embed_content(
        model="models/text-embedding-004",
        content=data["from"],
        task_type="retrieval_document",
        title="Embedding of single string"
    )

    translation_objs.append(wvc.data.DataObject(
        properties={
            "en": data["from"],
            "bn": data["to"],
        },
        vector=google_vec['embedding'],
    ))

  0%|          | 0/1156 [00:00<?, ?it/s]
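Because the loop above makes one request per sentence, most of the waiting is network round trips. One possible speed-up is to embed several sentences per request. The sketch below only shows the chunking logic: `chunked`, `embed_in_batches`, and the `embed_fn` parameter are hypothetical names, and `fake_embed` is a stand-in so the example runs offline. Whether your embedding client accepts a list of strings per call, and what batch size it allows, is an assumption to check against its documentation.

```python
from typing import Callable, Iterator, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], list],
    batch_size: int = 50,
) -> list:
    """Embed `texts` batch by batch, preserving the input order."""
    vectors = []
    for batch in chunked(texts, batch_size):
        batch_vectors = embed_fn(batch)
        if len(batch_vectors) != len(batch):
            raise RuntimeError("embedding backend returned a wrong number of vectors")
        vectors.extend(batch_vectors)
    return vectors


def fake_embed(batch: Sequence[str]) -> list:
    """Offline stand-in for a real embedding call: one tiny vector per text."""
    return [[float(len(text))] for text in batch]


sentences = [
    "It is very expensive.",
    "It is not long-lasting.",
    "It doesn't assemble quickly.",
]
vectors = embed_in_batches(sentences, fake_embed, batch_size=2)
```

Passing the embedding call in as `embed_fn` keeps the batching logic independent of any particular provider, so the same helper would work with a local model.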

Next, I store the embeddings in a Weaviate Cloud collection aptly named `Translations`. When creating the collection, I specify that I won't need a vectorizer, since I'll be supplying the vectors myself.

In [8]:
#| output: false
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=userdata.get("WEAVIATE_URL"),
    auth_credentials=Auth.api_key(userdata.get("WEAVIATE_API")),
)

In [9]:
#| output: false
# Check if connection to cloud was successfully established
client.is_ready()

True

In [None]:
#| output: false
client.collections.create(
    "Translations",
    vectorizer_config=wvc.config.Configure.Vectorizer.none(),
)

Finally, let's upload the generated embeddings to the collection:

In [None]:
#| output: false
translations = client.collections.get("Translations")
translations.data.insert_many(translation_objs)

BatchObjectReturn(_all_responses=[UUID('9a12a42a-689e-413d-8bd6-6dd8f640f183'), UUID('14d96c96-a48a-44d2-8ed0-07c962adacb9'), ...

## Generating translations with retrieved examples

To make a RAG query for the examples, we first embed our prompt and then query the Weaviate collection for the 5 nearest examples:

In [12]:
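# Note: this reuses the document-side settings from the indexing step.
# The API also offers task_type="retrieval_query" for the query side.
prompt_embed = genai.embed_content(
    model="models/text-embedding-004",
    content=prompt,
    task_type="retrieval_document",
    title="Embedding of single string"
)['embedding']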

In [13]:
collection = client.collections.get("Translations")

response = collection.query.near_vector(
    near_vector=prompt_embed,
    limit=5,
)

for obj in response.objects:
    print(obj.properties["en"])
    print(obj.properties["bn"])

examples = response.objects

It doesn't assemble quickly.
এটি দ্রুত একত্রিত হয় না।
As the frequency is very less, you do not save time here.
যেহেতু ফ্রিকোয়েন্সি খুব কম, আপনি এখানে সময় বাঁচাতে পারবেন না।
Were not very punctual in the past.
আগে আমরা খুব একটা সময়ানুবর্তী ছিলাম না।
It is not long-lasting.
বেশিদিন টিকবে না।
It is very expensive.
এটা অনেক দামি।
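Conceptually, `near_vector` ranks the stored vectors by their distance to the query vector and returns the closest `limit` objects. The pure-Python sketch below imitates that with exact cosine similarity over a toy list. It is an illustration only: a vector database typically relies on an approximate index rather than a full scan, the distance metric depends on how the collection is configured, and the 2-dimensional vectors and the `top_k` helper are made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def top_k(query, items, k):
    """Return the labels of the k items most similar to `query`.

    `items` is a list of (label, vector) pairs.
    """
    ranked = sorted(
        items,
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


# Toy store with made-up 2-dimensional "embeddings"
toy_store = [
    ("It is very expensive.", [1.0, 0.0]),
    ("It is not long-lasting.", [0.8, 0.6]),
    ("The weather is nice.", [0.0, 1.0]),
]

nearest = top_k([0.9, 0.1], toy_store, k=2)
```

Here `nearest` holds the two sentences whose vectors point most nearly in the same direction as the query, which is the same idea behind the five examples returned above.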


With the examples now in hand, we can construct our few-shot prompt. Notice how the non-example part of the prompt is very similar to our zero-shot prompt from earlier.

In [14]:
full_prompt = f"""
তুমি একজন সহায়ক সহকারী যে ইংরেজি থেকে বাংলা অনুবাদ করে।

উদাহরণ অনুবাদ:

ইংরেজি:
{examples[0].properties["en"]}
বাংলা:
{examples[0].properties["bn"]}

ইংরেজি:
{examples[1].properties["en"]}
বাংলা:
{examples[1].properties["bn"]}

ইংরেজি:
{examples[2].properties["en"]}
বাংলা:
{examples[2].properties["bn"]}

ইংরেজি:
{examples[3].properties["en"]}
বাংলা:
{examples[3].properties["bn"]}

ইংরেজি:
{examples[4].properties["en"]}
বাংলা:
{examples[4].properties["bn"]}

এখন নিম্নলিখিত প্রম্পটটি ইংরেজি থেকে বাংলায় অনুবাদ করো:

ইংরেজি:
{prompt}
বাংলা:
"""

In [15]:
model = genai.GenerativeModel("gemini-1.5-flash")
fs_response = model.generate_content(full_prompt)
print(fs_response.text)

বাংলা: 

আমি সময়মতো এটা সম্পন্ন করতে নাও পারি। অনুগ্রহ করে আমাকে কি একটু সময় বাড়িয়ে দেওয়া যাবে? 



Voilà! Again, for those of you not familiar with Bengali, this translation is much closer, pragmatically, to how the prompt should ideally be rendered in Bengali. The request for an extension is much less blunt than in the zero-shot translation, and the first sentence is more faithful to the original: it keeps the hedge that the deadline *may* be missed, whereas the zero-shot version flatly states that the work will not be finished on time.
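One practical refinement: the few-shot prompt above is written out as five indexed blocks, which ties it to exactly five examples. A minimal sketch of the same layout assembled in a loop follows. `build_few_shot_prompt` is a name introduced here, and it takes plain `(en, bn)` pairs rather than Weaviate objects so that it runs on its own.

```python
def build_few_shot_prompt(pairs, prompt):
    """Assemble a few-shot translation prompt from (english, bengali) pairs."""
    header = "তুমি একজন সহায়ক সহকারী যে ইংরেজি থেকে বাংলা অনুবাদ করে।"
    # One "English: ... / Bengali: ..." block per retrieved example
    shots = "\n\n".join(
        f"ইংরেজি:\n{en}\nবাংলা:\n{bn}" for en, bn in pairs
    )
    return (
        f"\n{header}\n\n"
        f"উদাহরণ অনুবাদ:\n\n"
        f"{shots}\n\n"
        f"এখন নিম্নলিখিত প্রম্পটটি ইংরেজি থেকে বাংলায় অনুবাদ করো:\n\n"
        f"ইংরেজি:\n{prompt}\nবাংলা:\n"
    )


# Two of the retrieved pairs shown earlier, used as a small demo
demo_pairs = [
    ("It is not long-lasting.", "বেশিদিন টিকবে না।"),
    ("It is very expensive.", "এটা অনেক দামি।"),
]
demo_prompt = build_few_shot_prompt(
    demo_pairs,
    "Can I please get an extension?",
)
```

With the retrieved objects, the pairs could be built as `[(o.properties["en"], o.properties["bn"]) for o in examples]`, and changing `limit` in the query would then change the number of shots without touching the prompt code.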

In [16]:
#| echo: false
#| output: false
client.close()