# Exercise 3 – English to Persian Translation with ICL and BLEU Evaluation

We created a small dataset of 10 English sentences with their Persian reference translations.  
The task is to apply in-context learning (ICL) in zero-shot, one-shot, and few-shot settings, with different prompts, to a model such as Gemma, Aya, or Llama, translate the sentences into Persian, and evaluate the outputs with the BLEU score.  
We also compare the results with Google Translate.
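
To make the metric concrete, here is a minimal sketch of corpus-level BLEU in plain Python. It is a simplification that assumes whitespace tokenization, one reference per sentence, and no smoothing. `simple_corpus_bleu` is a hypothetical helper, not part of `sacrebleu`, so its scores will not match `sacrebleu.corpus_bleu` exactly.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(hypotheses, references, max_n=4):
    """Simplified BLEU (0-100): clipped n-gram precision with a brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()  # assumption: whitespace tokenization
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


hyp = ["من woke up این morning."]
ref = ["من امروز صبح زود بیدار شدم."]
print(simple_corpus_bleu(hyp, ref))  # no shared bigram, so the unsmoothed score is 0.0
```

The example shows why mixed-language outputs score so low: a single matching word is not enough, because BLEU combines the precisions of 1-grams up to 4-grams.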

In [5]:
!pip install transformers accelerate googletrans==4.0.0-rc1 sacrebleu huggingface_hub -q

In [6]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from huggingface_hub import login
from googletrans import Translator as GoogleTranslator
import sacrebleu

In [None]:
# Replace the string below with your Hugging Face token locally (never commit a real token)
hf_token = 'put_your_token_here'
login(token=hf_token)  # login was already imported from huggingface_hub above

In [8]:
dataset = [
    ("I woke up early this morning.", "من امروز صبح زود بیدار شدم."),
    ("She is reading a very interesting book.", "او دارد یک کتاب بسیار جالب می‌خواند."),
    ("They went to the park to play football.", "آن‌ها برای بازی فوتبال به پارک رفتند."),
    ("We had dinner at a nice restaurant last night.", "دیشب در یک رستوران خوب شام خوردیم."),
    ("He doesn't like watching horror movies.", "او تماشای فیلم‌های ترسناک را دوست ندارد."),
    ("Can you help me with this math problem?", "می‌تونی تو حل این مسئله ریاضی بهم کمک کنی؟"),
    ("The weather is getting colder every day.", "هوا هر روز سردتر می‌شود."),
    ("I have never been to Paris.", "من هرگز به پاریس نرفته‌ام."),
    ("She always forgets where she puts her keys.", "او همیشه فراموش می‌کند کلیدهایش را کجا گذاشته."),
    ("We are planning a trip to the mountains.", "ما داریم یک سفر به کوهستان برنامه‌ریزی می‌کنیم.")
]

## Gemma

In [9]:
model_name = "google/gemma-2b-it"

tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    token=hf_token
)

translator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=128,
    do_sample=False
)

`torch_dtype` is deprecated! Use `dtype` instead!



Device set to use cuda:0
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Below we look at the results of all three ICL settings (zero-shot, one-shot, few-shot).
The model was run with three successive versions of the prompts, each stating more explicitly that the output must contain only Persian.
Despite this, some non-Persian words still appeared.
However, the more explicit prompts improved the BLEU score slightly.
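
The three ICL settings differ only in how many solved examples precede the sentence to translate. The following is a minimal sketch of one prompt builder that covers all three, assuming the same `English:` / `Persian:` layout. `make_k_shot_prompt` and `DEFAULT_INSTRUCTION` are hypothetical names and are not part of the notebook's code.

```python
DEFAULT_INSTRUCTION = (
    "Translate the following English sentences into Persian.\n"
    "Output only the Persian translation. No English, no explanation."
)


def make_k_shot_prompt(eng, examples, instruction=DEFAULT_INSTRUCTION):
    """Build a prompt with len(examples) solved examples (0 gives zero-shot)."""
    lines = [instruction, ""]
    for example_eng, example_per in examples:
        lines += [f"English: {example_eng}", f"Persian: {example_per}", ""]
    # The prompt ends with an open "Persian:" slot for the model to complete.
    lines += [f"English: {eng}", "Persian:"]
    return "\n".join(lines)


demo_examples = [("Hello.", "سلام.")]  # illustrative pair, not taken from the dataset
print(make_k_shot_prompt("Good night.", []))             # zero-shot
print(make_k_shot_prompt("Good night.", demo_examples))  # one-shot
```

Passing a list of three pairs gives the few-shot variant, so the instruction text can be changed in one place for all three settings.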

In [None]:

# Zero-shot:
def make_zero_shot_prompt(eng):
    return f"Translate the following English sentence into Persian.\nEnglish: {eng}\nPersian:"

# One-shot:
def make_one_shot_prompt(eng):
    example_eng, example_per = dataset[0]
    return f"""Translate the following English sentences into Persian.
Output only the Persian translation. No English, no explanation.

English: {example_eng}
Persian: {example_per}

English: {eng}
Persian:"""

# Few-shot:
def make_few_shot_prompt(eng):
    examples = """Translate the following English sentences into Persian.
Output only the Persian translation. No English, no explanation.
"""
    for e, p in dataset[:3]:
        examples += f"\nEnglish: {e}\nPersian: {p}\n"
    examples += f"\nEnglish: {eng}\nPersian:"
    return examples




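def translate_with_prompt(make_prompt_func, name):
    """Translate every dataset sentence with the given prompt builder and print the results."""
    outputs = []
    for eng, _ in dataset:
        prompt = make_prompt_func(eng)
        # generated_text holds the prompt followed by the completion,
        # so keep only the text after the last "Persian:" marker.
        result = translator(prompt)[0]["generated_text"].split("Persian:")[-1].strip()
        outputs.append(result)
    print(f"\n=== {name} TRANSLATIONS ===")
    for i, (eng, _) in enumerate(dataset):
        print(f"\n{i+1}.  English: {eng}")
        print(f"    Persian ({name}): {outputs[i]}")
        print("-" * 60)
    return outputs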



results_zero = translate_with_prompt(make_zero_shot_prompt, "ZERO-SHOT")
results_one = translate_with_prompt(make_one_shot_prompt, "ONE-SHOT")
results_few = translate_with_prompt(make_few_shot_prompt, "FEW-SHOT")


refs = [p for _, p in dataset]

bleu_zero = sacrebleu.corpus_bleu(results_zero, [refs]).score
bleu_one = sacrebleu.corpus_bleu(results_one, [refs]).score
bleu_few = sacrebleu.corpus_bleu(results_few, [refs]).score

print("\n\n===== BLEU SCORES =====")
print(f"Zero-shot BLEU: {bleu_zero:.2f}")
print(f"One-shot  BLEU: {bleu_one:.2f}")
print(f"Few-shot  BLEU: {bleu_few:.2f}")

# Comparison with Google Translate
google_translator = GoogleTranslator()
google_results = [google_translator.translate(eng, src='en', dest='fa').text for eng, _ in dataset]
bleu_google = sacrebleu.corpus_bleu(google_results, [refs]).score

print(f"Google Translate BLEU: {bleu_google:.2f}")



You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset



=== ZERO-SHOT TRANSLATIONS ===

1.  English: I woke up early this morning.
    Persian (ZERO-SHOT): من woke up این morning.
------------------------------------------------------------

2.  English: She is reading a very interesting book.
    Persian (ZERO-SHOT): کتابی interesting است.

In this translation, the word "interesting" is translated literally, which is not the intended meaning. The correct translation should be " کتابی interesting است" which means that the book is interesting.
------------------------------------------------------------

3.  English: They went to the park to play football.
    Persian (ZERO-SHOT): به پارک رفتند برای بازی football.
------------------------------------------------------------

4.  English: We had dinner at a nice restaurant last night.
    Persian (ZERO-SHOT): ما با هم شام last night.
------------------------------------------------------------

5.  English: He doesn't like watching horror movies.
    Persian (ZERO-SHOT): او hates بازی‌های تر

The outputs above contain extra text: English words are left untranslated, and in one case the model appends an explanation after the translation.  
Gemma is trained mainly on English text with some multilingual data, so producing accurate Persian translations without fine-tuning or careful ICL prompt design is difficult.  
When the model does not fully follow the instruction to output only Persian, additional words or English terms appear in the output.  
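
Appended explanations can also be trimmed after generation, independently of the prompt wording. The following is a minimal sketch that keeps only the first non-empty line of the completion. `extract_translation` is a hypothetical helper, and it assumes the translation fits on one line.

```python
def extract_translation(generated_text, prompt):
    """Return the first non-empty line that the model produced after the prompt."""
    if generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        # Fallback: keep whatever follows the last "Persian:" marker.
        completion = generated_text.split("Persian:")[-1]
    for line in completion.splitlines():
        line = line.strip()
        if line:
            return line
    return ""


prompt = "English: She is reading a very interesting book.\nPersian:"
generated = prompt + " کتابی interesting است.\n\nIn this translation, the word ..."
print(extract_translation(generated, prompt))
```

This removes trailing commentary, but it does not fix untranslated English words inside the translation itself.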

Next, we make the prompt more restrictive.


In [None]:
# Zero-shot
def make_zero_shot_prompt(eng):
    return f"""Translate the following English sentence into Persian.
Return ONLY the Persian translation. No English, no explanation, no notes.
Output format: <translation>

English: {eng}
Persian:"""

#  One-shot
def make_one_shot_prompt(eng):
    example_eng, example_per = dataset[0]
    return f"""Translate the following English sentences into Persian.
Return ONLY the Persian translation. Do not include English text or any explanation.
Use the examples to understand the style.

Example:
English: {example_eng}
Persian: {example_per}

Now translate the next sentence:
English: {eng}
Persian:"""

# Few-shot
def make_few_shot_prompt(eng):
    examples = """Translate the following English sentences into Persian.
Return ONLY the Persian translation. No explanations or English text.
Use the examples below as a pattern.

"""
    for e, p in dataset[:3]:
        examples += f"English: {e}\nPersian: {p}\n\n"
    examples += f"English: {eng}\nPersian:"
    return examples





# translate_with_prompt is unchanged from the first Gemma cell, so it is reused here rather than redefined.



results_zero = translate_with_prompt(make_zero_shot_prompt, "ZERO-SHOT")
results_one = translate_with_prompt(make_one_shot_prompt, "ONE-SHOT")
results_few = translate_with_prompt(make_few_shot_prompt, "FEW-SHOT")


refs = [p for _, p in dataset]

bleu_zero = sacrebleu.corpus_bleu(results_zero, [refs]).score
bleu_one = sacrebleu.corpus_bleu(results_one, [refs]).score
bleu_few = sacrebleu.corpus_bleu(results_few, [refs]).score

print("\n\n===== BLEU SCORES =====")
print(f"Zero-shot BLEU: {bleu_zero:.2f}")
print(f"One-shot  BLEU: {bleu_one:.2f}")
print(f"Few-shot  BLEU: {bleu_few:.2f}")

# Comparison with Google Translate
google_translator = GoogleTranslator()
google_results = [google_translator.translate(eng, src='en', dest='fa').text for eng, _ in dataset]
bleu_google = sacrebleu.corpus_bleu(google_results, [refs]).score

print(f"Google Translate BLEU: {bleu_google:.2f}")






=== ZERO-SHOT TRANSLATIONS ===

1.  English: I woke up early this morning.
    Persian (ZERO-SHOT): wokehpamāye āmīzāy khī darām.
------------------------------------------------------------

2.  English: She is reading a very interesting book.
    Persian (ZERO-SHOT): کتابی interesting به reading است.
------------------------------------------------------------

3.  English: They went to the park to play football.
    Persian (ZERO-SHOT): بازی فوتبال در پارک انجام شد.
------------------------------------------------------------

4.  English: We had dinner at a nice restaurant last night.
    Persian (ZERO-SHOT): ما با هم شامی last nacht داد.
------------------------------------------------------------

5.  English: He doesn't like watching horror movies.
    Persian (ZERO-SHOT): horror باری نمی‌بین.
------------------------------------------------------------

6.  English: Can you help me with this math problem?
    Persian (ZERO-SHOT): ؟
---------------------------------------------

Changing the prompt to forbid extra output removed the appended explanations, but words from other languages and romanized text still appear in the translations.  
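
To quantify how much non-Persian text remains, one option is to measure the share of tokens written outside the Arabic script. The following is a minimal sketch. `foreign_token_ratio` is a hypothetical helper, and treating the Unicode range U+0600 to U+06FF as "Persian script" is an assumption that ignores rarer extension blocks.

```python
def is_persian_letter(ch):
    """True for letters in the basic Arabic Unicode block, which covers Persian."""
    return "\u0600" <= ch <= "\u06FF"


def foreign_token_ratio(text):
    """Share of word tokens that contain at least one non-Persian letter."""
    # Ignore tokens without letters, such as standalone punctuation or digits.
    tokens = [t for t in text.split() if any(ch.isalpha() for ch in t)]
    if not tokens:
        return 0.0
    foreign = [
        t for t in tokens
        if not all(is_persian_letter(ch) for ch in t if ch.isalpha())
    ]
    return len(foreign) / len(tokens)


print(foreign_token_ratio("ما با هم شامی last nacht داد."))
print(foreign_token_ratio("او دارد یک کتاب بسیار جالب می‌خواند."))
```

A ratio above zero flags outputs such as romanized or mixed-language sentences without requiring a manual read of every line.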

With a more explicit prompt:


In [None]:
#  Zero-shot
def make_zero_shot_prompt(eng):
    return f"""Translate the following English sentence into **Persian only**.
Do not use any other language. Do not add explanations or transliterations.
Output format: <translation>

English: {eng}
Persian:"""

# One-shot
def make_one_shot_prompt(eng):
    example_eng, example_per = dataset[0]
    return f"""Translate the following English sentences into Persian.
Return ONLY the Persian translation. Do not include English text or any explanation.
Use the examples to understand the style.

Example:
English: {example_eng}
Persian: {example_per}

Now translate the next sentence:
English: {eng}
Persian:"""


#  Few-shot
def make_few_shot_prompt(eng):
    examples = "Translate the following English sentences into **Persian only**.\nDo not use any other language or explanations.\n\n"
    for e, p in dataset[:3]:
        examples += f"English: {e}\nPersian: {p}\n\n"
    examples += f"English: {eng}\nPersian:"
    return examples





# translate_with_prompt is unchanged from the first Gemma cell, so it is reused here rather than redefined.



results_zero = translate_with_prompt(make_zero_shot_prompt, "ZERO-SHOT")
results_one = translate_with_prompt(make_one_shot_prompt, "ONE-SHOT")
results_few = translate_with_prompt(make_few_shot_prompt, "FEW-SHOT")


refs = [p for _, p in dataset]

bleu_zero = sacrebleu.corpus_bleu(results_zero, [refs]).score
bleu_one = sacrebleu.corpus_bleu(results_one, [refs]).score
bleu_few = sacrebleu.corpus_bleu(results_few, [refs]).score

print("\n\n===== BLEU SCORES =====")
print(f"Zero-shot BLEU: {bleu_zero:.2f}")
print(f"One-shot  BLEU: {bleu_one:.2f}")
print(f"Few-shot  BLEU: {bleu_few:.2f}")

# Comparison with Google Translate
google_translator = GoogleTranslator()
google_results = [google_translator.translate(eng, src='en', dest='fa').text for eng, _ in dataset]
bleu_google = sacrebleu.corpus_bleu(google_results, [refs]).score

print(f"Google Translate BLEU: {bleu_google:.2f}")




=== ZERO-SHOT TRANSLATIONS ===

1.  English: I woke up early this morning.
    Persian (ZERO-SHOT): wokehpamāye āmīzīnā āyī.
------------------------------------------------------------

2.  English: She is reading a very interesting book.
    Persian (ZERO-SHOT): **نویّاً کتابی‌ای باّ interestingّیّ کتابّیّ readingّیّیّ است.
------------------------------------------------------------

3.  English: They went to the park to play football.
    Persian (ZERO-SHOT): به پارک رفتند تا بازی football انجام کنند.
------------------------------------------------------------

4.  English: We had dinner at a nice restaurant last night.
    Persian (ZERO-SHOT): ما با هم شامی last nacht داد.
------------------------------------------------------------

5.  English: He doesn't like watching horror movies.
    Persian (ZERO-SHOT): horror-movile-ke-dastgah-de-nikhteh-ye-dastgah-ye-dastgah-ye-dastgah-ye-dastgah-ye-dastgah-ye-dastgah-ye-dastgah-ye-dastgah-ye-dastgah-ye-dastgah-ye-dastgah-ye-dastgah-ye-

Summary of the results across the three prompt versions and their BLEU scores:

**Zero-shot:** Almost all sentences contain English, Korean, Russian, or otherwise garbled characters.  
BLEU: between 1.11 and 3.65 (very low).  
With no example to follow, the model does not reliably produce Persian-only output.  
As a result, many English or mixed-language words appear. This setting is practically unusable for this model.

**One-shot:** Quality improved, and some sentences were translated correctly into Persian, such as:  
"I woke up early this morning." → "من امروز صبح زود بیدار شدم."  
Note that this sentence is `dataset[0]`, the example given in the one-shot prompt, so the model can simply copy it.  
However, many sentences still contain English or mixed words, e.g.:  
"She is reading a very interesting book." → "کتابی interesting به خواند است."  
BLEU: between 13.59 and 15.02 (much better than zero-shot but still low).

**Few-shot:** Produces the best quality of the three settings. Some sentences matched the reference exactly:  
"She is reading a very interesting book." → "او دارد یک کتاب بسیار جالب می‌خواند."  
"They went to the park to play football." → "آن‌ها برای بازی فوتبال به پارک رفتند."  
Both sentences are among the three few-shot examples (`dataset[:3]`) and therefore appear verbatim in the prompt, so the few-shot BLEU is likely inflated by this overlap.  
Some sentences still have issues and contain English words, but fewer than in the previous two cases.  
BLEU: between 31.92 and 35.12 (close to Google Translate: 37.23).
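
Because the one-shot and few-shot examples are drawn from `dataset[0]` and `dataset[:3]`, which are also scored, part of the BLEU gain can come from copying the prompt. The following is a minimal sketch of keeping the examples out of the scored set. `split_for_icl` and `leaked_sources` are hypothetical helpers, and the toy pairs stand in for the real dataset.

```python
def split_for_icl(pairs, n_demos):
    """Return (demos, eval_pairs) so that no sentence is both shown and scored."""
    if not 0 <= n_demos < len(pairs):
        raise ValueError("n_demos must leave at least one pair for evaluation")
    return pairs[:n_demos], pairs[n_demos:]


def leaked_sources(demos, eval_pairs):
    """List the scored source sentences that also appear as prompt examples."""
    demo_sources = {eng for eng, _ in demos}
    return [eng for eng, _ in eval_pairs if eng in demo_sources]


toy_pairs = [("one", "یک"), ("two", "دو"), ("three", "سه"), ("four", "چهار"), ("five", "پنج")]
demos, eval_pairs = split_for_icl(toy_pairs, n_demos=3)
print(leaked_sources(demos, eval_pairs))         # held-out split: nothing leaks
print(leaked_sources(toy_pairs[:3], toy_pairs))  # scoring every pair: the examples leak
```

Scoring all three settings on the same held-out pairs (the last seven sentences of `dataset` if three are reserved as examples) would also keep the zero-shot, one-shot, and few-shot BLEU values directly comparable.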


## Model Limitations in This Exercise

- The evaluation set is very small (only 10 sentences), so the BLEU scores are noisy and should be read as rough indications.  
- No fine-tuning was performed on Persian data.  
- Although the model has seen multilingual data, its proficiency in Persian is limited.  
- ICL with only a few examples has its own limits, which also made the outputs less accurate and natural.
