# Task
Evaluate the Llama model's translation performance on 100 sampled rows from the `shenasa/English-Persian-Parallel-Dataset` by calculating and presenting BLEU, CHRF++, and BERTScore metrics.

## Load Model, Tokenizer, and Dataset

### Subtask:
Load the Llama model and tokenizer using `AutoModelForCausalLM` and `AutoTokenizer` from `transformers` as specified. Then, load the `shenasa/English-Persian-Parallel-Dataset` using `load_dataset` from `datasets`.


In [20]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Choose a specific Llama model for translation
model = AutoModelForCausalLM.from_pretrained("Sheikhaei/llama-3.2-1b-en-fa-translator", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Sheikhaei/llama-3.2-1b-en-fa-translator")


# Load the dataset
dataset = load_dataset('shenasa/English-Persian-Parallel-Dataset')

# Print the loaded dataset to inspect its structure
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['flash fire .', 'فلاش آتش .'],
        num_rows: 3960172
    })
})


## Inspect Dataset and Prepare Sample

### Subtask:
Inspect the structure of the loaded dataset to identify the appropriate source and target language columns. Then, sample 100 rows from the dataset for evaluation, making sure to extract the source and target texts.


In [50]:
import random

source_column = 'flash fire .'
target_column = 'فلاش آتش .'

# Sample 100 random rows from the 'train' split
sampled_indices = random.sample(range(len(dataset['train'])), 100)
sampled_dataset = dataset['train'].select(sampled_indices)

# Extract source and target texts
source_texts = sampled_dataset[source_column]
target_texts = sampled_dataset[target_column]

print(f"Sampled {len(source_texts)} source texts and {len(target_texts)} target texts.")
print("First 3 source texts:", source_texts[:3])
print("First 3 target texts:", target_texts[:3])

Sampled 100 source texts and 100 target texts.
First 3 source texts: ['for Janni was physically precocious and had quickly taken the leadership of the pair.', 'Whatever the cause, you should know that God hates divorce and says:', 'LanSAages: Kazakh, Russian']
First 3 target texts: ['زیرا جانی از نظر بدنی زودرس بود و به سرعت رهبری این زوج را به دست گرفته بود .', 'علتش هرچه که باشد باید بدانید که خداوند از طلاق بیزار است و می فرماید :', 'LanSAages : قزاق ، روسی']


In [51]:
import torch
import random

# Initialize a list to store generated Persian texts
generated_persian_texts = []

# Generate 10 random indices from the 100 available samples
# source_texts and target_texts already contain 100 random samples from the full dataset.
random_100_indices = random.sample(range(len(source_texts)), 100)

# Iterate through the 100 random items from the source_texts and target_texts lists
for i, idx in enumerate(random_100_indices):
    english_text = source_texts[idx]
    actual_persian_text = target_texts[idx]

    print(f"\n--- Sample {i+1} ---")

    # a. Construct the prompt string
    prompt = f"### English:\n{english_text}\n### Persian:\n"

    # b. Tokenize the prompt
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)

    # c. Generate the translation
    # Ensure to only generate new tokens, not re-generate the prompt
    # max_new_tokens is crucial here to prevent infinite generation
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id # Set pad_token_id to eos_token_id
    )

    # d. Decode the generated tokens back to a string
    # Decode the *new* tokens, not the input prompt
    translated_text = tokenizer.decode(outputs[0][len(inputs['input_ids'][0]):], skip_special_tokens=True)

    # e. Extract only the Persian translation from translated_text
    # The model might sometimes generate text that includes the prompt structure itself if not properly constrained.
    # We are looking for the content that *should* be the Persian translation.
    predicted_persian_text = translated_text.strip()

    # f. Append the predicted_persian_text to the generated_persian_texts list
    generated_persian_texts.append(predicted_persian_text)

    # g. Print the original English text, the predicted_persian_text, and the actual Persian text
    print(f"Original English: {english_text}")
    print(f"Predicted Persian: {predicted_persian_text}")
    print(f"Actual Persian: {actual_persian_text}")

print(f"\nGenerated {len(generated_persian_texts)} Persian translations.")


--- Sample 1 ---
Original English: Compare various nutrients in fruits, vegetables, cereals and legumes that are rich in Water.
Predicted Persian: تفاوت مواد مغذی مختلف در میوه‌ها، سبزیجات، گلاتان و لجن‌داران را با توجه به غلظت آب بررسی کنید.
Actual Persian: مواد مغذی مختلف موجود در میوه ها ، سبزیجات ، غلات و حبوبات را که سرشار از آب هستند ، مقایسه کنید .

--- Sample 2 ---
Original English: The main changes are:
Predicted Persian: تغییرات اصلی:
Actual Persian: تغییرات اصلی عبارتند از :

--- Sample 3 ---
Original English: Basics of Ecology (for students of non-biological specialties of the University).
Predicted Persian: معلولیت‌ها در اکولوژی (برای دانشجویان رشته‌های غیرزیست‌شناسی دانشگاهی).
Actual Persian: مبانی بوم شناسی ( برای دانشجویان رشته های غیر زیستی دانشگاه ).

--- Sample 4 ---
Original English: A 17th century Chateau of of about 1000 m2 habittable plus numerous annexes in an exceptional and tranquil setting of 150 hectares hence 40 hectaures of vineyards all A.O.C..Beautiful 

In [35]:
!pip install evaluate bert_score



# Task
Calculate BLEU, CHRF++, and BERTScore metrics for the 10 generated predictions (`generated_persian_texts`) against their corresponding reference target language sentences (`target_texts[:10]`) and present the results.

## Calculate BLEU Score

### Subtask:
Compute the BLEU score between the 10 generated predictions and the corresponding reference target language sentences.


**Reasoning**:
I will import the `evaluate` library, load the 'bleu' metric, prepare the `generated_persian_texts` and `target_texts` for the first 10 samples in the correct format, compute the BLEU score, and then print it.



In [53]:
import evaluate

# Load the BLEU metric
bleu = evaluate.load("bleu")

# Prepare references: each reference needs to be in a list
# We generated only 10 predictions, so we take the first 100 target texts.
references_for_bleu = [[text] for text in target_texts[:100]]

# Compute BLEU score
results = bleu.compute(predictions=generated_persian_texts, references=references_for_bleu)

# Print the BLEU score
print("BLEU score:", results["bleu"])

BLEU score: 0.0044026828556608215


In [54]:
import evaluate

#!pip install sacrebleu

# Load the CHRF++ metric
chrf = evaluate.load("chrf")

# Prepare references: each reference needs to be in a list
# We generated only 10 predictions, so we take the first 100 target texts.
# The CHRF++ metric expects references as a list of lists of strings, just like BLEU.
references_for_chrf = [[text] for text in target_texts[:100]]

# Compute CHRF++ score
chrf_results = chrf.compute(predictions=generated_persian_texts, references=references_for_chrf)

# Print the CHRF++ score
print("CHRF++ score:", chrf_results["score"])

CHRF++ score: 7.6024692349810294


**Reasoning**:
With BLEU and CHRF++ scores calculated, the next step is to compute the BERTScore for the 10 generated predictions against their corresponding reference target language sentences. This will involve loading the 'bertscore' metric from the `evaluate` library, ensuring the references are in the correct format, computing the score, and then printing the result.



In [55]:
import evaluate

# Load the BERTScore metric
bertscore = evaluate.load("bertscore")

# Prepare references: each reference needs to be in a list
# We generated only 10 predictions, so we take the first 100 target texts.
# BERTScore expects references as a list of lists of strings, similar to BLEU/CHRF++
references_for_bertscore = [[text] for text in target_texts[:100]]

# Compute BERTScore
bertscore_results = bertscore.compute(
    predictions=generated_persian_texts,
    references=references_for_bertscore,
    lang="fa" # Specify the language as Persian
)

# Print the BERTScore results (P, R, F1)
print("BERTScore Precision:", bertscore_results["precision"])
print("BERTScore Recall:", bertscore_results["recall"])
print("BERTScore F1:", bertscore_results["f1"])

BERTScore Precision: [0.6287417411804199, 0.6803514361381531, 0.6054458618164062, 0.5365273952484131, 0.6286966800689697, 0.6461114287376404, 0.7391335964202881, 0.6548765897750854, 0.7087835073471069, 0.623328447341919, 0.5369585156440735, 0.6535769104957581, 0.43197041749954224, 0.6227776408195496, 0.7288725972175598, 0.598240315914154, 0.6820625066757202, 0.580910325050354, 0.5808415412902832, 0.5924385786056519, 0.6192103028297424, 0.5028626322746277, 0.5990843772888184, 0.6556053161621094, 0.5775143504142761, 0.686211884021759, 0.5187636613845825, 0.6389246582984924, 0.6780828237533569, 0.6063700318336487, 0.6839640736579895, 0.5872628688812256, 0.6212063431739807, 0.6209208369255066, 0.6213726997375488, 0.608467698097229, 0.5088251829147339, 0.6929677724838257, 0.5036611557006836, 0.5510408878326416, 0.626553475856781, 0.5942873954772949, 0.5757968425750732, 0.6346117258071899, 0.6681174039840698, 0.6149377226829529, 0.5847890377044678, 0.6771902441978455, 0.7102380394935608, 0.6

Bleu :BLEU score: 0.0044026828556608215

ChRF : 7.6024692349810294

Bertscore:0.6287417411804199