## Load Model and Tokenizer

### Subtask:
Load the 'facebook/mbart-large-50-many-to-many-mmt' model and its tokenizer from Hugging Face Transformers. Set the source language to English ('en_XX') and the target language to Persian ('fa_IR').


In [None]:
# The BLEU evaluation below needs `evaluate` and `sacrebleu`.
# Uncomment and run the next line if they are not installed yet.
#!pip install evaluate sacrebleu

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

# Set the source language to English
tokenizer.src_lang = "en_XX"

# Load the model
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

# Set the target language for the tokenizer
tokenizer.tgt_lang = "fa_IR"

print("Model and Tokenizer loaded successfully.")
print(f"Tokenizer source language set to: {tokenizer.src_lang}")
print(f"Tokenizer target language set to: {tokenizer.tgt_lang}")


Model and Tokenizer loaded successfully.
Tokenizer source language set to: en_XX
Tokenizer target language set to: fa_IR


**Reasoning**:
The previous step loaded the model and tokenizer. The next step is to prepare the input data for translation, which means extracting the first 500 English sentences from the dataset. The `ds` variable that holds the dataset is not defined yet, because the cell that loads it (cell_id: kptDR97Q6zVG) has not been executed, so the dataset loading cell is run first.



In [None]:
from datasets import load_dataset

ds = load_dataset("shenasa/English-Persian-Parallel-Dataset")


## Prepare Dataset Subset

### Subtask:
Extract the first 500 rows from the loaded dataset for evaluation. Create a pandas DataFrame to store the source text, model predictions, and ground truth.


In [None]:
print(ds['train'].features)
print(ds['train'][0])

{'flash fire .': Value('string'), 'فلاش آتش .': Value('string')}
{'flash fire .': 'superheats the air . burns the lungs like rice paper .', 'فلاش آتش .': 'هوا را فوق العاده گرم می کند . ریه ها را مثل کاغذ برنج می سوزاند .'}
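
Note that the two feature names printed above are themselves a sentence pair. This suggests that the TSV file has no header row and that its first line was consumed as column names, although the notebook does not confirm this. A minimal sketch, under that assumption, of mapping rows to readable keys; the helper `to_named_pairs` and the keys `en` and `fa` are hypothetical names introduced here for illustration:

```python
# Column names as reported by ds['train'].features in this notebook.
EN_COL = "flash fire ."
FA_COL = "فلاش آتش ."


def to_named_pairs(rows, include_header_pair=False):
    """Map rows keyed by the odd column names to {'en': ..., 'fa': ...} dicts.

    `rows` is any iterable of dicts, e.g. ds['train'].select(range(500)).
    If `include_header_pair` is True, the column names themselves are added
    as the first pair (assumption: they are a real sentence pair that was
    read as the header).
    """
    pairs = [{"en": row[EN_COL], "fa": row[FA_COL]} for row in rows]
    if include_header_pair:
        pairs.insert(0, {"en": EN_COL, "fa": FA_COL})
    return pairs
```

With the default `include_header_pair=False`, the result contains exactly the rows passed in, so it selects the same sentences as the extraction cell in this notebook.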


In [None]:
import pandas as pd

first_500_entries = ds['train'].select(range(500))
english_texts = [entry['flash fire .'] for entry in first_500_entries]
persian_ground_truth_texts = [entry['فلاش آتش .'] for entry in first_500_entries]

df_evaluation = pd.DataFrame({
    'Source Text (EN)': english_texts,
    'Ground Truth (FA)': persian_ground_truth_texts,
    'Model Prediction (FA)': [''] * len(english_texts) # Initialize with empty strings
})

print(f"Extracted {len(english_texts)} English sentences and {len(persian_ground_truth_texts)} Persian ground truth sentences.")
print("Evaluation DataFrame created successfully with the first 5 entries:")
print(df_evaluation.head())

Extracted 500 English sentences and 500 Persian ground truth sentences.
Evaluation DataFrame created successfully with the first 5 entries:
                                    Source Text (EN)  \
0  superheats the air . burns the lungs like rice...   
1               hey , guys . down here . down here .   
2  what do you got down this corridor is the bow ...   
3  theres an access hatch right there that puts u...   
4  we get into the propeller tubes and the only t...   

                                   Ground Truth (FA) Model Prediction (FA)  
0  هوا را فوق العاده گرم می کند . ریه ها را مثل ک...                        
1              سلام بچه ها . این پایین . این پایین .                        
2   چه چیزی در این راهرو پایین آمده است ، درست است .                        
3  یک دریچه دسترسی درست در آنجا وجود دارد که ما ر...                        
4  وارد لوله های پروانه می شویم و تنها چیزی که بی...                        


## Generate Translations

### Subtask:
Iterate through the first 500 rows. For each row, tokenize the English source text, generate a translation using the loaded mBART model, and decode the generated tokens into Persian text. Store the source, prediction, and ground truth (taken from the Persian column of the dataset) for each row.


In [None]:
import torch

# Initialize an empty list to store model predictions
model_predictions = []

# Set device for model (GPU if available, else CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Iterate through each English source text
for english_text in df_evaluation['Source Text (EN)']:
    # Tokenize the English text
    encoded_input = tokenizer(english_text, return_tensors="pt", padding=True, truncation=True).to(device)

    # Generate the translation
    generated_tokens = model.generate(
        **encoded_input,
        forced_bos_token_id=tokenizer.lang_code_to_id["fa_IR"],
        max_length=512, # A reasonable max_length for typical sentences
        num_beams=5 # Using beam search for better translation quality
    )

    # Decode the generated tokens into Persian text
    decoded_translation = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)

    # Append the decoded translation to the list
    model_predictions.append(decoded_translation)

# Update the 'Model Prediction (FA)' column in the df_evaluation DataFrame
df_evaluation['Model Prediction (FA)'] = model_predictions

print("Translations generated and updated in df_evaluation.")
print("First 5 entries of df_evaluation with predictions:")
print(df_evaluation.head())

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Translations generated and updated in df_evaluation.
First 5 entries of df_evaluation with predictions:
                                    Source Text (EN)  \
0  superheats the air . burns the lungs like rice...   
1               hey , guys . down here . down here .   
2  what do you got down this corridor is the bow ...   
3  theres an access hatch right there that puts u...   
4  we get into the propeller tubes and the only t...   

                                   Ground Truth (FA)  \
0  هوا را فوق العاده گرم می کند . ریه ها را مثل ک...   
1              سلام بچه ها . این پایین . این پایین .   
2   چه چیزی در این راهرو پایین آمده است ، درست است .   
3  یک دریچه دسترسی درست در آنجا وجود دارد که ما ر...   
4  وارد لوله های پروانه می شویم و تنها چیزی که بی...   

                               Model Prediction (FA)  
0  هوا را بسیار گرم می کند. ریه ها را مانند کاغذ ...  
1             هی ، بچه ها. اینجا پایین. اینجا پایین.  
2  چیزی که شما در این راهرو می گیرید قوس است ، در...  
3 
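
The loop above translates one sentence per `generate` call. A possible alternative, sketched here and not part of the original run, is to translate several sentences per call. The helper names `iter_batches` and `translate_in_batches` and the parameters `batch_size` and `max_input_length` are assumptions made for this sketch. Passing an explicit `max_length` to the tokenizer should also avoid the truncation warning shown above. Batched outputs may differ slightly from the single-sentence results because of padding.

```python
from typing import Iterator, List, Sequence


def iter_batches(texts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


def translate_in_batches(texts, tokenizer, model, device,
                         batch_size=16, max_input_length=256):
    """Translate `texts` to Persian batch by batch and return the decoded strings.

    Assumes `tokenizer` and `model` are the mBART-50 objects loaded earlier
    and that `model` has already been moved to `device`.
    """
    import torch  # imported here so iter_batches can be used without torch

    predictions = []
    for batch in iter_batches(texts, batch_size):
        # Pad to the longest sentence in the batch; truncate overly long inputs.
        encoded = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_input_length,
        ).to(device)
        with torch.no_grad():
            generated = model.generate(
                **encoded,
                forced_bos_token_id=tokenizer.lang_code_to_id["fa_IR"],
                max_length=512,
                num_beams=5,
            )
        predictions.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return predictions
```

A call such as `translate_in_batches(df_evaluation['Source Text (EN)'].tolist(), tokenizer, model, device)` would then return one prediction per source sentence, in the same order.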

In [None]:
print("Displaying 10 random samples from the evaluation DataFrame:")
print(df_evaluation.sample(n=10))

Displaying 10 random samples from the evaluation DataFrame:
                                      Source Text (EN)  \
452  it means that the ministry it is going to inte...   
36                        l got him . come on . lift .   
48   we throw it up into the props , and itll jam e...   
310  discipline audience of the day twelve of august .   
24                                            maggie .   
128                     it moans sleeping all nights .   
197                      how good that you feel well .   
324  you know that it is prohibited to use magic ou...   
199                    a without time to explanation .   
182                     but you did not go , not yet .   

                                     Ground Truth (FA)  \
452  این بدان معناست که وزارتخانه در هاگوارتز مداخل...   
36                       گرفتمش بیا دیگه . بلند کردن .   
48   ما آن را در پایه ها می اندازیم و همه چیز را مخ...   
310                  انضباط مخاطبان روز دوازده مرداد .   
24         

The number to the left of each sample is its row index in the evaluation subset. The rows were chosen at random.

## Evaluate Model Performance with BLEU Score

### Subtask:
Calculate the BLEU score to quantitatively assess the translation quality of the mBART model against the ground truth translations. Display the calculated BLEU score.

In [None]:


import evaluate

# Load the BLEU metric
metric = evaluate.load("sacrebleu")

# Prepare references and predictions for BLEU calculation
# SacreBLEU expects references as a list of lists (each inner list contains one reference translation for a segment)
references = [[ref] for ref in df_evaluation['Ground Truth (FA)'].tolist()]
predictions = df_evaluation['Model Prediction (FA)'].tolist()

# Compute the BLEU score
bleu_score = metric.compute(predictions=predictions, references=references)

print("BLEU Score for the translations:")
print(bleu_score)

BLEU Score for the translations:
{'score': 29.586730098081215, 'counts': [2640, 1374, 775, 436], 'totals': [4346, 3846, 3346, 2860], 'precisions': [60.745513115508516, 35.725429017160685, 23.16198445905559, 15.244755244755245], 'bp': 1.0, 'sys_len': 4346, 'ref_len': 4010}
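
The fields in this dictionary are related: the score is the brevity penalty (`bp`) multiplied by the geometric mean of the four n-gram precisions, and each precision is the matching entry of `counts` divided by `totals`. Because the system output (`sys_len` = 4346) is longer than the reference (`ref_len` = 4010), the brevity penalty is 1.0. A minimal sketch that recomputes the score from the reported statistics; the function name `bleu_from_stats` is hypothetical, and the sketch assumes every n-gram count is non-zero, so no smoothing is needed:

```python
import math


def bleu_from_stats(counts, totals, sys_len, ref_len):
    """Recompute a corpus BLEU score (0-100) from n-gram match statistics."""
    # Modified n-gram precisions for n = 1..4.
    precisions = [c / t for c, t in zip(counts, totals)]
    # Brevity penalty: only penalize output that is shorter than the reference.
    bp = 1.0 if sys_len >= ref_len else math.exp(1.0 - ref_len / sys_len)
    # Geometric mean of the precisions, computed in log space.
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return 100.0 * bp * math.exp(log_mean)


score = bleu_from_stats(
    counts=[2640, 1374, 775, 436],
    totals=[4346, 3846, 3346, 2860],
    sys_len=4346,
    ref_len=4010,
)
print(round(score, 2))  # 29.59
```

The precisions fall from about 60.7% for unigrams to about 15.2% for 4-grams, which shows that most individual words match the reference while longer exact phrases match much less often.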
