# My Step-by-Step Approach

1. **Prepare the Base Model:**  
   I begin with a Bengali–English model that has been fine-tuned solely on high-quality Bengali–English parallel data.

2. **Generate Synthetic Bengali–Assamese Data:**  
   - I collect a set of Bengali monolingual sentences.  
   - I use an LLM (e.g., ChatGPT) to automatically generate Assamese translations for these Bengali sentences.  
   - Optionally, I review and filter these synthetic pairs to ensure quality.

3. **Fine-Tune with Synthetic Data:**  
   I further fine-tune my Bengali–English model using the synthetic Bengali–Assamese parallel data. This step trains the model to output Assamese when given Bengali input, effectively mapping Bengali representations to Assamese text.

4. **Test the New Capability:**  
   - I evaluate the fine-tuned model by feeding it Bengali input and confirming that it produces coherent Assamese outputs.  
   - For zero-shot English-to-Assamese translation, I first translate English into Bengali (using a separate English–Bengali model) and then pass the Bengali text to my fine-tuned model.

5. **Iterate as Needed:**  
   If the results aren’t satisfactory, I refine the synthetic data quality or adjust the fine-tuning parameters.

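This two-stage process combines conventional fine-tuning with LLM-generated synthetic data. It resembles back-translation, except that the synthetic side here is the target (Assamese) rather than the source. English-to-Assamese translation is then obtained indirectly by pivoting through Bengali, without any English–Assamese parallel data.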

### Step 1: Prepare the Base Model
   I begin with a Bengali–English model that has been fine-tuned solely on high-quality Bengali–English parallel data.

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from normalizer import normalize  # Install via: pip install git+https://github.com/csebuetnlp/normalizer

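# Model card: https://huggingface.co/csebuetnlp/banglat5
# Note: as far as I can tell, this repository hosts the pretrained BanglaT5
# checkpoint, and the Bengali-to-English fine-tune is published separately
# (e.g. csebuetnlp/banglat5_nmt_bn_en). Verify which checkpoint is intended.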

# Load the BanglaT5 model and its tokenizer
model_name = "csebuetnlp/banglat5"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)


In [9]:
# Example Bengali sentence (change this to your desired input)
input_sentence = "আমি বাংলায় কথা বলি।"

# Normalize and tokenize the input sentence
normalized_text = normalize(input_sentence)
input_ids = tokenizer(normalized_text, return_tensors="pt").input_ids

# Generate with the default decoding settings. The beam-search and repetition
# controls below are commented out; uncomment them to enable them.
generated_tokens = model.generate(
    input_ids,
    # max_length=50,
    # num_beams=5,                 # Use beam search with 5 beams
    # no_repeat_ngram_size=3,      # Prevent 3-gram repetitions
    # repetition_penalty=1.2,      # Slight penalty for repetitions
    # early_stopping=True
)
translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]

print("Input (Bengali):", input_sentence)
print("Translation (English):", translation)

Input (Bengali): আমি বাংলায় কথা বলি।
Translation (English): I speak in Bengali. "আমি বাংলা ভাষায় কথা বলি। "আমি বাংলায় কথা বলি।


---
### Step 2: Generate Synthetic Bengali–Assamese Data
   - I collect a set of Bengali monolingual sentences.  
   - I use an LLM (e.g., ChatGPT) to automatically generate Assamese translations for these Bengali sentences.  
   - Optionally, I review and filter these synthetic pairs to ensure quality.
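
A minimal sketch of the optional review-and-filter step, using only cheap heuristics. The function name `filter_pairs`, the length-ratio bounds, and the rule "reject any target containing the Bengali letter র" are assumptions for illustration, not part of the original plan. The last rule relies on Assamese normally writing ৰ (U+09F0) where Bengali writes র (U+09B0), so it may wrongly drop sentences that quote Bengali names or words.

```python
import unicodedata

# Bengali র (U+09B0). Assamese normally uses ৰ (U+09F0) instead, so finding
# this letter in the target suggests the LLM produced or copied Bengali.
BENGALI_RA = "\u09b0"


def filter_pairs(pairs, min_ratio=0.5, max_ratio=2.0):
    """Keep (bengali, assamese) pairs that pass a few heuristic checks."""
    seen = set()
    kept = []
    for bn, asm in pairs:
        bn = unicodedata.normalize("NFC", bn).strip()
        asm = unicodedata.normalize("NFC", asm).strip()
        if not bn or not asm:
            continue  # empty side
        if asm == bn:
            continue  # the source was copied instead of translated
        if BENGALI_RA in asm:
            continue  # target looks like Bengali, not Assamese
        ratio = len(asm) / len(bn)
        if not (min_ratio <= ratio <= max_ratio):
            continue  # implausible length difference (illustrative bounds)
        if (bn, asm) in seen:
            continue  # exact duplicate
        seen.add((bn, asm))
        kept.append((bn, asm))
    return kept


candidates = [
    ("আমি ভাত খাই।", "মই ভাত খাওঁ।"),      # plausible pair
    ("আমি ভাত খাই।", "মই ভাত খাওঁ।"),      # duplicate
    ("আমার নাম রহিম।", "আমার নাম রহিম।"),  # copied source
    ("আমার নাম রহিম।", "মোর নাম রহিম।"),   # Bengali র in the target
    ("আমি ভাত খাই।", ""),                   # empty target
]
clean = filter_pairs(candidates)
print(len(clean), "of", len(candidates), "pairs kept")
```

Heuristics like these only catch obvious failures; a manual spot check of a random sample is still needed to judge translation quality.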


### Step 3: Fine-Tune with Synthetic Data
   I further fine-tune my Bengali–English model using the synthetic Bengali–Assamese parallel data. This step trains the model to output Assamese when given Bengali input, effectively mapping Bengali representations to Assamese text.
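
The fine-tuning step could be sketched with the Hugging Face `Seq2SeqTrainer`, as below. This is an outline under stated assumptions rather than a tested recipe: the helper names `split_pairs` and `fine_tune`, the output directory, and every hyperparameter value are placeholders, and input normalization is left out for brevity. A plain list of feature dicts stands in for a `datasets.Dataset` to keep the example short.

```python
import random


def split_pairs(pairs, val_fraction=0.1, seed=13):
    """Shuffle once with a fixed seed and hold out a validation slice."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def fine_tune(model, tokenizer, train_pairs, val_pairs, output_dir="bn-as-finetuned"):
    """Continue training a seq2seq model on (Bengali, Assamese) pairs."""
    # Imported lazily so split_pairs works without transformers installed.
    from transformers import (
        DataCollatorForSeq2Seq,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    def encode(pairs):
        # text_target tokenizes the Assamese side into the "labels" field.
        return [
            dict(tokenizer(bn, text_target=asm, truncation=True, max_length=128))
            for bn, asm in pairs
        ]

    args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        learning_rate=5e-5,              # placeholder, not tuned
        num_train_epochs=3,              # placeholder, not tuned
        per_device_train_batch_size=8,   # placeholder, not tuned
        predict_with_generate=True,
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=encode(train_pairs),
        eval_dataset=encode(val_pairs),
        # Pads inputs and labels per batch; label padding is ignored by the loss.
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()
    return trainer
```

Before training, it is worth checking that the tokenizer does not map the Assamese letters ৰ and ৱ to its unknown token, since the vocabulary was built for Bengali.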

### Step 4: Test the New Capability
   - I evaluate the fine-tuned model by feeding it Bengali input and confirming that it produces coherent Assamese outputs.  
   - For zero-shot English-to-Assamese translation, I first translate English into Bengali (using a separate English–Bengali model) and then pass the Bengali text to my fine-tuned model.
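
The two-hop route from English to Assamese can be sketched as a composition of two translation functions. The names `make_translator` and `pivot_translate` are hypothetical, and the dictionary-backed stand-ins at the bottom exist only so the control flow runs without downloading any model; in practice each hop would wrap a real English–Bengali or Bengali–Assamese model.

```python
from typing import Callable

Translator = Callable[[str], str]


def make_translator(model, tokenizer, max_new_tokens: int = 128) -> Translator:
    """Wrap a Hugging Face seq2seq model as a plain str -> str function."""
    def translate(text: str) -> str:
        inputs = tokenizer(text, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
    return translate


def pivot_translate(text: str, en_to_bn: Translator, bn_to_as: Translator) -> dict:
    """English -> Bengali -> Assamese, keeping the pivot for inspection."""
    bengali = en_to_bn(text)
    assamese = bn_to_as(bengali)
    return {"english": text, "bengali": bengali, "assamese": assamese}


# Stand-in translators (lookup tables), used here instead of real models.
EN_BN = {"I eat rice.": "আমি ভাত খাই।"}
BN_AS = {"আমি ভাত খাই।": "মই ভাত খাওঁ।"}

result = pivot_translate(
    "I eat rice.",
    en_to_bn=lambda text: EN_BN[text],
    bn_to_as=lambda text: BN_AS[text],
)
print(result["bengali"], "->", result["assamese"])
```

Keeping the intermediate Bengali text in the result makes it easier to tell whether a bad Assamese output comes from the first hop or the second, since errors in the English-to-Bengali step carry over to the final translation.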

### Step 5: Iterate as Needed
   If the results aren’t satisfactory, I refine the synthetic data quality or adjust the fine-tuning parameters.
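
Deciding whether an iteration helped requires some score to compare runs. The sketch below computes a simplified character n-gram F-score against reference translations. It assumes a small held-out set of trusted Assamese references is available, which the plan above does not mention, and it is not the standard chrF implementation (sacreBLEU provides that); the function names and the defaults `max_n=3` and `beta=2.0` are illustrative.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def char_fscore(hypothesis: str, reference: str, max_n: int = 3, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score, averaged over n = 1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short for this n-gram order
        overlap = sum((hyp & ref).values())
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
            continue
        scores.append(
            (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        )
    return sum(scores) / len(scores) if scores else 0.0


def mean_fscore(hypotheses, references) -> float:
    """Average the per-sentence score over a held-out set."""
    pairs = list(zip(hypotheses, references))
    return sum(char_fscore(h, r) for h, r in pairs) / len(pairs)
```

Running `mean_fscore` on the same held-out set after each change to the synthetic data or the training parameters gives a rough signal of whether the change helped, though manual inspection of outputs remains important for a low-resource language.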