# Few-shot Translation + Back-Translation with mBART
This notebook demonstrates how to fine-tune a multilingual translation model (mBART-50) on a small parallel dataset and then use it for back-translation, a common data augmentation technique in low-resource machine translation.

In [None]:
!pip install transformers datasets torch accelerate --quiet

### Dataset: HackHedron English-Telugu Parallel Corpus

We use the [HackHedron/English\_Telugu\_Parallel\_Corpus](https://huggingface.co/datasets/HackHedron/English_Telugu_Parallel_Corpus) dataset hosted on Hugging Face. Its `train` split contains 433,845 English–Telugu sentence pairs in the columns `english` and `telugu`, suitable for fine-tuning multilingual models like mBART.

In [10]:
from datasets import load_dataset
dataset = load_dataset("HackHedron/English_Telugu_Parallel_Corpus")
dataset

DatasetDict({
    train: Dataset({
        features: ['english', 'telugu'],
        num_rows: 433845
    })
})

### Model: mBART-50 for English ↔ Telugu

We use the `"facebook/mbart-large-50-many-to-many-mmt"` model, which can translate directly between any pair of its 50 supported languages. We set the tokenizer's source and target language codes to `en_XX` (English) and `te_IN` (Telugu).

In [None]:
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast, Trainer, TrainingArguments

model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "te_IN"
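
The effect of `src_lang` and `tgt_lang` can be pictured with a toy example. mBART-50 wraps every sequence as `[lang_code] tokens </s>`, for source and target alike. The sketch below mimics that layout using whitespace-split words instead of real SentencePiece pieces; `format_mbart50` is a hypothetical helper for illustration, not part of `transformers`.

```python
# Toy illustration of the mBART-50 sequence layout: [lang_code] X </s>.
# `format_mbart50` is a hypothetical helper; real inputs are subword ids, not words.
EOS = "</s>"


def format_mbart50(text: str, lang_code: str) -> list[str]:
    """Prefix the language code and append the end-of-sequence marker."""
    return [lang_code] + text.split() + [EOS]


source = format_mbart50("I love this phone.", "en_XX")
target = format_mbart50("నేను ఈ ఫోన్ను ప్రేమిస్తున్నాను.", "te_IN")
print(source)
print(target)
```

The language code at the front of the target is also why generation later forces `te_IN` as the first decoded token: it tells the decoder which language to produce.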

### Preprocessing & Tokenization

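We tokenize the sentence pairs with the mBART tokenizer and apply the preprocessing function through the Hugging Face `datasets` `map` method. Inputs and labels are padded or truncated to 128 tokens. Only 500 randomly selected examples (fixed seed) are used, to simulate a **few-shot** setting.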


In [None]:
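def preprocess_function(example):
    # Tokenize the English source sentence.
    model_inputs = tokenizer(
        example["english"],
        max_length=128,
        truncation=True,
        padding="max_length",
    )

    # Tokenize the Telugu target via `text_target`, which replaces the
    # deprecated `as_target_tokenizer` context manager.
    labels = tokenizer(
        text_target=example["telugu"],
        max_length=128,
        truncation=True,
        padding="max_length",
    )
    # Replace padding ids with -100 so the loss ignores padded label positions.
    model_inputs["labels"] = [
        tok if tok != tokenizer.pad_token_id else -100 for tok in labels["input_ids"]
    ]
    return model_inputs


# Fixed seed so the 500-example few-shot subset is reproducible.
small_dataset = dataset["train"].shuffle(seed=42).select(range(500))

# map() is called without batched=True, so each call receives a single example.
tokenized_datasets = small_dataset.map(
    preprocess_function, remove_columns=["english", "telugu"]
)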
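
To make `max_length`, `truncation`, and `padding="max_length"` concrete, the following sketch applies the same steps to toy id sequences in plain Python. It also shows a step that is common practice with `Trainer`: replacing padding ids in the labels with `-100`, the value the loss ignores by default. The pad id `1`, the toy ids, and the helper names are assumptions for illustration; a real tokenizer additionally keeps its special tokens when it truncates.

```python
PAD_ID = 1           # assumed pad id for this toy example
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def pad_or_truncate(ids: list[int], max_length: int, pad_id: int = PAD_ID) -> list[int]:
    """Cut the sequence to max_length, then right-pad it to exactly max_length."""
    clipped = ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


def mask_label_padding(labels: list[int], pad_id: int = PAD_ID) -> list[int]:
    """Swap padding ids for IGNORE_INDEX so they do not contribute to the loss."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in labels]


short_seq = pad_or_truncate([9, 87, 196, 2], max_length=6)
long_seq = pad_or_truncate(list(range(10, 20)), max_length=6)
masked = mask_label_padding(short_seq)
print(short_seq)  # [9, 87, 196, 2, 1, 1]
print(long_seq)   # [10, 11, 12, 13, 14, 15]
print(masked)     # [9, 87, 196, 2, -100, -100]
```

Padding every example to the full 128 tokens keeps batching simple at the cost of some wasted computation on short sentences.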

### Fine-tuning the Model

We use the Hugging Face `Trainer` API with basic `TrainingArguments`: 5 epochs, a per-device batch size of 8, and a learning rate of 5e-5. Mixed-precision training (`fp16=True`) requires a CUDA GPU.


In [None]:
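args = TrainingArguments(
    output_dir="./eng2te_fewshot",
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    num_train_epochs=5,
    save_total_limit=1,   # keep only the most recent checkpoint
    save_steps=500,
    logging_dir="./logs",
    report_to="none",     # no external experiment tracker
    disable_tqdm=False,
    fp16=True,            # mixed precision; requires a CUDA GPU
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets,
    tokenizer=tokenizer,  # recent transformers releases deprecate this in favor of `processing_class`
)

trainer.train()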
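
A quick calculation shows how many optimizer updates these settings produce. The sketch assumes a single device and the default gradient accumulation of 1; `total_update_steps` is a hypothetical helper, not a `Trainer` method.

```python
import math


def total_update_steps(num_examples, batch_size, epochs, grad_accum=1):
    """Estimate optimizer updates for a run on one device (last partial batch kept)."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


steps = total_update_steps(num_examples=500, batch_size=8, epochs=5)
print(steps)        # 315
print(steps < 500)  # True: the run ends before `save_steps=500` is reached
```

With 315 updates, the step-based save interval of 500 never fires during training. Whether a final checkpoint is still written depends on the installed `transformers` version, so calling `trainer.save_model()` after training is a safe way to keep the fine-tuned weights.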

### Back-Translation: English → Telugu (Synthetic)

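We use the fine-tuned model to generate **synthetic Telugu translations** from **monolingual English sentences**, producing pairs of authentic English and synthetic Telugu that can augment the parallel dataset. Note that back-translation in the strict sense runs from the target language back into the source; pairs generated in this direction are therefore best suited to training a Telugu → English model, where the synthetic text sits on the input side.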

In [16]:
monolingual_english = ["I love this phone.", "The movie was fantastic.", "I feel tired today."]

# Pad the batch so sentences of different lengths fit into one tensor.
tokenizer.src_lang = "en_XX"
inputs = tokenizer(
    monolingual_english, return_tensors="pt", padding=True, truncation=True, max_length=128
).to(model.device)

# Generate synthetic Telugu; forcing te_IN as the first token selects the output language.
outputs = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"])
telugu_translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Pair each English sentence with its synthetic Telugu translation.
for en, te in zip(monolingual_english, telugu_translations):
    print(f"EN: {en}  ⇄  TE (synthetic): {te}")


EN: I love this phone.  ⇄  TE (synthetic): నేను ఈ ఫోన్ను ప్రేమిస్తున్నాను.
EN: The movie was fantastic.  ⇄  TE (synthetic): చిత్రం అద్భుతమైనది.
EN: I feel tired today.  ⇄  TE (synthetic): నేను ఈ రోజు అలసిపోతుంది ఉన్నాను.
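
One way to turn these outputs into augmentation data is to store each pair as a record that uses the corpus's `english` and `telugu` column names, plus a flag marking it as synthetic. The sketch below is a minimal version under those assumptions; `build_synthetic_pairs` and the `synthetic` field are hypothetical, and the filtering rules (drop empty outputs, drop outputs that copy the source, drop duplicates) are simple heuristics rather than part of the original notebook.

```python
def build_synthetic_pairs(english, telugu):
    """Pair sources with model outputs, skipping empty, copied, or repeated pairs."""
    seen = set()
    pairs = []
    for en, te in zip(english, telugu):
        en, te = en.strip(), te.strip()
        if not te or te == en:   # empty output or untranslated copy
            continue
        if (en, te) in seen:     # exact duplicate pair
            continue
        seen.add((en, te))
        pairs.append({"english": en, "telugu": te, "synthetic": True})
    return pairs


sources = ["I love this phone.", "I feel tired today.", "I love this phone.", "OK"]
outputs = ["నేను ఈ ఫోన్ను ప్రేమిస్తున్నాను.", "   ", "నేను ఈ ఫోన్ను ప్రేమిస్తున్నాను.", "OK"]
pairs = build_synthetic_pairs(sources, outputs)
print(len(pairs))  # 1
```

Records like these could be appended to the real parallel data, with the `synthetic` flag making it easy to filter or down-weight them later.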
