**Introduction**

This project focuses on English-to-Hindi Neural Machine Translation (NMT) using the Hugging Face Transformers library. The goal is to fine-tune a pre-trained MarianMT model (Helsinki-NLP/opus-mt-en-hi) on the IIT Bombay English-Hindi Parallel Corpus, enabling high-quality translation between English and Hindi text.

Machine Translation is a key task in Natural Language Processing (NLP) that enables communication across language barriers by automatically converting text from one language to another. Traditional rule-based translation systems often struggle with linguistic complexity, idiomatic expressions, and contextual nuances. Neural Machine Translation (NMT), however, leverages deep learning architectures, particularly sequence-to-sequence (Seq2Seq) models with attention mechanisms, to produce fluent and contextually accurate translations.

In this project, the workflow includes:

Dataset Loading: Using the cfilt/iitb-english-hindi dataset from the Hugging Face Hub.

Model Fine-Tuning: Fine-tuning the pre-trained MarianMT model on the IIT Bombay dataset to adapt it to the English–Hindi translation task.

Evaluation: Measuring translation quality using the SacreBLEU metric.

Deployment: Creating an interactive translation interface with Gradio, allowing real-time text translation from English to Hindi.

By fine-tuning a pre-trained transformer-based model, this project demonstrates how transfer learning can be leveraged for multilingual translation tasks — achieving strong results even with limited computational resources.

**Installing Libraries**

In [1]:
!pip install datasets
!pip install transformers
!pip install sentencepiece
!pip install transformers[torch]`
!pip install sacrebleu
!pip install evaluate
!pip install sacrebleu
!pip install accelerate - U
!pip install gradio
!pip install kaleido cohere  openai tiktoken typing - extensions == 4.5.0

/bin/bash: -c: line 1: unexpected EOF while looking for matching ``'
/bin/bash: -c: line 2: syntax error: unexpected end of file
Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m51.8/51.8 kB[0m [31m2.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting portalocker (from sacrebleu)
  Downloading portalocker-3.2.0-py3-none-any.whl.metadata (8.7 kB)
Collecting colorama (from sacrebleu)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Downloading sacrebleu-2.5.1-py3-none-any.whl (104 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m104.1/104.1 kB[0m [31m5.1 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Downloading portalocker-3.2.0-py3-none-any.whl (22 kB)
Installing collected packages: portalocker, colorama, sacrebleu
Successfully installed colorama-0.4.6 portalocker-3.2.0 sacrebleu-2.5.1
Collecting evaluate
  

**Loading the Dataset**

In [2]:
from datasets import load_dataset
dataset = load_dataset("cfilt/iitb-english-hindi")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

dataset_infos.json:   0%|          | 0.00/953 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/190M [00:00<?, ?B/s]

data/validation-00000-of-00001.parquet:   0%|          | 0.00/85.7k [00:00<?, ?B/s]

data/test-00000-of-00001.parquet:   0%|          | 0.00/500k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1659083 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/520 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2507 [00:00<?, ? examples/s]

**Load Model and Tokenizer**

In [3]:
max_length = 256

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-hi")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-hi")

tokenizer_config.json:   0%|          | 0.00/44.0 [00:00<?, ?B/s]

config.json: 0.00B [00:00, ?B/s]

source.spm:   0%|          | 0.00/812k [00:00<?, ?B/s]

target.spm:   0%|          | 0.00/1.07M [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]



pytorch_model.bin:   0%|          | 0.00/306M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/306M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

**Example Translation**

In [4]:
article = dataset['validation'][2]['translation']['en']
inputs = tokenizer(article, return_tensors="pt")

translated_tokens = model.generate(
    **inputs, max_length=256
)
tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

'एमएनएपी शिक्षकों के राष्ट्रपति, राजस्वीवर ने इस पुरस्कार को पेश करने के द्वारा स्कूल की प्रतिष्ठा की.'

**Tokenize the Dataset**

In [5]:
def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["hi"] for ex in examples["translation"]]

    model_inputs = tokenizer(inputs, max_length=max_length, truncation=True)
    labels = tokenizer(targets, max_length=max_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

In [6]:
tokenized_datasets_validation = dataset['validation'].map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["validation"].column_names,
    batch_size=2
)

tokenized_datasets_test = dataset['test'].map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["test"].column_names,
    batch_size=2)

Map:   0%|          | 0/520 [00:00<?, ? examples/s]

Map:   0%|          | 0/2507 [00:00<?, ? examples/s]

**Define the Data Collator**

In [7]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

**Set Model Training Parameters**

In [8]:
for parameter in model.parameters():
    parameter.requires_grad = True
num_layers_to_freeze = 10
for layer_index, layer in enumerate(model.model.encoder.layers):
    print
    if layer_index < len(model.model.encoder.layers) - num_layers_to_freeze:
        for parameter in layer.parameters():
            parameter.requires_grad = False

num_layers_to_freeze = 10
for layer_index, layer in enumerate(model.model.decoder.layers):
    print
    if layer_index < len(model.model.encoder.layers) - num_layers_to_freeze:
        for parameter in layer.parameters():
            parameter.requires_grad = False

**Evaluate the Model**

In [9]:
import evaluate

metric = evaluate.load("sacrebleu")

import numpy as np


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}

Downloading builder script: 0.00B [00:00, ?B/s]

**Train the Model**

In [19]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

from transformers import Seq2SeqTrainingArguments

model.to(device)
training_args = Seq2SeqTrainingArguments(
    f"finetuned-nlp-en-hi",
    gradient_checkpointing=True,
    per_device_train_batch_size=32,
    learning_rate=1e-5,
    warmup_steps=2,
    max_steps=20,
    fp16=True,
    optim='adafactor',
    per_device_eval_batch_size=16,
    metric_for_best_model="eval_bleu",
    predict_with_generate=True,
    push_to_hub=False,
    report_to="none",
)

In [20]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    training_args,
    train_dataset=tokenized_datasets_test,
    eval_dataset=tokenized_datasets_validation,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()

  trainer = Seq2SeqTrainer(


Step,Training Loss




TrainOutput(global_step=20, training_loss=3.425583267211914, metrics={'train_runtime': 1252.3041, 'train_samples_per_second': 0.511, 'train_steps_per_second': 0.016, 'total_flos': 11237307973632.0, 'train_loss': 3.425583267211914, 'epoch': 0.25316455696202533})

**Building an Interactive Gradio App**

In [21]:
import gradio as gr


def translate(text):
  inputs = tokenizer(text, return_tensors="pt").to(device)
  translated_tokens = model.generate(**inputs,  max_length=256)
  results = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
  return results

interface = gr.Interface(fn=translate,inputs=gr.Textbox(lines=2, placeholder='Text to translate'),
                        outputs='text')

interface.launch()

It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://9eb514eddb2738c712.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




**Conclusion**

This project successfully demonstrates the process of fine-tuning a pre-trained transformer-based Neural Machine Translation (NMT) model for English–Hindi translation using the Hugging Face Transformers ecosystem. By leveraging the MarianMT model (Helsinki-NLP/opus-mt-en-hi) and the IIT Bombay English–Hindi Parallel Corpus, the system effectively learns to generate fluent and contextually accurate translations.

Through preprocessing, tokenization, fine-tuning, and evaluation using the SacreBLEU metric, the model achieved meaningful translation performance within a limited training setup. The project also showcased how transfer learning and layer freezing can optimize training efficiency and prevent overfitting, particularly when fine-tuning large transformer architectures on bilingual data.

The integration of a Gradio interface provided an intuitive, user-friendly platform to test the model in real-time, bridging the gap between research and practical application. This emphasizes how deep learning models can be deployed for everyday multilingual communication tasks.