<a href="https://colab.research.google.com/github/Khesorw/AshtraMind/blob/main/IndicTrans2_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [8]:
# !pip install transformers sentencepiece evaluate accelerate
# !pip install datasets==3.6.0
# !pip install rouge_score
# !pip install sacrebleu

In [1]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from datasets import load_dataset
import evaluate

# Load test dataset
ds = load_dataset("rahular/itihasa", split="test")
src_texts_raw = [ex["translation"]["en"] for ex in ds]
tgt_texts = [ex["translation"]["sn"] for ex in ds]

# Language tags required by IndicTrans2
src_lang_tag = "eng_Latn"
tgt_lang_tag = "san_Deva"

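# Prepend the source and target language tags to each sentence.
# Note: the IndicTrans2 model card preprocesses input with IndicProcessor
# (from IndicTransToolkit), which also normalizes the text. Prefixing the
# tags by hand, as done here, skips that normalization step.
src_texts = [f"{src_lang_tag} {tgt_lang_tag} {text}" for text in src_texts_raw]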

# Load tokenizer and model
model_name = "ai4bharat/indictrans2-en-indic-dist-200M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Load evaluation metrics
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
chrf = evaluate.load("chrf")

# Translation function
def translate_batch(texts, max_length=128):
    """Translate a batch of tagged source sentences using beam search."""
    # max_length caps both the tokenized input and the generated output.
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=max_length)
    inputs = inputs.to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, use_cache=False, max_length=max_length, num_beams=5)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Batched translation
all_preds, all_refs = [], []
batch_size = 16

for i in range(0, len(src_texts), batch_size):
    batch_src = src_texts[i:i + batch_size]
    batch_ref = tgt_texts[i:i + batch_size]

    preds = translate_batch(batch_src)
    all_preds.extend(preds)
    all_refs.extend([[ref] for ref in batch_ref])

# Strip whitespace
all_preds = [p.strip() for p in all_preds]
all_refs_flat = [ref[0].strip() for ref in all_refs]

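# Compute metrics
# Caveats for Devanagari output:
# - The "rouge" metric's default tokenizer keeps only ASCII letters and digits,
#   so Devanagari text tokenizes to nothing and every ROUGE score comes out as 0.
#   Passing a custom `tokenizer` (e.g. whitespace splitting) to rouge.compute avoids this.
# - The "chrf" metric defaults to word_order=0, which is plain chrF, so the value
#   printed under the "chrF++" label is chrF; chrF++ needs word_order=2.
bleu_score = bleu.compute(predictions=all_preds, references=all_refs)["bleu"]
rouge_score = rouge.compute(predictions=all_preds, references=all_refs_flat)
chrf_score = chrf.compute(predictions=all_preds, references=all_refs_flat)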

# Results
print("\n📊 IndicTrans2-small on Itihāsa Test Set:")
print(f"BLEU:     {bleu_score:.4f}")
print(f"ROUGE-1:  {rouge_score['rouge1']:.4f}")
print(f"ROUGE-2:  {rouge_score['rouge2']:.4f}")
print(f"ROUGE-L:  {rouge_score['rougeL']:.4f}")
print(f"chrF++:   {chrf_score['score']:.4f}")

# Sample outputs
print("\n🔍 Sample translations:")
for src, ref, pred in zip(src_texts_raw[:3], all_refs_flat[:3], all_preds[:3]):
    print(f"\nSRC:  {src}")
    print(f"PRED: {pred}")
    print(f"REF:  {ref}")


A new version of the following files was downloaded from https://huggingface.co/ai4bharat/indictrans2-en-indic-dist-200M:
- tokenization_indictrans.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/ai4bharat/indictrans2-en-indic-dist-200M:
- configuration_indictrans.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/ai4bharat/indictrans2-en-indic-dist-200M:
- modeling_indictrans.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


📊 IndicTrans2-small on Itihāsa Test Set:
BLEU:     0.0002
ROUGE-1:  0.0000
ROUGE-2:  0.0000
ROUGE-L:  0.0000
chrF++:   23.6763

🔍 Sample translations:

SRC:  Hearing the words of Viśvāmitra, Rāghava, together with Laksmana, was struck with amazement, and spoke to Viśvāmitra, saying,
PRED: लक्ष्मणेन सह, विश्वामित्ररघवस्य वचनानि श्रुत्वा, सः विस्मितः भूत्वा, विश्वामित्रेन सह उक्तवान् ।
REF:  विश्वामित्रवचः श्रुत्वा राघवः सहलक्ष्मणः। विस्मयं परमं गत्वा विश्वामित्रमथाब्रवीत्॥

SRC:  O Brāhmaṇa, wonderful is the story that you have recited to us, viz; that of Ganga's sacred dissension and the replenishing of the Ocean.
PRED: हे ब्राह्मणः अद्भुतः कथा अस्ति या भवान् गङ्गायाः पवित्रविवेचनस्य, समुद्रस्य पुनर्भरणस्य च कथां पठितवान् ।
REF:  अत्यद्भुतमिदं ब्रह्मन् कथितं परमं त्वया। गङ्गावतरणं पुण्यं सागरस्यापि पूरणम्॥

SRC:  And, O afflicter of foes, as we had been reflecting upon all this at length, the night has passed away as if it were as moment.
PRED: हे शत्रूनां पीडितः, यतोहि वयं एतस्मिन् व
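
The ROUGE scores of exactly 0.0000 above are most likely a tokenization artifact rather than a sign that the translations share no words with the references. The sketch below uses only the standard library to illustrate the effect. `ascii_alnum_tokens` is a hypothetical stand-in that imitates the default ROUGE tokenizer (lowercase, keep only runs of `a-z0-9`). `unigram_f1` is a simplified ROUGE-1-style score written for this illustration, not the library's implementation. The two short Sanskrit strings are made-up examples built from words in the samples above.

```python
import re
from collections import Counter


def ascii_alnum_tokens(text):
    """Assumed imitation of the default ROUGE tokenizer: lowercase the text
    and keep only runs of ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def whitespace_tokens(text):
    """Script-agnostic alternative: split on whitespace only."""
    return text.split()


def unigram_f1(pred, ref, tokenize):
    """Simplified ROUGE-1-style F1 over unigram counts (illustration only)."""
    pred_counts = Counter(tokenize(pred))
    ref_counts = Counter(tokenize(ref))
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example pair sharing two words.
ref = "विश्वामित्रवचः श्रुत्वा राघवः सहलक्ष्मणः।"
pred = "लक्ष्मणेन सह श्रुत्वा राघवः"

print("ASCII-only tokens:", ascii_alnum_tokens(ref))
print("F1, ASCII-only tokenizer:", unigram_f1(pred, ref, ascii_alnum_tokens))
print("F1, whitespace tokenizer:", unigram_f1(pred, ref, whitespace_tokens))
```

With the `evaluate` ROUGE metric, the corresponding change would be to pass a callable such as `tokenizer=lambda s: s.split()` to `rouge.compute`, assuming the installed version exposes that argument. Scores computed that way are not comparable to the zeros reported above.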

In [5]:
print(ds.features)
print(ds[0]["translation"])

{'translation': Translation(languages=['sn', 'en'], id=None)}
{'en': 'Hearing the words of Viśvāmitra, Rāghava, together with Laksmana, was struck with amazement, and spoke to Viśvāmitra, saying,', 'sn': 'विश्वामित्रवचः श्रुत्वा राघवः सहलक्ष्मणः। विस्मयं परमं गत्वा विश्वामित्रमथाब्रवीत्॥'}
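
The download log for the model cell warns that new versions of the remote code files were fetched and suggests pinning a revision. A minimal sketch of that, assuming the `revision` argument of `from_pretrained` is used for both the tokenizer and the model: `pinned_kwargs` and `load_pinned` are hypothetical helper names, and the revision value must be a real commit hash, branch, or tag taken from the model repository (none is given in this notebook).

```python
def pinned_kwargs(revision):
    """Build from_pretrained keyword arguments that fix the Hub revision.

    `revision` is a commit hash, branch, or tag from the model repository;
    it is supplied by the caller and not defined anywhere in this notebook.
    """
    if not revision:
        raise ValueError("revision must be a non-empty string")
    return {"trust_remote_code": True, "revision": revision}


def load_pinned(model_name, revision):
    """Load the tokenizer and model at one fixed revision (needs network access)."""
    # Imported lazily so that defining the helper does not require transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    kwargs = pinned_kwargs(revision)
    tokenizer = AutoTokenizer.from_pretrained(model_name, **kwargs)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name, **kwargs)
    return tokenizer, model
```

Pinning keeps the custom tokenizer, configuration, and modeling code from changing between runs, which also makes the reported metrics easier to reproduce.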
