<a href="https://colab.research.google.com/github/TechKnight10/Machine-translation_using-IndicTRANS2-with-Flores200-devtest/blob/main/MTwithIndicTRANS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import torch
import pandas as pd
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig, AutoTokenizer
from IndicTransToolkit import IndicProcessor

In [23]:
citation = "@article{gala2023indictrans, title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages}, author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan}, journal={Transactions on Machine Learning Research}, issn={2835-8856}, year={2023}, url={https://openreview.net/forum?id=vfT4YuzAYA}, note={}}"

In [2]:
BATCH_SIZE = 4
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
quantization = None
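
The `quantization` setting above is left at `None`; the loader in this notebook also accepts `"4-bit"` and `"8-bit"`. As a rough guide to what each choice costs, the sketch below estimates the memory taken by the weights alone from a parameter count and a bit width. `weight_memory_gib` is a hypothetical helper, and the parameter count of one billion is only an assumption read off the "1B" in the checkpoint name. Real usage will be higher because of activations, beam search state, and any layers that quantization leaves untouched.

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: roughly one billion parameters, taken from the "1B" checkpoint name.
ASSUMED_PARAMS = 1_000_000_000

for label, bits in [("float32", 32), ("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(ASSUMED_PARAMS, bits):.2f} GiB")
```

Halving the bit width halves the estimate, which is the main reason to try a quantized load when the full-precision model does not fit on the available GPU.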

In [3]:
def initialize_model_and_tokenizer(ckpt_dir, quantization):
    if quantization == "4-bit":
        qconfig = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
    elif quantization == "8-bit":
        # BitsAndBytesConfig has no 8-bit counterparts of the bnb_4bit_* options
        qconfig = BitsAndBytesConfig(
            load_in_8bit=True,
        )
    else:
        qconfig = None

    tokenizer = AutoTokenizer.from_pretrained(ckpt_dir, trust_remote_code=True)
    model = AutoModelForSeq2SeqLM.from_pretrained(
        ckpt_dir,
        trust_remote_code=True,
        low_cpu_mem_usage=True,
        quantization_config=qconfig,
    )

    if qconfig is None:
        model = model.to(DEVICE)
        if DEVICE == "cuda":
            model.half()

    model.eval()

    return tokenizer, model


def batch_translate(input_sentences, src_lang, tgt_lang, model, tokenizer, ip):
    translations = []
    for i in range(0, len(input_sentences), BATCH_SIZE):
        batch = input_sentences[i : i + BATCH_SIZE]

        # Preprocess the batch and extract entity mappings
        batch = ip.preprocess_batch(batch, src_lang=src_lang, tgt_lang=tgt_lang)

        # Tokenize the batch and generate input encodings
        inputs = tokenizer(
            batch,
            truncation=True,
            padding="longest",
            return_tensors="pt",
            return_attention_mask=True,
        ).to(DEVICE)

        # Generate translations using the model
        with torch.no_grad():
            generated_tokens = model.generate(
                **inputs,
                use_cache=True,
                min_length=0,
                max_length=256,
                num_beams=5,
                num_return_sequences=1,
            )

        # Decode the generated tokens into text

        with tokenizer.as_target_tokenizer():
            generated_tokens = tokenizer.batch_decode(
                generated_tokens.detach().cpu().tolist(),
                skip_special_tokens=True,
                clean_up_tokenization_spaces=True,
            )

        # Postprocess the translations, including entity replacement
        translations += ip.postprocess_batch(generated_tokens, lang=tgt_lang)

        del inputs
        torch.cuda.empty_cache()

    return translations
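
`batch_translate` walks the input in steps of `BATCH_SIZE`, so the last batch may be shorter than the rest. The minimal sketch below isolates just that slicing logic; `iter_batches` is a hypothetical helper, not part of the notebook or of IndicTransToolkit.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])


sentences = [f"sentence {n}" for n in range(10)]
for batch in iter_batches(sentences, batch_size=4):
    print(len(batch), batch[0])
```

With ten sentences and a batch size of 4, this yields batches of 4, 4, and 2 sentences, and an empty input yields no batches at all.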

In [4]:
en_indic_ckpt_dir = "ai4bharat/indictrans2-en-indic-1B"  # smaller alternative: ai4bharat/indictrans2-en-indic-dist-200M
en_indic_tokenizer, en_indic_model = initialize_model_and_tokenizer(en_indic_ckpt_dir, quantization)



In [27]:
def load_dataset(file_path):
    """Read one parallel dataset and preview it; return None if the file is missing."""
    try:
        df = pd.read_excel(file_path)
    except FileNotFoundError:
        print(f"The file {file_path} was not found.")
        return None
    print(df.head())
    return df


df1 = load_dataset('parallel_dataset_eng_hin.xlsx')
df2 = load_dataset('parallel_dataset_eng_ben.xlsx')
df3 = load_dataset('parallel_dataset_eng_guj.xlsx')
df4 = load_dataset('parallel_dataset_eng_mar.xlsx')
df5 = load_dataset('parallel_dataset_eng_tam.xlsx')
df6 = load_dataset('parallel_dataset_eng_tel.xlsx')

                                             English  \
0  Vatican City's population is around 800. It is...   
1   All citizens of Vatican City are Roman Catholic.   
2  It has a notably wide variety of plant communi...   
3  The Amazon River is the second longest and the...   
4  Lion prides act much like packs of wolves or d...   

                                         Translation  
0  वेटिकन सिटी की जनसंख्या लगभग 800 है. यह विश्व ...  
1        वेटिकन सिटी के सभी नागरिक रोमन कैथोलिक हैं.  
2  इसकी सूक्ष्म जलवायु श्रेणी श्रृंखलाओं, भिन्न म...  
3  अमेज़न नदी धरती की दूसरी सबसे लंबी और सबसे बड़...  
4  शेरों का समूह काफी हद तक भेड़ियों या कुत्तों क...  
                                             English  \
0  Vatican City's population is around 800. It is...   
1   All citizens of Vatican City are Roman Catholic.   
2  It has a notably wide variety of plant communi...   
3  The Amazon River is the second longest and the...   
4  Lion prides act much like packs of wolves or d... 
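
Empty cells in an Excel sheet come back from `pd.read_excel` as `NaN`, which is a float rather than a string, so it can be worth filtering a column before handing it to the translation step. The sketch below is one hedged way to do that in plain Python. `clean_sentences` is a hypothetical helper, and whether the datasets used here contain any blank rows is not known from the notebook.

```python
import math
from typing import Iterable, List


def clean_sentences(values: Iterable[object]) -> List[str]:
    """Keep non-empty strings, stripped of surrounding whitespace."""
    cleaned = []
    for value in values:
        if isinstance(value, str) and value.strip():
            cleaned.append(value.strip())
    return cleaned


# Example with a NaN and a blank entry mixed in, as a spreadsheet column might have.
raw = ["All citizens of Vatican City are Roman Catholic.", math.nan, "   ", " Hello. "]
print(clean_sentences(raw))
```

Dropping rows changes the alignment with the reference `Translation` column, so the same filter would need to be applied to both columns if they are compared later.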

In [7]:
ip = IndicProcessor(inference=True)
input_sentences = df1['English'].tolist()
src_lang, tgt_lang = "eng_Latn", "hin_Deva"
translations = batch_translate(input_sentences, src_lang, tgt_lang, en_indic_model, en_indic_tokenizer, ip)

print(f"\n{src_lang} - {tgt_lang}")
for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")




eng_Latn - hin_Deva
eng_Latn: Vatican City's population is around 800. It is the smallest independent country in the world and the country with the lowest population.
hin_Deva: वेटिकन सिटी की आबादी लगभग 800 है। यह दुनिया का सबसे छोटा स्वतंत्र देश है और सबसे कम आबादी वाला देश है। 
eng_Latn: All citizens of Vatican City are Roman Catholic.
hin_Deva: वेटिकन सिटी के सभी नागरिक रोमन कैथोलिक हैं। 
eng_Latn: It has a notably wide variety of plant communities, due to its range of microclimates, differing soils and varying levels of altitude.
hin_Deva: सूक्ष्म जलवायु की अपनी श्रृंखला, अलग-अलग मिट्टी और ऊंचाई के अलग-अलग स्तरों के कारण इसमें पौधों के समुदायों की एक विस्तृत विविधता है। 
eng_Latn: The Amazon River is the second longest and the biggest river on Earth. It carries more than 8 times as much water as the second biggest river.
hin_Deva: अमेज़न नदी पृथ्वी की दूसरी सबसे लंबी और सबसे बड़ी नदी है। यह दूसरी सबसे बड़ी नदी की तुलना में 8 गुना अधिक पानी ले जाती है। 
eng_Latn: Lion prides act mu

In [10]:
translation_df = pd.DataFrame({
    'English': input_sentences,
    'Hindi': translations
})

# Save the DataFrame to a new Excel file
output_file = 'machinetranslation_eng_hin_indictrans2.xlsx'
translation_df.to_excel(output_file, index=False)

print(f"Translations saved to {output_file}")

Translations saved to machinetranslation_eng_hin_indictrans2.xlsx
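
The translate-and-save steps used for Hindi above are the same for every target language; only the language code, the column name, and the output file name change. One way to avoid copying the cell is to keep those values in a table, as sketched below. `TARGETS` and `output_file_for` are hypothetical names; the codes, column names, and file-name pattern are taken from this notebook.

```python
# Target language code -> (short code used in file names, output column name)
TARGETS = {
    "hin_Deva": ("hin", "Hindi"),
    "ben_Beng": ("ben", "Bengali"),
    "guj_Gujr": ("guj", "Gujarati"),
    "mar_Deva": ("mar", "Marathi"),
    "tam_Taml": ("tam", "Tamil"),
    "tel_Telu": ("tel", "Telugu"),
}


def output_file_for(tgt_lang: str) -> str:
    """Build the output file name following the pattern used in this notebook."""
    short_code, _ = TARGETS[tgt_lang]
    return f"machinetranslation_eng_{short_code}_indictrans2.xlsx"


for tgt_lang, (_, column) in TARGETS.items():
    print(tgt_lang, column, output_file_for(tgt_lang))
```

Inside such a loop, each iteration would call `batch_translate` with the given `tgt_lang`, build the DataFrame with `column` as the second header, and write it to `output_file_for(tgt_lang)`.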


In [16]:
ip = IndicProcessor(inference=True)
input_sentences = df2['English'].tolist()
src_lang, tgt_lang = "eng_Latn", "ben_Beng"
translations = batch_translate(input_sentences, src_lang, tgt_lang, en_indic_model, en_indic_tokenizer, ip)

print(f"\n{src_lang} - {tgt_lang}")
for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")

translation_df = pd.DataFrame({
    'English': input_sentences,
    'Bengali': translations
})

# Save the DataFrame to a new Excel file
output_file = 'machinetranslation_eng_ben_indictrans2.xlsx'
translation_df.to_excel(output_file, index=False)

print(f"Translations saved to {output_file}")




eng_Latn - ben_Beng
eng_Latn: Vatican City's population is around 800. It is the smallest independent country in the world and the country with the lowest population.
ben_Beng: ভ্যাটিকান সিটির জনসংখ্যা প্রায় 800। এটি বিশ্বের ক্ষুদ্রতম স্বাধীন দেশ এবং সর্বনিম্ন জনসংখ্যার দেশ। 
eng_Latn: All citizens of Vatican City are Roman Catholic.
ben_Beng: ভ্যাটিকান সিটির সমস্ত নাগরিক রোমান ক্যাথলিক। 
eng_Latn: It has a notably wide variety of plant communities, due to its range of microclimates, differing soils and varying levels of altitude.
ben_Beng: মাইক্রোক্লাইমেটের পরিসীমা, বিভিন্ন মৃত্তিকা এবং উচ্চতার বিভিন্ন স্তরের কারণে এখানে উল্লেখযোগ্যভাবে বিভিন্ন ধরনের উদ্ভিদ সম্প্রদায় রয়েছে। 
eng_Latn: The Amazon River is the second longest and the biggest river on Earth. It carries more than 8 times as much water as the second biggest river.
ben_Beng: আমাজন নদী পৃথিবীর দ্বিতীয় দীর্ঘতম এবং বৃহত্তম নদী। এটি দ্বিতীয় বৃহত্তম নদীর চেয়ে 8 গুণ বেশি জল বহন করে। 
eng_Latn: Lion prides act much like pack

In [25]:
ip = IndicProcessor(inference=True)
input_sentences = df3['English'].tolist()
src_lang, tgt_lang = "eng_Latn", "guj_Gujr"
translations = batch_translate(input_sentences, src_lang, tgt_lang, en_indic_model, en_indic_tokenizer, ip)

print(f"\n{src_lang} - {tgt_lang}")
for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")

translation_df = pd.DataFrame({
    'English': input_sentences,
    'Gujarati': translations
})

# Save the DataFrame to a new Excel file
output_file = 'machinetranslation_eng_guj_indictrans2.xlsx'
translation_df.to_excel(output_file, index=False)

print(f"Translations saved to {output_file}")




eng_Latn - guj_Gujr
eng_Latn: Vatican City's population is around 800. It is the smallest independent country in the world and the country with the lowest population.
guj_Gujr: વેટિકન સિટીની વસ્તી આશરે 800 છે. તે વિશ્વનો સૌથી નાનો સ્વતંત્ર દેશ છે અને સૌથી ઓછી વસ્તી ધરાવતો દેશ છે. 
eng_Latn: All citizens of Vatican City are Roman Catholic.
guj_Gujr: વેટિકન સિટીના તમામ નાગરિકો રોમન કેથોલિક છે. 
eng_Latn: It has a notably wide variety of plant communities, due to its range of microclimates, differing soils and varying levels of altitude.
guj_Gujr: તેની સૂક્ષ્મ આબોહવાની શ્રેણી, જુદી જુદી જમીન અને ઊંચાઈના વિવિધ સ્તરને કારણે તે નોંધપાત્ર રીતે વનસ્પતિ સમુદાયોની વિશાળ વિવિધતા ધરાવે છે. 
eng_Latn: The Amazon River is the second longest and the biggest river on Earth. It carries more than 8 times as much water as the second biggest river.
guj_Gujr: એમેઝોન નદી પૃથ્વી પરની બીજી સૌથી લાંબી અને સૌથી મોટી નદી છે. તે બીજી સૌથી મોટી નદી કરતાં 8 ગણી વધુ પાણી વહન કરે છે. 
eng_Latn: Lion prides act much 

In [28]:
ip = IndicProcessor(inference=True)
input_sentences = df4['English'].tolist()
src_lang, tgt_lang = "eng_Latn", "mar_Deva"
translations = batch_translate(input_sentences, src_lang, tgt_lang, en_indic_model, en_indic_tokenizer, ip)

print(f"\n{src_lang} - {tgt_lang}")
for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")

translation_df = pd.DataFrame({
    'English': input_sentences,
    'Marathi': translations
})

# Save the DataFrame to a new Excel file
output_file = 'machinetranslation_eng_mar_indictrans2.xlsx'
translation_df.to_excel(output_file, index=False)

print(f"Translations saved to {output_file}")




eng_Latn - mar_Deva
eng_Latn: Vatican City's population is around 800. It is the smallest independent country in the world and the country with the lowest population.
mar_Deva: व्हॅटिकन सिटीची लोकसंख्या सुमारे 800 आहे. हा जगातील सर्वात लहान स्वतंत्र देश आहे आणि सर्वात कमी लोकसंख्या असलेला देश आहे. 
eng_Latn: All citizens of Vatican City are Roman Catholic.
mar_Deva: व्हॅटिकन सिटीचे सर्व नागरिक रोमन कॅथलिक आहेत. 
eng_Latn: It has a notably wide variety of plant communities, due to its range of microclimates, differing soils and varying levels of altitude.
mar_Deva: सूक्ष्म हवामानाची व्याप्ती, वेगवेगळी माती आणि उंचीच्या वेगवेगळ्या पातळ्यांमुळे येथे वनस्पतींचे समुदाय लक्षणीयरीत्या वैविध्यपूर्ण आहेत. 
eng_Latn: The Amazon River is the second longest and the biggest river on Earth. It carries more than 8 times as much water as the second biggest river.
mar_Deva: अमेझॉन नदी ही पृथ्वीवरील दुसरी सर्वात लांब आणि सर्वात मोठी नदी आहे. ती दुसऱ्या सर्वात मोठ्या नदीपेक्षा 8 पट जास्त पाणी वाहून नेता

In [29]:
ip = IndicProcessor(inference=True)
input_sentences = df5['English'].tolist()
src_lang, tgt_lang = "eng_Latn", "tam_Taml"
translations = batch_translate(input_sentences, src_lang, tgt_lang, en_indic_model, en_indic_tokenizer, ip)

print(f"\n{src_lang} - {tgt_lang}")
for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")

translation_df = pd.DataFrame({
    'English': input_sentences,
    'Tamil': translations
})

# Save the DataFrame to a new Excel file
output_file = 'machinetranslation_eng_tam_indictrans2.xlsx'
translation_df.to_excel(output_file, index=False)

print(f"Translations saved to {output_file}")


eng_Latn - tam_Taml
eng_Latn: Vatican City's population is around 800. It is the smallest independent country in the world and the country with the lowest population.
tam_Taml: வத்திக்கான் நகரத்தின் மக்கள் தொகை சுமார் 800 ஆகும். இது உலகின் மிகச்சிறிய சுதந்திர நாடு மற்றும் மிகக் குறைந்த மக்கள் தொகை கொண்ட நாடு ஆகும். 
eng_Latn: All citizens of Vatican City are Roman Catholic.
tam_Taml: வத்திக்கான் நகரத்தின் குடிமக்கள் அனைவரும் ரோமன் கத்தோலிக்கர்கள் ஆவர். 
eng_Latn: It has a notably wide variety of plant communities, due to its range of microclimates, differing soils and varying levels of altitude.
tam_Taml: மைக்ரோக்ளைமேட்டுகளின் வரம்பு, மாறுபட்ட மண் மற்றும் மாறுபட்ட உயர நிலைகள் காரணமாக இது குறிப்பிடத்தக்க வகையில் பல்வேறு வகையான தாவர சமூகங்களைக் கொண்டுள்ளது. 
eng_Latn: The Amazon River is the second longest and the biggest river on Earth. It carries more than 8 times as much water as the second biggest river.
tam_Taml: அமேசான் ஆறு பூமியின் இரண்டாவது நீளமான மற்றும் மிகப்பெரிய நதியாகும். இ

In [30]:
ip = IndicProcessor(inference=True)
input_sentences = df6['English'].tolist()
src_lang, tgt_lang = "eng_Latn", "tel_Telu"
translations = batch_translate(input_sentences, src_lang, tgt_lang, en_indic_model, en_indic_tokenizer, ip)

print(f"\n{src_lang} - {tgt_lang}")
for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")

translation_df = pd.DataFrame({
    'English': input_sentences,
    'Telugu': translations
})

# Save the DataFrame to a new Excel file
output_file = 'machinetranslation_eng_tel_indictrans2.xlsx'
translation_df.to_excel(output_file, index=False)

print(f"Translations saved to {output_file}")


eng_Latn - tel_Telu
eng_Latn: Vatican City's population is around 800. It is the smallest independent country in the world and the country with the lowest population.
tel_Telu: వాటికన్ సిటీ జనాభా సుమారు 800. ఇది ప్రపంచంలోనే అతి చిన్న స్వతంత్ర దేశం మరియు అతి తక్కువ జనాభా కలిగిన దేశం. 
eng_Latn: All citizens of Vatican City are Roman Catholic.
tel_Telu: వాటికన్ సిటీ పౌరులందరూ రోమన్ కాథలిక్కులు. 
eng_Latn: It has a notably wide variety of plant communities, due to its range of microclimates, differing soils and varying levels of altitude.
tel_Telu: సూక్ష్మ వాతావరణాల పరిధి, విభిన్న నేలలు మరియు వివిధ స్థాయిల ఎత్తుల కారణంగా ఇది అనేక రకాల మొక్కల సమూహాలను కలిగి ఉంది. 
eng_Latn: The Amazon River is the second longest and the biggest river on Earth. It carries more than 8 times as much water as the second biggest river.
tel_Telu: అమెజాన్ నది భూమిపై రెండవ పొడవైన మరియు అతిపెద్ద నది. ఇది రెండవ అతిపెద్ద నది కంటే 8 రెట్లు ఎక్కువ నీటిని తీసుకువెళుతుంది. 
eng_Latn: Lion prides act much like packs of w