# Create a Multilingual Speech Translation

Here we build a multilingual text translator using the IndicTrans2 models, which were originally trained with fairseq and later converted to HuggingFace transformers for inference.

IndicTrans2 is a Transformer-based translation model, made by volunteers for AI4Bharat, that can translate 22 languages of India.


## Necessary Step

Please run the cells below to install the necessary dependencies.

<font color='red'>DO NOT CHANGE ANY CODE GIVEN BELOW</font>

In [1]:
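# Use %env so the variable is set in the notebook process itself;
# `!export` only affects a temporary subshell and is lost immediately.
%env CUDA_LAUNCH_BLOCKING=1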

In [2]:
%%capture
# Clone the required Git repository for IndicTrans2 (%%capture must be the first line of the cell)
!git clone https://github.com/AI4Bharat/IndicTrans2.git

In [3]:
%%capture
# Change into the Hugging Face interface directory of the cloned repository
%cd /content/IndicTrans2/huggingface_interface

In [4]:
%%capture
# Install other essential dependencies for working with the transformer.
# The version specifier is quoted so the shell does not treat ">" as an output redirection.
!python3 -m pip install nltk sacremoses pandas regex mock "transformers>=4.33.2" mosestokenizer
!python3 -c "import nltk; nltk.download('punkt')"
!python3 -m pip install bitsandbytes scipy accelerate datasets
!python3 -m pip install sentencepiece

!git clone https://github.com/VarunGumma/IndicTransTokenizer
%cd IndicTransTokenizer
!python3 -m pip install --editable ./
%cd ..

**IMPORTANT : Restart your run-time first and then run the cells below.**

## Working with the Transformer


1. Import the following, which you installed in the previous section:
  * transformers
  * torch
  * AutoModelForSeq2SeqLM from transformers
  * BitsAndBytesConfig from transformers
  * IndicProcessor from IndicTransTokenizer
  * IndicTransTokenizer from IndicTransTokenizer

In [1]:
# import essentials
import transformers
import torch
from transformers import BitsAndBytesConfig
from IndicTransTokenizer import IndicTransTokenizer
from IndicTransTokenizer import IndicProcessor
from transformers import AutoModelForSeq2SeqLM

2. Set BATCH_SIZE to 4. Then create a variable DEVICE and set it to "cuda" if torch.cuda.is_available() returns True, otherwise to "cpu". Finally, set QUANTIZATION to None.

In [2]:
# Set the variables
BATCH_SIZE = 4
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
QUANTIZATION = None

3. We are going to create two functions.
    * One function initialises the model and the tokenizer and returns both.
    * Another function translates a whole list of sentences, one batch at a time.


### Creating the model initializer and tokenizer function.


Create a function initialize_model_and_tokenizer which takes 3 arguments: ckpt_dir, direction, quantization.
Inside the function, if quantization == '4-bit', create a variable qconfig and instantiate it with an appropriate BitsAndBytesConfig. Else if quantization == '8-bit', create the corresponding 8-bit config. Otherwise, set qconfig to None.

(For more details, read the documentation on [BitsAndBytesConfig](https://huggingface.co/docs/transformers/en/main_classes/quantization#transformers.BitsAndBytesConfig).)
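The branching described above can be sketched as a small lookup from the `quantization` string to the keyword arguments that would be passed to `BitsAndBytesConfig`. The helper name `quantization_kwargs` is hypothetical and is not part of transformers or IndicTrans2; the keyword names are the ones used in this notebook. The compute dtype is written as the string `"bfloat16"` only so that the sketch runs without torch; in the notebook itself, pass `torch.bfloat16`.

```python
def quantization_kwargs(quantization):
    """Return keyword arguments for BitsAndBytesConfig, or None for no quantization.

    Hypothetical helper for illustration; it only builds a plain dict.
    """
    if quantization == "4-bit":
        return {
            "load_in_4bit": True,
            "bnb_4bit_use_double_quant": True,
            "bnb_4bit_compute_dtype": "bfloat16",  # torch.bfloat16 in the notebook
        }
    if quantization == "8-bit":
        return {"load_in_8bit": True}
    # Any other value (including None) means the model is loaded unquantized.
    return None


for mode in ("4-bit", "8-bit", None):
    print(mode, "->", quantization_kwargs(mode))
```

With such a helper, the config could be built as `BitsAndBytesConfig(**kwargs)` when `kwargs` is not None, and left as None otherwise.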

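After the conditional flow, create a variable tokenizer and set it to IndicTransTokenizer with direction set to direction.

The next step is to create a model variable by loading the pretrained model from the checkpoint directory with AutoModelForSeq2SeqLM. If no quantization config is used, move the model to DEVICE (in half precision on a GPU). Finally, put the model in evaluation mode and return both the tokenizer and the model.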

In [3]:
# Create a function initialize_model_and_tokenizer which takes 3 arguments: ckpt_dir, direction, quantization.

def initialize_model_and_tokenizer(ckpt_dir, direction, quantization):
    # If quantization == '4-bit', build a 4-bit config
    if quantization == '4-bit':
        qconfig = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=torch.bfloat16
        )
    # Else if quantization == '8-bit', build an 8-bit config.
    # Note: transformers reports the two bnb_8bit_* kwargs as unused
    # (see the warning in the recorded output); load_in_8bit=True is what takes effect.
    elif quantization == '8-bit':
        qconfig = BitsAndBytesConfig(
            load_in_8bit=True,
            bnb_8bit_use_double_quant=True,
            bnb_8bit_compute_dtype=torch.bfloat16
        )
    # Else, set it to None.
    else:
        qconfig = None

    # Create a variable tokenizer and set it as IndicTransTokenizer with direction set as direction.
    tokenizer = IndicTransTokenizer(direction=direction)

    # Create a model variable set to AutoModelForSeq2SeqLM
    # Keep trust_remote_code=True, low_cpu_mem_usage=True and quantization_config=qconfig.
    model = AutoModelForSeq2SeqLM.from_pretrained(
        ckpt_dir,
        trust_remote_code=True,
        low_cpu_mem_usage=True,
        quantization_config=qconfig,
    )

    # If qconfig is None, move the model to the device (half precision on GPU).
    # Quantized models are placed on the device by the loader, so they are left as-is.
    if qconfig is None:
        model = model.to(DEVICE)
        if DEVICE == "cuda":
            model.half()

    # Put the model in evaluation mode
    model.eval()

    # Return both tokenizer and model
    return tokenizer, model

### Helper Function to get translation

In [4]:
def batch_translate(input_sentences, src_lang, tgt_lang, model, tokenizer, ip):
    translations = []
    for i in range(0, len(input_sentences), BATCH_SIZE):
        batch = input_sentences[i : i + BATCH_SIZE]

        # Preprocess the batch and extract entity mappings
        batch = ip.preprocess_batch(batch, src_lang=src_lang, tgt_lang=tgt_lang)

        # Tokenize the batch and generate input encodings
        inputs = tokenizer(
            batch,
            src=True,
            truncation=True,
            padding="longest",
            return_tensors="pt",
            return_attention_mask=True,
        ).to(DEVICE)

        # Generate translations using the model
        with torch.no_grad():
            generated_tokens = model.generate(
                **inputs,
                use_cache=True,
                min_length=0,
                max_length=256,
                num_beams=5,
                num_return_sequences=1,
            )

        # Decode the generated tokens into text
        generated_tokens = tokenizer.batch_decode(generated_tokens.detach().cpu().tolist(), src=False)

        # Postprocess the translations, including entity replacement
        translations += ip.postprocess_batch(generated_tokens, lang=tgt_lang)

        del inputs
        torch.cuda.empty_cache()

    return translations
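
The slicing pattern that `batch_translate` uses to walk through the input can be tried in isolation. The sketch below uses a hypothetical helper, `iter_batches`, and placeholder sentences; it only demonstrates how `range(0, len(items), batch_size)` splits a list into consecutive batches, with a shorter final batch when the length is not a multiple of the batch size.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# Ten placeholder sentences split with a batch size of 4
sentences = [f"sentence {n}" for n in range(10)]
batch_sizes = [len(batch) for batch in iter_batches(sentences, 4)]
print(batch_sizes)  # [4, 4, 2]
```

Because slicing past the end of a list is safe in Python, no special handling is needed for the last, shorter batch.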

### English to Indic Example


Provided below are some example sentences

In [5]:
# sample sentences
en_sents = [
    "Akshat is very bad boy.",
    "When I was young, I used to go to the park every day.",
    "He has many old books, which he inherited from his ancestors.",
    "I can't figure out how to solve my problem.",
    "She is very hardworking and intelligent, which is why she got all the good marks.",
    "We watched a new movie last week, which was very inspiring.",
    "If you had met me at that time, we would have gone out to eat.",
    "She went to the market with her sister to buy a new sari.",
    "Raj told me that he is going to his grandmother's house next month.",
    "All the kids were having fun at the party and were eating lots of sweets.",
    "My friend has invited me to his birthday party, and I will give him a gift.",
]

hi_sents = [
    "जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।",
    "उसके पास बहुत सारी पुरानी किताबें हैं, जिन्हें उसने अपने दादा-परदादा से विरासत में पाया।",
    "मुझे समझ में नहीं आ रहा कि मैं अपनी समस्या का समाधान कैसे ढूंढूं।",
    "वह बहुत मेहनती और समझदार है, इसलिए उसे सभी अच्छे मार्क्स मिले।",
    "हमने पिछले सप्ताह एक नई फिल्म देखी जो कि बहुत प्रेरणादायक थी।",
    "अगर तुम मुझे उस समय पास मिलते, तो हम बाहर खाना खाने चलते।",
    "वह अपनी दीदी के साथ बाजार गयी थी ताकि वह नई साड़ी खरीद सके।",
    "राज ने मुझसे कहा कि वह अगले महीने अपनी नानी के घर जा रहा है।",
    "सभी बच्चे पार्टी में मज़ा कर रहे थे और खूब सारी मिठाइयाँ खा रहे थे।",
    "मेरे मित्र ने मुझे उसके जन्मदिन की पार्टी में बुलाया है, और मैं उसे एक तोहफा दूंगा।",
]

Finally, we join the functions and the sample sentences together to produce our own predictions.

Here is the list of languages supported by the IndicTrans2 models:

| Language                       | Code      | Language                        | Code      | Language                       | Code      |
|--------------------------------|-----------|---------------------------------|-----------|--------------------------------|-----------|
| Assamese                       | asm_Beng  | Kashmiri (Arabic)               | kas_Arab  | Punjabi                        | pan_Guru  |
| Bengali                        | ben_Beng  | Kashmiri (Devanagari)           | kas_Deva  | Sanskrit                       | san_Deva  |
| Bodo                           | brx_Deva  | Maithili                        | mai_Deva  | Santali                        | sat_Olck  |
| Dogri                          | doi_Deva  | Malayalam                       | mal_Mlym  | Sindhi (Arabic)                | snd_Arab  |
| English                        | eng_Latn  | Marathi                         | mar_Deva  | Sindhi (Devanagari)            | snd_Deva  |
| Konkani                        | gom_Deva  | Manipuri (Bengali)              | mni_Beng  | Tamil                          | tam_Taml  |
| Gujarati                       | guj_Gujr  | Manipuri (Meitei)               | mni_Mtei  | Telugu                         | tel_Telu  |
| Hindi                          | hin_Deva  | Nepali                          | npi_Deva  | Urdu                           | urd_Arab  |
| Kannada                        | kan_Knda  | Odia                            | ory_Orya  |                                |           |
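
Each code in the table joins a three-letter language tag and a four-letter script tag with an underscore, for example `hin_Deva` for Hindi written in Devanagari. A minimal sketch for checking a code before passing it as `src_lang` or `tgt_lang` is shown below. The names `SUPPORTED_CODES` and `split_code` are hypothetical, and the set is copied by hand from the table above rather than read from the IndicTrans2 library.

```python
# Codes copied from the table above (hypothetical constant, not part of IndicTrans2)
SUPPORTED_CODES = {
    "asm_Beng", "ben_Beng", "brx_Deva", "doi_Deva", "eng_Latn", "gom_Deva",
    "guj_Gujr", "hin_Deva", "kan_Knda", "kas_Arab", "kas_Deva", "mai_Deva",
    "mal_Mlym", "mar_Deva", "mni_Beng", "mni_Mtei", "npi_Deva", "ory_Orya",
    "pan_Guru", "san_Deva", "sat_Olck", "snd_Arab", "snd_Deva", "tam_Taml",
    "tel_Telu", "urd_Arab",
}


def split_code(code):
    """Validate a language code and return its (language, script) parts."""
    if code not in SUPPORTED_CODES:
        raise ValueError(f"Unsupported language code: {code!r}")
    language, script = code.split("_")
    return language, script


print(split_code("tel_Telu"))  # ('tel', 'Telu')
```

Checking the code up front gives a clear error for a typo such as `tel_telu` instead of a failure deeper in preprocessing.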


## English to Telugu

In [11]:
# Create a variable to store "ai4bharat/indictrans2-en-indic-1B" as checkpoint directory
ckpt_dir = "ai4bharat/indictrans2-en-indic-1B"



# get the tokenizer and model by passing essential arguments to initialize_model_and_tokenizer function
en_indic_tokenizer, en_indic_model = initialize_model_and_tokenizer(ckpt_dir, direction="en-indic", quantization='8-bit')


# instantiate IndicProcessor with inference = True
en_indic_ip = IndicProcessor(inference=True)



# Choose the source language as English and the target language as Telugu.
src_lang = "eng_Latn"
tgt_lang = "tel_Telu"
# src_lang = "hin_Deva"
# tgt_lang = "eng_Latn"


input_sents = [
    "My name is Deepak.",
    "I am pursuing a BTech degree in Data Science and AI from IIT Bhilai.",
    "Currently, I am in my pre-final year.",
    "I am passionate about machine learning and artificial intelligence.",
    "In my free time, I enjoy coding and working on various data science projects.",
    "I have experience with programming languages like C++, Python, and SQL.",
    "I am skilled in using frameworks such as Spring Boot and tools for natural language processing.",
    "My goal is to gain experience through internships and contribute to innovative projects in the field of AI.",
    "I actively participate in discussions related to software engineering and data science.",
    "Outside of academics, I like to explore new technologies and stay updated with industry trends.",
]


# Find target translation using the batch_translate function
la_pred = batch_translate(input_sents, src_lang, tgt_lang, en_indic_model, en_indic_tokenizer, en_indic_ip)


print(f"{src_lang}-{tgt_lang}")
# Print input sentence and its translation.
for i in range(len(input_sents)):
    print(f"{src_lang} {input_sents[i]}")
    print(f"{tgt_lang} {la_pred[i]}")
    print()


# flush the models to free the GPU memory
del en_indic_tokenizer, en_indic_model

Unused kwargs: ['bnb_8bit_use_double_quant', 'bnb_8bit_compute_dtype']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
The official Tokenizer is available on HF and can be used as follows:
```
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
  tokenizer = IndicTransTokenizer(direction=direction)


eng_Latn-tel_Telu
eng_Latn My name is Deepak.
tel_Telu నా పేరు దీపక్.

eng_Latn I am pursuing a BTech degree in Data Science and AI from IIT Bhilai.
tel_Telu నేను ఐఐటి భిలాయ్ నుండి డేటా సైన్స్ మరియు ఎఐలో బిటెక్ డిగ్రీ చదువుతున్నాను.

eng_Latn Currently, I am in my pre-final year.
tel_Telu ప్రస్తుతం, నేను నా ప్రీ-ఫైనల్ సంవత్సరంలో ఉన్నాను.

eng_Latn I am passionate about machine learning and artificial intelligence.
tel_Telu నాకు మెషిన్ లెర్నింగ్ మరియు ఆర్టిఫిషియల్ ఇంటెలిజెన్స్ పట్ల మక్కువ ఉంది.

eng_Latn In my free time, I enjoy coding and working on various data science projects.
tel_Telu నా ఖాళీ సమయంలో, నేను కోడింగ్ మరియు వివిధ డేటా సైన్స్ ప్రాజెక్టులలో పనిచేయడం ఆనందిస్తాను.

eng_Latn I have experience with programming languages like C++, Python, and SQL.
tel_Telu నాకు సి + +, పైథాన్ మరియు ఎస్ క్యూ ఎల్ వంటి ప్రోగ్రామింగ్ భాషలలో అనుభవం ఉంది.

eng_Latn I am skilled in using frameworks such as Spring Boot and tools for natural language processing.
tel_Telu స్ప్రింగ్ బూట్ వంటి ఫ్రేమ్వర్క్
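
The cell above ends with `del en_indic_tokenizer, en_indic_model`. Deleting the names only drops the references; PyTorch's caching allocator may keep the freed GPU blocks reserved until it is asked to release them. A minimal sketch of a fuller cleanup follows, assuming a hypothetical helper name `release_gpu_memory`. The torch import is guarded so the snippet also runs where torch is not installed.

```python
import gc


def release_gpu_memory():
    """Collect unreachable objects, then release cached GPU memory if CUDA is in use.

    Returns the number of unreachable objects found by the garbage collector.
    """
    collected = gc.collect()
    try:
        import torch
    except ImportError:
        # Without torch there is no CUDA cache to clear.
        return collected
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return collected


release_gpu_memory()
```

Calling it right after the `del` statement, and before loading the next model, is one way to lower the risk of running out of GPU memory between examples.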

## English to Maithili

In [12]:
# Create a variable to store "ai4bharat/indictrans2-en-indic-1B" as checkpoint directory
ckpt_dir = "ai4bharat/indictrans2-en-indic-1B"



# get the tokenizer and model by passing essential arguments to initialize_model_and_tokenizer function
en_indic_tokenizer, en_indic_model = initialize_model_and_tokenizer(ckpt_dir, direction="en-indic", quantization='8-bit')


# instantiate IndicProcessor with inference = True
en_indic_ip = IndicProcessor(inference=True)



# Choose the source language as English and the target language as Maithili.
src_lang = "eng_Latn"
tgt_lang = "mai_Deva"
# src_lang = "hin_Deva"
# tgt_lang = "eng_Latn"


input_sents = [
    "My name is Deepak.",
    "I am pursuing a BTech degree in Data Science and AI from IIT Bhilai.",
    "Currently, I am in my pre-final year.",
    "I am passionate about machine learning and artificial intelligence.",
    "In my free time, I enjoy coding and working on various data science projects.",
    "I have experience with programming languages like C++, Python, and SQL.",
    "I am skilled in using frameworks such as Spring Boot and tools for natural language processing.",
    "My goal is to gain experience through internships and contribute to innovative projects in the field of AI.",
    "I actively participate in discussions related to software engineering and data science.",
    "Outside of academics, I like to explore new technologies and stay updated with industry trends.",
]

# Find target translation using the batch_translate function
la_pred = batch_translate(input_sents, src_lang, tgt_lang, en_indic_model, en_indic_tokenizer, en_indic_ip)


print(f"{src_lang}-{tgt_lang}")
# Print input sentence and its translation.
for i in range(len(input_sents)):
    print(f"{src_lang} {input_sents[i]}")
    print(f"{tgt_lang} {la_pred[i]}")
    print()


# flush the models to free the GPU memory
del en_indic_tokenizer, en_indic_model

Unused kwargs: ['bnb_8bit_use_double_quant', 'bnb_8bit_compute_dtype']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
The official Tokenizer is available on HF and can be used as follows:
```
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
  tokenizer = IndicTransTokenizer(direction=direction)


eng_Latn-mai_Deva
eng_Latn My name is Deepak.
mai_Deva हमर नाम दीपक अछि।

eng_Latn I am pursuing a BTech degree in Data Science and AI from IIT Bhilai.
mai_Deva हम आई. आई. टी. भिलाई सँ डेटा साइंस आ ए. आई. मे बी. टेक डिग्री प्राप्त कऽ रहल छी।

eng_Latn Currently, I am in my pre-final year.
mai_Deva वर्तमान मे हम अपन प्री-फाइनल ईयर मे छी।

eng_Latn I am passionate about machine learning and artificial intelligence.
mai_Deva हम मशीन लर्निंग आ आर्टिफिशियल इंटेलिजेंस के बारे में भावुक छी।

eng_Latn In my free time, I enjoy coding and working on various data science projects.
mai_Deva खाली समय मे हमरा कोडिंग आ विभिन्न डेटा विज्ञान परियोजना पर काज करबामे मजा अबैत अछि।

eng_Latn I have experience with programming languages like C++, Python, and SQL.
mai_Deva हमरा सी + +, पायथन, आ एसक्यूएल सन प्रोग्रामिंग भाषाक अनुभव अछि।

eng_Latn I a

***