# Building an Indic Translation Pipeline with NeMo Curator

In this tutorial, we use the [IndicTransToolkit](https://github.com/VarunGumma/IndicTransToolkit) library, the [IndicTrans2](https://huggingface.co/ai4bharat/indictrans2-en-indic-1B) model from Hugging Face, [Dask](https://www.dask.org/), and NeMo Curator to build an Indic language translation pipeline. After creating the pipeline, we demonstrate how to use it to translate English text to Hindi.

## Environment Setup

In [1]:
import importlib.util
import os
import sys
import re
from dataclasses import dataclass

In [2]:
# Install NLTK if not already installed
!pip install nltk



In [3]:
import cudf
import dask_cudf
import nltk
import numpy as np
import torch
import torch.nn as nn
from crossfit import op
from crossfit.backend.torch.hf.model import HFModel
from dask.distributed import get_worker
from nltk.tokenize import sent_tokenize
from transformers import AutoConfig, AutoModelForSeq2SeqLM, AutoTokenizer

nltk.download("punkt_tab")

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/nfs/syurick/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [4]:
from nemo_curator.classifiers.base import DistributedDataClassifier
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.distributed_utils import get_client, load_object_on_worker

The [IndicTransToolkit](https://github.com/VarunGumma/IndicTransToolkit) provides a simple, modular, and extensible toolkit for [IndicTrans2](https://github.com/AI4Bharat/IndicTrans2), an open-source transformer-based multilingual NMT model that supports high-quality translation across all 22 scheduled Indic languages.

In [5]:
try:
    from IndicTransToolkit import IndicProcessor
except ModuleNotFoundError:
    raise ImportError(
        "IndicTransToolkit not found. Please install it using the following command: \n"
        + "pip install git+https://github.com/VarunGumma/IndicTransToolkit.git"
    )

Finally, we need to add `transformers_modules` from the Hugging Face cache to the Python path. To populate the cache, we first download the model:

In [6]:
AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/indictrans2-en-indic-1B", trust_remote_code=True)

IndicTransForConditionalGeneration(
  (model): IndicTransModel(
    (encoder): IndicTransEncoder(
      (embed_tokens): Embedding(32322, 1024, padding_idx=1)
      (embed_positions): IndicTransSinusoidalPositionalEmbedding()
      (layers): ModuleList(
        (0-17): 18 x IndicTransEncoderLayer(
          (self_attn): IndicTransAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1024, out_features=8192, bias=True)
          (fc2): Linear(in_features=8192, out_features=1024, bias=True)
          (final_layer_norm): LayerNorm((1024,), eps=1e-05,

Assuming that the Hugging Face cache is located in the user's home directory, we can then add the `transformers_modules` directory to the Python path with:

In [7]:
hf_modules_path = os.path.expanduser("~/.cache/huggingface/modules/transformers_modules")
sys.path.append(hf_modules_path)

if "transformers_modules" not in sys.modules:
    spec = importlib.util.spec_from_file_location("transformers_modules", os.path.join(hf_modules_path, "__init__.py"))
    if spec and spec.loader:
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        sys.modules["transformers_modules"] = module

module_name = "transformers_modules.ai4bharat"
module_path = os.path.join(hf_modules_path, "ai4bharat")

spec = importlib.util.spec_from_file_location(module_name, os.path.join(module_path, "__init__.py"))
if spec and spec.loader:
    ai4bharat_module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(ai4bharat_module)
    sys.modules[module_name] = ai4bharat_module
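
The cell above assumes the default cache location. If your cache lives elsewhere, the path can be derived from the environment instead of being hard-coded. The following is a minimal sketch, assuming that your installed `transformers` version honors the `HF_MODULES_CACHE` and `HF_HOME` environment variables (worth verifying for your release); `resolve_hf_modules_path` is a hypothetical helper name, not part of any library.

```python
import os


def resolve_hf_modules_path(env=None):
    """Return the directory expected to hold `transformers_modules`.

    Hypothetical helper: assumes HF_MODULES_CACHE and HF_HOME are the
    variables your transformers version consults for the modules cache.
    """
    env = os.environ if env is None else env

    # An explicit modules cache location takes priority if it is set.
    modules_cache = env.get("HF_MODULES_CACHE")
    if not modules_cache:
        # Otherwise fall back to HF_HOME, then to ~/.cache/huggingface.
        hf_home = env.get("HF_HOME") or os.path.join(
            os.path.expanduser("~"), ".cache", "huggingface"
        )
        modules_cache = os.path.join(hf_home, "modules")

    return os.path.join(os.path.expanduser(modules_cache), "transformers_modules")


hf_modules_path = resolve_hf_modules_path()
```

The returned value can be used in place of the hard-coded `hf_modules_path`.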

## Helper Classes and Functions for the `IndicTranslation` Class

To create our Indic translation classifier, we create an `IndicTranslation` class, which will be extended from NeMo Curator's `DistributedDataClassifier` class.

The base `DistributedDataClassifier` class enables multi-node, multi-GPU classification of your data. NeMo Curator provides several subclasses that focus on specific tasks, such as domain and quality classification. However, `DistributedDataClassifier` can be extended to fit *any* model; the only requirement is that the model fits on a single GPU. See NeMo Curator's [Distributed Data Classification](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html) documentation for more information.

First, let's create a `TranslationConfig` class. Its purpose is to store some of the attributes that will be used by our model, including the model card of the [IndicTrans2 En-Indic 1.1B variant](https://huggingface.co/ai4bharat/indictrans2-en-indic-1B) on Hugging Face.

In [8]:
@dataclass
class TranslationConfig:
    pretrained_model_name_or_path: str = "ai4bharat/indictrans2-en-indic-1B"
    max_length: int = 50
    num_beams: int = 5
    autocast: bool = False
    max_words_per_sen: int = 200
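
Because `TranslationConfig` is a plain dataclass, individual fields can be overridden while the rest keep their defaults. The sketch below repeats the dataclass only so that it runs on its own; the override values are illustrative (the `IndicTranslation` class later in this tutorial passes `max_length=256`).

```python
from dataclasses import dataclass, replace


# Copy of the tutorial's TranslationConfig so this snippet is standalone.
@dataclass
class TranslationConfig:
    pretrained_model_name_or_path: str = "ai4bharat/indictrans2-en-indic-1B"
    max_length: int = 50
    num_beams: int = 5
    autocast: bool = False
    max_words_per_sen: int = 200


default_config = TranslationConfig()

# Override only the fields that differ; `replace` returns a new instance
# and leaves `default_config` untouched.
long_output_config = replace(default_config, max_length=256, autocast=True)
```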

Next, we create a `CustomModel` class for sequence-to-sequence language modeling. It inherits from `nn.Module`, the base class for all neural network modules in PyTorch.

Inside `__init__`, the model loads a pre-trained sequence-to-sequence model (`AutoModelForSeq2SeqLM`) from Hugging Face, using the model name provided. The `autocast` boolean determines whether mixed precision (`torch.autocast`) is used during inference to speed up computations on CUDA devices; it defaults to `False` in `TranslationConfig` above.

The `_forward` method performs text generation on the input batch without tracking gradients (`@torch.no_grad()`), which is efficient for inference. `self.model.generate()` is called with the batch inputs and several generation parameters to control the decoding behavior. The `forward` method, which `nn.Module` invokes on every call, delegates to `_forward` and wraps it in `torch.autocast` when `autocast` is enabled.

In [9]:
class CustomModel(nn.Module):
    def __init__(self, config: TranslationConfig):
        super().__init__()
        self.config = config
        self.model = AutoModelForSeq2SeqLM.from_pretrained(
            pretrained_model_name_or_path=config.pretrained_model_name_or_path,
            trust_remote_code=True,
        )
        self.autocast = config.autocast

    @torch.no_grad()
    def _forward(self, batch: dict) -> torch.Tensor:
        return self.model.generate(
            **batch,
            use_cache=True,
            min_length=0,
            max_length=self.config.max_length,
            num_beams=self.config.num_beams,
            num_return_sequences=1,
            repetition_penalty=1.2,
        )

    def forward(self, batch: dict) -> torch.Tensor:
        if self.autocast:
            with torch.autocast(device_type="cuda"):
                outputs = self._forward(batch)
        else:
            outputs = self._forward(batch)
        return outputs

Now, let's create the `ModelForSeq2SeqModel` class, a model management class that handles loading the configuration, the `CustomModel`, and the tokenizer for sequence-to-sequence translation. It inherits from `HFModel`, a class provided by NVIDIA's [CrossFit](https://github.com/rapidsai/crossfit) library, which enables multi-node and multi-GPU offline inference.

In it, we create several methods which define how to load our model, its configuration, and its tokenizer.

In [10]:
class ModelForSeq2SeqModel(HFModel):
    def __init__(self, config: TranslationConfig):
        self.trans_config = config
        self.config = self.load_config()
        super().__init__(self.trans_config.pretrained_model_name_or_path)

    def load_model(self, device: str = "cuda") -> CustomModel:
        model = CustomModel(self.trans_config)
        model = model.to(device)
        model.eval()
        return model

    def load_config(self) -> AutoConfig:
        return AutoConfig.from_pretrained(
            pretrained_model_name_or_path=self.trans_config.pretrained_model_name_or_path,
            trust_remote_code=True,
        )

    def load_tokenizer(self) -> AutoTokenizer:
        return AutoTokenizer.from_pretrained(
            pretrained_model_name_or_path=self.trans_config.pretrained_model_name_or_path,
            trust_remote_code=True,
        )

    def max_seq_length(self) -> int:
        return self.config.max_source_positions

    def load_cfg(self):
        return self.load_config()

Finally, let's define some helper functions which will be used by our `IndicTranslation` class.

The `preprocess_df` function is used to load and run the `IndicProcessor` to preprocess our English sentences before tokenization. Note our use of the `load_object_on_worker` function, which loads and stores the `IndicProcessor` on each Dask worker.

In [11]:
def preprocess_df(df: cudf.DataFrame, text_field: str = "text") -> cudf.DataFrame:
    ip = load_object_on_worker(
        "IndicProcessor", IndicProcessor, {"inference": True}
    )

    indices = df[text_field].index.to_arrow().to_pylist()
    sentences = df[text_field].to_arrow().to_pylist()
    sentences = ip.preprocess_batch(
        sentences, src_lang="eng_Latn", tgt_lang="hin_Deva"
    )

    df["indic_proc_text"] = cudf.Series(sentences, index=indices)
    return df

The `has_alphabet_characters` function checks if there is at least one alphabetic character in a given string; the `atleast_letter` function applies it to a DataFrame column to produce another column of booleans.

In [12]:
def has_alphabet_characters(text: str) -> bool:
    return any(c.isalpha() for c in text)


def atleast_letter(df: cudf.DataFrame, text_field: str) -> cudf.DataFrame:
    df = df.to_pandas()
    df["isalpha"] = df[text_field].apply(has_alphabet_characters)
    df = cudf.DataFrame(df)
    return df
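
As a quick illustration of what the filter keeps, the sketch below applies `has_alphabet_characters` to a few hand-picked strings (the samples are illustrative and not part of the tutorial's dataset). Note that `str.isalpha` is Unicode-aware, so non-Latin scripts also count as letters.

```python
def has_alphabet_characters(text: str) -> bool:
    # Same definition as in the tutorial, repeated so the snippet is standalone.
    return any(c.isalpha() for c in text)


samples = ["Hello, world!", "12345 --- !!!", "नमस्ते", ""]

# True where the string contains at least one letter in any script.
flags = [has_alphabet_characters(s) for s in samples]
```

Strings made up only of digits, punctuation, or whitespace are flagged `False`; the pipeline passes such rows through untranslated.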

After translation, the `combine_text` function strips a trailing vertical bar `|` (which is used as a stop marker in our translations) from the translated column when (1) the English source text does not end with a period and (2) the translation ends with a vertical bar. In other words, a trailing vertical bar is kept only when the English text ends with a period. The function then trims surrounding whitespace from every translation.

In [13]:
def combine_text(df: cudf.DataFrame, text_field: str = "text") -> cudf.DataFrame:
    english_stop_flag = df[text_field].str.endswith(".")
    hindi_stop_flag = df["translation"].str.endswith("|")

    df["translation"][~english_stop_flag & hindi_stop_flag] = df[
        "translation"
    ].str.rstrip("|")

    df["translation"] = df["translation"].str.strip()
    return df
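
To make the rule concrete, here is a plain-Python sketch of the same logic applied to a single pair of strings rather than to cuDF columns. The function name `strip_trailing_bar` and the sample strings are hypothetical and used only for illustration.

```python
def strip_trailing_bar(english: str, translation: str) -> str:
    """Per-row version of the rule implemented by `combine_text`."""
    # Remove the trailing stop marker only when the English source has no
    # final period but the translation ends with a vertical bar.
    if not english.endswith(".") and translation.endswith("|"):
        translation = translation.rstrip("|")
    # Surrounding whitespace is trimmed in every case.
    return translation.strip()


kept = strip_trailing_bar("Hello.", "नमस्ते |")     # source ends with a period
removed = strip_trailing_bar("Hello", "नमस्ते |")   # source has no period
```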

The `grouping` function groups rows by `doc_id`, concatenates text-based columns, and retains the first value of other columns within each group. This is useful because our texts will be spread across several rows, but marked with the same `doc_id`. Thus, we use this function to combine those rows.

In [14]:
def grouping(df: cudf.DataFrame, text_field: str = "text") -> cudf.DataFrame:
    df = df.to_pandas()

    agg_funcs = {
        "translation": lambda s: "".join(s),
        text_field: lambda s: "".join(s),
    }

    other_columns = {
        col: "first"
        for col in df.columns
        if col not in agg_funcs and col != "doc_id"
    }

    agg_funcs.update(other_columns)
    df = df.groupby("doc_id").agg(agg_funcs).reset_index()
    df = cudf.DataFrame.from_pandas(df)
    return df
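
The group-and-concatenate logic can be sketched in plain Python, without pandas or cuDF. This is an illustrative stand-in, not the implementation used by the pipeline: `group_rows` is a hypothetical name, and rows are modeled as dictionaries.

```python
def group_rows(rows, text_field="text"):
    """Merge rows sharing a doc_id, concatenating the text columns."""
    grouped = {}
    for row in rows:
        doc_id = row["doc_id"]
        if doc_id not in grouped:
            # The first row of a group supplies the value of every other column.
            grouped[doc_id] = dict(row)
        else:
            # Text columns are joined with no separator, as in `grouping`.
            grouped[doc_id]["translation"] += row["translation"]
            grouped[doc_id][text_field] += row[text_field]
    # Return groups in ascending doc_id order.
    return [grouped[doc_id] for doc_id in sorted(grouped)]


rows = [
    {"doc_id": 1, "text": "One.", "translation": "एक।", "source": "a"},
    {"doc_id": 1, "text": " Two.", "translation": " दो।", "source": "b"},
    {"doc_id": 2, "text": "Three.", "translation": "तीन।", "source": "c"},
]
merged = group_rows(rows)
```

Because the pieces are joined with an empty string, any whitespace between sentences has to be carried by the rows themselves.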

## Building the `IndicTranslation` Class

Our `IndicTranslation` class is a bit of a monster, containing many methods within it. For this tutorial, we aim to make it as digestible as possible by stepping through each method, one by one.

While the first method, `custom_tokenize`, may look intimidating, its goal is very simple: create a list of sentences from a given string. It first splits the text on structural delimiters (such as newlines, tabs, bullets, and repeated punctuation), then uses NLTK tokenization to break each piece into sentences while preserving the whitespace between them. Finally, it drops sentences longer than `max_words_per_sen` words.

In [15]:
def custom_tokenize(self, text: str):
    split_text = re.split(
        r"(\#{2,}|\_{2,}|\…{2,}|\+{2,}|\.{2,}|\-{3,}|\*{2,}|\~{2,}|\={2,}|\!{2,}|\n|\t|\‣|\⁃|\⁌|\⁍|\●|\○|\•|\·|\◘|\◦|\⦾|\⦿|\|)",
        text,
    )

    split_text = [s for s in split_text if len(s) > 0]
    tokenized_sentences = []

    for line in split_text:
        # Tokenize sentences using NLTK's sent_tokenize function
        if has_alphabet_characters(line):
            sentences = sent_tokenize(line)
            i = 0
            j = 0
            curr_tokenized_snt = []
            non_translation_str = ""

            # Align NLTK's tokenized sentences with the original line so that
            # spaces, newlines, and other special characters are preserved
            while i < len(line):
                if j < len(sentences):
                    stripped_sent = sentences[j].strip()

                    if len(stripped_sent) == 0:
                        j += 1
                        continue

                    # If tokenized sentence matches, then moving to next sentence
                    if line[i] == stripped_sent[0]:
                        if non_translation_str != "":
                            curr_tokenized_snt.append(non_translation_str)

                        curr_tokenized_snt.append(stripped_sent)
                        i += len(stripped_sent)
                        j += 1
                        non_translation_str = ""

                    else:
                        non_translation_str += line[i]
                        i += 1

                else:
                    non_translation_str += line[i]
                    i += 1

            if non_translation_str != "":
                curr_tokenized_snt.append(non_translation_str)

            # Add the tokenized sentences to the list
            tokenized_sentences.extend(curr_tokenized_snt)

        else:
            tokenized_sentences.append(line)

    tokenized_sentence_len = []
    for sentence in tokenized_sentences:
        sent = sentence.split()
        # Drop sentences whose word count exceeds the threshold, since the
        # model may not be able to translate them within its output token limit
        if len(sent) <= self.translation_config.max_words_per_sen:
            tokenized_sentence_len.append(sentence)

    return tokenized_sentence_len
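
The first step of `custom_tokenize` relies on a detail of `re.split`: when the pattern is wrapped in a capturing group, the matched delimiters are returned alongside the text between them. The sketch below uses a deliberately simplified pattern (only a few of the delimiters from the full regex) to show that nothing is lost in the split.

```python
import re

# Simplified subset of the tutorial's delimiter pattern, for illustration only.
pattern = r"(\n|\t|\.{2,}|\-{3,})"
text = "First point.\nSecond point... and more"

# The capturing group makes re.split keep the delimiters as list items;
# empty strings are filtered out, as in custom_tokenize.
pieces = [p for p in re.split(pattern, text) if p]

# Joining the pieces reproduces the original text exactly.
roundtrip = "".join(pieces)
```

Keeping the delimiters as separate items is what later allows the translated pieces to be concatenated back into a document with its original layout.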

The `process_input_text` method uses the `custom_tokenize` method above to create a DataFrame where each sentence has its own row, preserving the `doc_id` for context.

For example, if we have the DataFrame:

| text                                                   |
|--------------------------------------------------------|
| "This is a first sentence. This is a second sentence." |
| "This is a third sentence. This is a fourth sentence." |

Then the resulting DataFrame will be:

| text                         | doc_id |
|------------------------------|--------|
| "This is a first sentence."  | 1      |
| "This is a second sentence." | 1      |
| "This is a third sentence."  | 2      |
| "This is a fourth sentence." | 2      |

In [16]:
def process_input_text(self, df: cudf.DataFrame, text_field: str = "text") -> cudf.DataFrame:
    df = df.to_pandas()
    df[text_field] = df[text_field].apply(self.custom_tokenize)
    df["doc_id"] = np.arange(1, len(df) + 1)
    df = df.explode(text_field, ignore_index=True)
    df = df.reset_index(drop=False)
    df = cudf.DataFrame.from_pandas(df)
    return df
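
The explode step can also be pictured without DataFrames. The following plain-Python sketch numbers documents from 1 and emits one row per sentence; `explode_documents` and the naive period-based splitter are hypothetical stand-ins for `process_input_text` and `custom_tokenize`, and edge cases such as empty documents are ignored.

```python
def explode_documents(documents, split_fn):
    """Emit one {"doc_id", "text"} row per sentence of each document."""
    rows = []
    # doc_id starts at 1, matching np.arange(1, len(df) + 1) in the tutorial.
    for doc_id, text in enumerate(documents, start=1):
        for sentence in split_fn(text):
            rows.append({"doc_id": doc_id, "text": sentence})
    return rows


def naive_split(text):
    # Stand-in splitter: break on periods and restore the trailing period.
    return [part.strip() + "." for part in text.split(".") if part.strip()]


documents = [
    "This is a first sentence. This is a second sentence.",
    "This is a third sentence. This is a fourth sentence.",
]
rows = explode_documents(documents, naive_split)
```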

After our translations are generated, the `translate_tokens` method further processes the translations by decoding the tokens back to human-readable text and applying postprocessing with the `IndicProcessor`.

In [17]:
def translate_tokens(self, df: cudf.DataFrame) -> cudf.DataFrame:
    worker = get_worker()

    if hasattr(worker, "IndicProcessor"):
        ip = getattr(worker, "IndicProcessor")
    else:
        ip = load_object_on_worker(
            "IndicProcessor", IndicProcessor, {"inference": True}
        )

    tokenizer = self.model.load_tokenizer()
    indices = df["translation"].index.to_arrow().to_pylist()
    generated_tokens = df["translation"].to_arrow().to_pylist()

    with tokenizer.as_target_tokenizer():
        generated_tokens = tokenizer.batch_decode(
            generated_tokens,
            skip_special_tokens=True,
        )

    generated_tokens = ip.postprocess_batch(generated_tokens, lang="hin_Deva")
    df["translation"] = cudf.Series(data=generated_tokens, index=indices)
    return df

Finally, we create the `IndicTranslation` class by defining the `__init__` and `_run_classifier` methods. We start with the `__init__` method, which uses the `DistributedDataClassifier`, `TranslationConfig`, and `ModelForSeq2SeqModel` classes described above.

We then combine all of the helper functions and class methods into the `_run_classifier` method. This is the method that is called by `DistributedDataClassifier`'s `__call__` method; it is required for all classes that inherit the `DistributedDataClassifier` class.

In [18]:
class IndicTranslation(DistributedDataClassifier):
    def __init__(
        self,
        pretrained_model_name_or_path: str = "ai4bharat/indictrans2-en-indic-1B",
        text_field: str = "text",
        batch_size: int = 128,
        autocast: bool = False,
    ):
        self.pretrained_model_name_or_path = pretrained_model_name_or_path
        self.text_field = text_field
        self.batch_size = batch_size
        self.autocast = autocast

        self.translation_config = TranslationConfig(
            pretrained_model_name_or_path=self.pretrained_model_name_or_path,
            max_length=256,
            num_beams=5,
            autocast=self.autocast,
        )

        self.model = ModelForSeq2SeqModel(self.translation_config)

        super().__init__(
            model=self.model,
            batch_size=self.batch_size,
            device_type="cuda",
            autocast=self.autocast,
            labels=None,
            filter_by=None,
            out_dim=None,
            pred_column=None,
            max_chars=None,
        )

    def _run_classifier(self, dataset: DocumentDataset) -> DocumentDataset:
        ddf = dataset.df
        # See process_input_text helper function defined above
        ddf = ddf.map_partitions(self.process_input_text, text_field=self.text_field, enforce_metadata=False)
        ddf[self.text_field] = ddf[self.text_field].astype("str")

        ddf["word_count"] = ddf[self.text_field].str.split().list.len()
        ddf["word_count"] = ddf["word_count"].astype("int64")
        ddf_true = ddf[(ddf["word_count"] <= self.translation_config.max_words_per_sen)]

        # Filter for at least one unicode letter in text
        # See atleast_letter helper function defined above
        has_letter = ddf_true.map_partitions(atleast_letter, text_field=self.text_field)
        ddf_trans = ddf_true[has_letter["isalpha"]]
        ddf = ddf_trans.drop(columns="word_count")

        ## ddf_false operations
        ddf_false = ddf_true[~has_letter["isalpha"]]
        ddf_false = ddf_false.drop(columns="word_count")
        ddf_false["translation"] = ddf_false[self.text_field]

        # Applying preprocess_df helper function for Indic preprocessing
        ddf[self.text_field] = ddf[self.text_field].astype("str")
        ddf_meta = ddf._meta.copy()
        ddf_meta["indic_proc_text"] = ""
        ddf = ddf.map_partitions(preprocess_df, text_field=self.text_field, meta=ddf_meta)

        columns = ddf.columns.tolist()
        pipe = op.Sequential(
            # This step tokenizes the input text found in the specified text_field
            op.Tokenizer(
                self.model, cols=[self.text_field], tokenizer_type="default"
            ),
            # The Predictor takes the tokenized input and passes it through the model to generate translations
            op.Predictor(
                self.model,
                sorted_data_loader=True,
                batch_size=self.batch_size,
                pred_output_col="translation",
            ),
            keep_cols=columns,
        )
        ddf = pipe(ddf)
        translated_meta = ddf._meta.copy()
        translated_meta["translation"] = "DUMMY_STRING"
        ddf = ddf.map_partitions(self.translate_tokens, meta=translated_meta)
        ddf = ddf.map_partitions(combine_text, text_field=self.text_field, meta=translated_meta)

        # Merging translated and non-translated samples
        ddf_true["false_translation"] = ddf_false["translation"]
        ddf_true["false_translation"] = ddf_true["false_translation"].fillna("")
        ddf_true["translation"] = ddf["translation"]
        ddf_true["translation"] = ddf_true["translation"].fillna("")
        ddf_true["translation"] = (
            ddf_true["translation"] + ddf_true["false_translation"]
        )

        # See grouping helper function defined above
        ddf = ddf_true.map_partitions(grouping, text_field=self.text_field)

        ddf = ddf.drop(columns=["index", "word_count", "false_translation"])
        return DocumentDataset(ddf)

In [19]:
# Add the functions defined above to the IndicTranslation class
IndicTranslation.custom_tokenize = custom_tokenize
IndicTranslation.process_input_text = process_input_text
IndicTranslation.translate_tokens = translate_tokens
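
Assigning a plain function to a class attribute turns it into a regular method, which is why the functions above were written with `self` as their first parameter. A toy sketch of the same pattern, using a hypothetical `Greeter` class:

```python
class Greeter:
    def __init__(self, name: str):
        self.name = name


def greet(self, punctuation: str = "!") -> str:
    # `self` is bound automatically once the function is attached to the class.
    return f"Hello, {self.name}{punctuation}"


# Attach the function after the class has been defined.
Greeter.greet = greet

message = Greeter("Curator").greet()
```

This keeps each method in its own notebook cell while still producing a single class.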

## Run Indic Translation

We have successfully built our Indic translation pipeline! Now, let's demonstrate how to use it with a simple example.

First, let's create a Dask client.

In [20]:
device = "gpu"
client = get_client(cluster_type=device)

cuDF Spilling is enabled


Next, let's create a `DocumentDataset` with some English sentences to translate.

In [21]:
text = [
    "Quantum computing is set to revolutionize the field of cryptography.",
    "Investing in index funds is a popular strategy for long-term financial growth.",
    "Recent advancements in gene therapy offer new hope for treating genetic disorders.",
    "Online learning platforms have transformed the way students access educational resources.",
    "Traveling to Europe during the off-season can be a more budget-friendly option.",
    "Training regimens for athletes have become more sophisticated with the use of data analytics.",
    "Streaming services are changing the way people consume television and film content.",
    "Vegan recipes have gained popularity as more people adopt plant-based diets.",
    "Climate change research is critical for developing sustainable environmental policies.",
    "Telemedicine has become increasingly popular due to its convenience and accessibility.",
]
df = cudf.DataFrame({"text": text})
input_dataset = DocumentDataset(dask_cudf.from_cudf(df, npartitions=1))

Then, we can initialize our `IndicTranslation` model.

In [22]:
text_field = "text"
batch_size = 128
autocast = True

translator_model = IndicTranslation(
    pretrained_model_name_or_path="ai4bharat/indictrans2-en-indic-1B",
    text_field=text_field,
    batch_size=batch_size,
    autocast=autocast,
)

Now let's translate our text!

In [23]:
result_dataset = translator_model(dataset=input_dataset)

In [24]:
result_dataset.df.compute()

GPU: tcp://127.0.0.1:34409, Part: 0: 100%|██████████| 10/10 [00:13<00:00,  1.34s/it]


|   | doc_id | translation | text |
|---|--------|-------------|------|
| 0 | 1 | क्वाण्टम कम्प्यूटिंग क्रिप्टोग्राफी के क्षेत्र... | Quantum computing is set to revolutionize the ... |
| 1 | 2 | इन्डेक्स फंड्स (अनुक्रमणिका फंड्स) में निवेश द... | Investing in index funds is a popular strategy... |
| 2 | 3 | जीन चिकित्सा में हालिया प्रगति आनुवंशिक विकारो... | Recent advancements in gene therapy offer new ... |
| 3 | 4 | ऑनलाइन लर्निंग (ऑनलाइन शिक्षण) प्लेटफॉर्म ने छ... | Online learning platforms have transformed the... |
| 4 | 5 | "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "... | Traveling to Europe during the off-season can ... |
| 5 | 6 | डेटा एनालिटिक्स के उपयोग के साथ एथलीटों के लिए... | Training regimens for athletes have become mor... |
| 6 | 7 | स्ट्रीमिंग स्ट्रीमिंग सेवाएँ टेलीविजन और फिल्म... | Streaming services are changing the way people... |
| 7 | 8 | वेगन वेज के व्यंजनों ने लोकप्रियता हासिल कर ली... | Vegan recipes have gained popularity as more p... |
| 8 | 9 | जलवायु परिवर्तन संशोधन-परिवर्तन अनुसंधान, टिका... | Climate change research is critical for develo... |
| 9 | 10 | टेलिमेडिसिन टेलिमेडिसिन (TELMED), सुविधा और सु... | Telemedicine has become increasingly popular d... |


Finally, we close our Dask client.

In [25]:
client.close()

Thank you for following this tutorial! We have demonstrated how to create and run an Indic translation pipeline in NeMo Curator.