# Indic Translation
This notebook demonstrates how to use NeMo Curator to generate Indic-language text by translating from English. The workflow can be scaled up to multiple nodes and multiple GPUs. It is accelerated by CrossFit, a library that leverages intelligent batching and RAPIDS to speed up offline inference on large datasets. 
This example uses the CTranslate2 model from [here](https://indictrans2-public.objectstore.e2enetworks.net/it2_preprint_ckpts/en-indic-preprint.zip), taken from the IndicTrans2 GitHub repo, [here](https://github.com/AI4Bharat/IndicTrans2/tree/main?tab=readme-ov-file#multilingual-translation-models).
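
As a rough illustration of why batching strategy matters for offline inference, the sketch below compares the padding cost of batching texts in their given order against batching them after sorting by length. It is not CrossFit's implementation; the helper names (`padded_cost`, `sort_by_length`, `restore_order`) are hypothetical, and word counts stand in for token counts.

```python
from typing import List, Tuple


def padded_cost(lengths: List[int], batch_size: int) -> int:
    """Total token slots when every batch is padded to its longest item."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i : i + batch_size]
        total += max(batch) * len(batch)
    return total


def sort_by_length(texts: List[str]) -> Tuple[List[str], List[int]]:
    """Sort texts by word count and remember each item's original position."""
    order = sorted(range(len(texts)), key=lambda i: len(texts[i].split()))
    return [texts[i] for i in order], order


def restore_order(items: List[str], order: List[int]) -> List[str]:
    """Put processed items back into the original input order."""
    restored = [""] * len(items)
    for position, original_index in enumerate(order):
        restored[original_index] = items[position]
    return restored


texts = ["a b c d e f", "a", "a b c d e", "a b"]
sorted_texts, order = sort_by_length(texts)
unsorted_cost = padded_cost([len(t.split()) for t in texts], batch_size=2)
sorted_cost = padded_cost([len(t.split()) for t in sorted_texts], batch_size=2)
print(unsorted_cost, sorted_cost)
```

Grouping items of similar length means fewer padded positions per batch, and keeping the sort order around lets the results be returned in the original input order.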

## Imports section

In [1]:
import os

os.environ["DASK_DATAFRAME__QUERY_PLANNING"] = "False"
import argparse
import re
import time
from dataclasses import dataclass

import cudf
import ctranslate2
import numpy as np
import torch
from dask.distributed import get_worker
import dask_cudf
from nltk.tokenize import sent_tokenize
from transformers import AutoConfig, AutoModelForSeq2SeqLM, AutoTokenizer

### crossfit and nemo_curator imports

In [2]:
from crossfit import op
from crossfit.backend.torch.hf.model import HFModel
from nemo_curator.classifiers.base import DistributedDataClassifier
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.distributed_utils import get_client, load_object_on_worker

### IndicTransToolkit import
For pre- and postprocessing, we use IndicTransToolkit, which is required for translation with IndicTrans2 models. It is a simple, modular library for preprocessing, normalization, and postprocessing.

In [3]:
try:
    from IndicTransToolkit import IndicProcessor
except ImportError:
    raise ImportError(
        "IndicTransToolkit not found. Please install it using the following command: \n"
        + "pip install git+https://github.com/VarunGumma/IndicTransToolkit.git"
    )


## Table of contents
1. [CTranslate2 model integration](#ctranslate2-model-integration)
2. [Define IndicTranslation class](#define-indictranslation-class)
3. [Start the Dask cluster](#start-the-dask-cluster)
4. [Define input](#define-input)
5. [Define output directory](#define-output-directory)
6. [Start the translation](#start-the-translation)

### CTranslate2 Model Integration

We'll now create a custom CTranslate2 model class, which is needed to run inference on the CT2-converted model. This example uses the CTranslate2 model from [here](https://indictrans2-public.objectstore.e2enetworks.net/it2_preprint_ckpts/en-indic-preprint.zip), taken from the IndicTrans2 GitHub repo, [here](https://github.com/AI4Bharat/IndicTrans2/tree/main?tab=readme-ov-file#multilingual-translation-models). CTranslate2 is a C++ and Python library for efficient inference with Transformer models. You can read more about it [here](https://github.com/OpenNMT/CTranslate2). One of its features is fast and efficient execution on both CPU and GPU.

This class will be integrated with CrossFit's `HFModel` class to leverage CrossFit's efficient batching and processing capabilities.

#### Key Components:

1. **CT2CustomModel**: A custom class for CTranslate2 model inference.
2. **ModelForSeq2SeqModel**: An extension of CrossFit's `HFModel` class, tailored for our translation task.
3. **TranslationConfig**: A dataclass for managing translation-specific configuration parameters.

These model definitions are inspired by examples from the CrossFit project. For reference, you can find similar implementations in the [CrossFit GitHub repository](https://github.com/rapidsai/crossfit/pull/83/files#diff-d3c29a7456aac8be2bb3d53ba3d983e36631ea8dd36c4e52d9f3217183d4568f).
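
To make the reshaping step inside `CT2CustomModel.__call__` easier to follow, here is a minimal pure-Python sketch. It assumes a flat list of token strings, as produced by converting a padded batch of IDs back to tokens, and uses placeholder special tokens (`<pad>`, `<s>`, `</s>`, `<unk>`); the real values come from the tokenizer. The helper names `chunk_rows` and `strip_special` are hypothetical.

```python
from typing import List, Set

# Placeholder special tokens; the real ones come from the Hugging Face tokenizer.
SPECIAL_TOKENS: Set[str] = {"<pad>", "<s>", "</s>", "<unk>"}


def chunk_rows(flat_tokens: List[str], row_len: int) -> List[List[str]]:
    """Rebuild the 2D batch layout from a flattened token list."""
    return [flat_tokens[i : i + row_len] for i in range(0, len(flat_tokens), row_len)]


def strip_special(rows: List[List[str]]) -> List[List[str]]:
    """Drop padding and other special tokens from every row."""
    return [[t for t in row if t not in SPECIAL_TOKENS] for row in rows]


# Two rows of length 4; the first one is left-padded.
flat = ["<pad>", "<pad>", "Hello", "</s>", "Good", "morning", "all", "</s>"]
rows = strip_special(chunk_rows(flat, row_len=4))
print(rows)
```

After this cleanup each row holds only the tokens that should be translated, which is the shape of input the class passes to `translate_batch`.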


In [4]:
@dataclass
class TranslationConfig:
    pretrained_model_name_or_path: str
    ct2_model_path: str
    max_words_per_sen: int = 200
    target_lang_code: str = "hin_Deva"


class CT2CustomModel:
    def __init__(self, config: TranslationConfig, device="cuda"):
        self.config = config
        self.tokenizer = AutoTokenizer.from_pretrained(
            pretrained_model_name_or_path=config.pretrained_model_name_or_path,
            trust_remote_code=True,
        )
        self.model = ctranslate2.Translator(
            model_path=config.ct2_model_path, device=device
        )

    def clean_extra_tokens(self, token_2d):
        results = []
        for token_1d in token_2d:
            result = []
            for t in token_1d:
                if (
                    t == self.tokenizer.pad_token
                    or t == self.tokenizer.bos_token
                    or t == self.tokenizer.eos_token
                    or t == self.tokenizer.unk_token
                ):
                    pass
                else:
                    result.append(t)
            results.append(result)
        return results

    def __call__(self, batch):
        token_ids_2d = batch["input_ids"]
        token_ids_1d = token_ids_2d.view(-1).tolist()
        tokens_1d = self.tokenizer.convert_ids_to_tokens(token_ids_1d)
        tokens_2d = [
            tokens_1d[i : i + token_ids_2d.size(1)]
            for i in range(0, len(tokens_1d), token_ids_2d.size(1))
        ]
        tokens = self.clean_extra_tokens(tokens_2d)

        tr_res = self.model.translate_batch(
            tokens,
            min_decoding_length=0,
            max_decoding_length=256,
            beam_size=5,
            num_hypotheses=1,
        )
        translations = ["".join(x.hypotheses[0]) for x in tr_res]
        return translations


class ModelForSeq2SeqModel(HFModel):
    def __init__(self, config):
        self.trans_config = config
        self.config = self.load_cfg()
        super().__init__(
            self.trans_config.pretrained_model_name_or_path, model_output_type="string"
        )

    def load_model(self, device="cuda"):
        model = CT2CustomModel(self.trans_config)
        return model

    def load_config(self):
        return self.load_cfg()

    def load_tokenizer(self):
        return AutoTokenizer.from_pretrained(
            pretrained_model_name_or_path=self.trans_config.pretrained_model_name_or_path,
            trust_remote_code=True,
        )

    def max_seq_length(self) -> int:
        return self.config.max_source_positions

    def load_cfg(self):
        config = AutoConfig.from_pretrained(
            pretrained_model_name_or_path=self.trans_config.pretrained_model_name_or_path,
            trust_remote_code=True,
        )
        return config

### Define IndicTranslation class

Now that we have created the relevant model classes on the CrossFit side, we need to inherit from NeMo Curator's __DistributedDataClassifier__ and implement its **_run_classifier** method inside __IndicTranslation__ in order to run the pipeline.

The _run_classifier method is responsible for running inference. For our translation use case, we need preprocessing and filtering before inference and postprocessing after it.

In this example, the pre- and postprocessing steps are called from within the _run_classifier method. The major steps are as follows:

1.  Run the process_input_text method, which splits each English document into sentences with NLTK's sentence tokenizer and drops any sentence longer than the configured limit (default = 200 words).
2.  Keep for translation only the sentences that contain at least one alphabetic character.
3.  Rows that fail the check in step 2 are not translated; they are added to the final data with the translation set to the input text.
4.  Rows that pass the check in step 2 go through Indic preprocessing from IndicTransToolkit.
5.  CrossFit's Tokenizer and Predictor run on the data.
6.  The output from step 5 goes through detokenization and Indic postprocessing.
7.  Combine the results from step 6 and step 3 and return the data.
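
The steps above can be sketched in plain Python without any GPU libraries. This is a simplified illustration, not the notebook's implementation: splitting is done on newlines only, and `fake_translate` is a hypothetical stand-in for steps 4 to 6 (it just uppercases its input).

```python
import re
from typing import Callable, List


def has_letter(text: str) -> bool:
    """True when the text holds at least one alphabetic character."""
    return any(c.isalpha() for c in text)


def translate_document(doc: str, translate: Callable[[List[str]], List[str]]) -> str:
    # Step 1 (simplified): split on newlines and keep the separators.
    pieces = [p for p in re.split(r"(\n)", doc) if p]
    # Step 2: only pieces containing a letter are sent to the model.
    to_translate = [p for p in pieces if has_letter(p)]
    translated = iter(translate(to_translate))
    # Steps 3 and 7: pass the rest through unchanged and restore the order.
    return "".join(next(translated) if has_letter(p) else p for p in pieces)


def fake_translate(batch: List[str]) -> List[str]:
    """Hypothetical stand-in for steps 4 to 6 (preprocess, predict, postprocess)."""
    return [s.upper() for s in batch]


result = translate_document("first line\n1234\nsecond line", fake_translate)
print(result)
```

The digits-only line and the newlines never reach the translation function, yet they appear in the right place in the output, which is the behaviour the merge in step 7 is meant to achieve.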


In [12]:
class IndicTranslation(DistributedDataClassifier):
    def __init__(
        self,
        ct2_model_path: str,
        pretrained_model_name_or_path: str = "ai4bharat/indictrans2-en-indic-1B",
        input_column: str = "indic_proc_text",
        batch_size: int = 128,
        autocast: bool = False,
        target_lang_code: str = "hin_Deva",
    ):
        self.input_column = input_column
        self.batch_size = batch_size
        self.autocast = autocast

        self.translation_config = TranslationConfig(
            pretrained_model_name_or_path=pretrained_model_name_or_path,
            ct2_model_path=ct2_model_path,
            target_lang_code=target_lang_code,
        )
        self.model = ModelForSeq2SeqModel(self.translation_config)
        super().__init__(
            model=self.model,
            batch_size=self.batch_size,
            device_type="cuda",
            autocast=self.autocast,
            labels=None,
            filter_by=None,
            out_dim=None,
            pred_column=None,
            max_chars=None,
        )

    def preprocess_df(self, df: cudf.DataFrame) -> cudf.DataFrame:
        ip = load_object_on_worker(
            "IndicProcessor", IndicProcessor, {"inference": True}
        )
        indices = df["text"].index.to_arrow().to_pylist()
        sentences = df["text"].to_arrow().to_pylist()
        sentences = ip.preprocess_batch(
            sentences,
            src_lang="eng_Latn",
            tgt_lang=self.translation_config.target_lang_code,  # "hin_Deva"
        )
        df["indic_proc_text"] = cudf.Series(sentences, index=indices)
        return df

    def translate_tokens(self, df: cudf.DataFrame) -> cudf.DataFrame:
        worker = get_worker()
        if hasattr(worker, "IndicProcessor"):
            ip = getattr(worker, "IndicProcessor")
        else:
            ip = load_object_on_worker(
                "IndicProcessor", IndicProcessor, {"inference": True}
            )
        tokenizer = self.model.load_tokenizer()
        indices = df["translation"].index.to_arrow().to_pylist()
        generated_tokens = df["translation"].to_arrow().to_pylist()
        converted_tokens = []
        for g in generated_tokens:
            converted_tokens.append(tokenizer.convert_tokens_to_string(g))
        converted_tokens = ip.postprocess_batch(
            converted_tokens, lang=self.translation_config.target_lang_code
        )
        print(f"Translated samples :\n{converted_tokens}")
        df["translation"] = cudf.Series(data=converted_tokens, index=indices)
        return df

    def has_alphabet_characters(self, text: str) -> bool:
        return any(c.isalpha() for c in text)

    def custom_tokenize(self, text: str):
        split_text = re.split(
            r"(\#{2,}|\_{2,}|\…{2,}|\+{2,}|\.{2,}|\-{3,}|\*{2,}|\~{2,}|\={2,}|\!{2,}|\n|\t|\‣|\⁃|\⁌|\⁍|\●|\○|\•|\·|\◘|\◦|\⦾|\⦿|\|)",
            text,
        )
        split_text = [s for s in split_text if len(s) > 0]
        tokenized_sentences = []
        len_flag = False
        for line in split_text:
            # Tokenize sentences using NLTK's sent_tokenize function
            if self.has_alphabet_characters(line):
                sentences = sent_tokenize(line)
                i = 0
                j = 0
                curr_tokenized_snt = []
                non_translation_str = ""
                # Comparing the list of tokenized sentences (using NLTK) and actual sentence and preserving the spaces,
                # newline and other special characters
                while i < len(line):
                    if j < len(sentences):
                        stripped_sent = sentences[j].strip()
                        if len(stripped_sent) == 0:
                            j += 1
                            continue
                        # If tokenized sentence matches then moving to next sentence
                        if line[i] == stripped_sent[0]:
                            if non_translation_str != "":
                                curr_tokenized_snt.append(non_translation_str)
                            curr_tokenized_snt.append(stripped_sent)
                            i += len(stripped_sent)
                            j += 1
                            non_translation_str = ""
                        else:
                            non_translation_str += line[i]
                            i += 1
                    else:
                        non_translation_str += line[i]
                        i += 1
                if non_translation_str != "":
                    curr_tokenized_snt.append(non_translation_str)
                # Add the tokenized sentences to the list
                tokenized_sentences.extend(curr_tokenized_snt)
            else:
                tokenized_sentences.append(line)

        tokenized_sentence_len = []
        for sentence in tokenized_sentences:
            sent = sentence.split()
            # removing the sentences with word length greater than threshold as the model may not be able translate it due to constraint on output token size
            if len(sent) <= self.translation_config.max_words_per_sen:
                tokenized_sentence_len.append(sentence)

        return tokenized_sentence_len

    def process_input_text(self, df: cudf.DataFrame) -> cudf.DataFrame:
        df = df.to_pandas()
        df["text"] = df["text"].apply(self.custom_tokenize)
        df["doc_id"] = np.arange(1, len(df) + 1)
        df = df.explode("text", ignore_index=True)
        df = df.reset_index(drop=False)
        df = cudf.DataFrame.from_pandas(df)
        return df

    def remove_false_fullstop(self, df: cudf.DataFrame) -> cudf.DataFrame:
        english_stop_flag = df["text"].str.endswith(".")
        hindi_stop_flag = df["translation"].str.endswith("|")
        df["translation"][~english_stop_flag & hindi_stop_flag] = df[
            "translation"
        ].str.rstrip("|")
        df["translation"] = df["translation"].str.strip()
        return df

    def grouping(self, df: cudf.DataFrame) -> cudf.DataFrame:
        df = df.to_pandas()
        agg_funcs = {
            "translation": lambda s: "".join(s),
            "text": lambda s: "".join(s),
        }
        other_columns = {
            col: "first"
            for col in df.columns
            if col not in agg_funcs and col != "doc_id"
        }

        agg_funcs.update(other_columns)
        df = df.groupby("doc_id").agg(agg_funcs).reset_index()
        df = cudf.DataFrame.from_pandas(df)
        return df

    def atleast_letter(self, df: cudf.DataFrame, column_name: str) -> cudf.DataFrame:
        df = df.to_pandas()
        df["isalpha"] = df[column_name].apply(self.has_alphabet_characters)
        df = cudf.DataFrame(df)
        return df

    def _run_classifier(self, dataset: DocumentDataset) -> DocumentDataset:
        ddf = dataset.df
        # Apply process_input_text to:
        # 1. Break each document into sentences with NLTK tokenization.
        # 2. Create one row per sentence.
        # 3. Strip symbols from the start and end of sentences.
        ddf_true = ddf.map_partitions(self.process_input_text, enforce_metadata=False)
        ddf_true["text"] = ddf_true["text"].astype("str")

        # Keep only rows whose text contains at least one Unicode letter
        has_letter = ddf_true.map_partitions(self.atleast_letter, column_name="text")
        ddf = ddf_true[has_letter["isalpha"]]
        ## ddf false operations
        ddf_false = ddf_true[~has_letter["isalpha"]]
        ddf_false["translation"] = ddf_false["text"]
        # Applying preprocess_df for Indic preprocessing
        ddf["text"] = ddf["text"].astype("str")
        ddf_meta = ddf._meta.copy()
        ddf_meta["indic_proc_text"] = ""
        ddf = ddf.map_partitions(self.preprocess_df, meta=ddf_meta)

        columns = ddf.columns.tolist()
        pipe = op.Sequential(
            op.Tokenizer(
                self.model,
                cols=[self.input_column],
                tokenizer_type="default",
                max_length=255,
            ),
            op.Predictor(
                self.model,
                sorted_data_loader=True,
                batch_size=self.batch_size,
                pred_output_col="translation",
            ),
            keep_cols=columns,
        )
        ddf = pipe(ddf)
        translated_meta = ddf._meta.copy()
        translated_meta["translation"] = "DUMMY_STRING"
        ddf = ddf.map_partitions(self.translate_tokens, meta=translated_meta)
        ddf = ddf.map_partitions(self.remove_false_fullstop, meta=translated_meta)

        # Merging translated and non-translated samples
        ddf_true["false_translation"] = ddf_false["translation"]
        ddf_true["false_translation"] = ddf_true["false_translation"].fillna("")
        ddf_true["translation"] = ddf["translation"]
        ddf_true["translation"] = ddf_true["translation"].fillna("")
        ddf_true["translation"] = (
            ddf_true["translation"] + ddf_true["false_translation"]
        )
        # drop() returns a new DataFrame, so the result must be reassigned
        ddf_true = ddf_true.drop(columns=["false_translation"])
        ddf = ddf_true.map_partitions(self.grouping)
        return DocumentDataset(ddf)
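
The `custom_tokenize` method relies on two behaviours that are easy to miss: `re.split` with a capturing group keeps the separators in its output, and pieces longer than the word limit are dropped rather than truncated. The sketch below shows both with a reduced separator pattern (an assumption for brevity; the class uses a much longer one) and hypothetical helper names.

```python
import re
from typing import List

# A reduced version of the separator pattern: runs of dots or dashes, newlines, tabs.
SEPARATORS = re.compile(r"(\.{2,}|-{3,}|\n|\t)")


def split_keep_separators(text: str) -> List[str]:
    """Split on separators; the capturing group keeps them in the result."""
    return [piece for piece in SEPARATORS.split(text) if piece]


def drop_long_pieces(pieces: List[str], max_words: int) -> List[str]:
    """Keep only pieces whose whitespace-separated word count is within the limit."""
    return [p for p in pieces if len(p.split()) <= max_words]


text = "Title\n---\nA short line...and a much longer trailing line here"
pieces = split_keep_separators(text)
# Joining the pieces gives back the original text: the split itself loses nothing.
roundtrip = "".join(pieces)
kept = drop_long_pieces(pieces, max_words=3)
print(kept)
```

Because over-long pieces are removed entirely, documents with very long sentences will come back shorter than the input; raising `max_words_per_sen` trades that loss against the model's output length limit.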

### Start the Dask Cluster
NeMo Curator runs on Dask and Dask-cuDF to distribute computation. You can read more about it [in the documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/cpuvsgpu.html). The translation pipeline in this notebook runs on the GPU, so we need to start a GPU-based local Dask cluster before we can use it.

In [6]:
client = get_client(cluster_type="gpu")

### Define input

In [7]:
text = [
    "Quantum computing is set to revolutionize the field of cryptography.",
    "Investing in index funds is a popular strategy for long-term financial growth.",
    "Recent advancements in gene therapy offer new hope for treating genetic disorders.",
    "Online learning platforms have transformed the way students access educational resources.",
    "Traveling to Europe during the off-season can be a more budget-friendly option.",
    "Training regimens for athletes have become more sophisticated with the use of data analytics.",
    "Streaming services are changing the way people consume television and film content.",
    "Vegan recipes have gained popularity as more people adopt plant-based diets.",
    "Climate change research is critical for developing sustainable environmental policies.",
    "Telemedicine has become increasingly popular due to its convenience and accessibility.",
]
df = cudf.DataFrame({"text": text})
input_dataset = DocumentDataset(dask_cudf.from_cudf(df, npartitions=1))
write_to_filename = False

### Define output directory


In [8]:
output_data_dir = "out_data"

### Start the translation

IndicTranslation needs `ct2_model_path`, the path to the CTranslate2-converted model (downloaded from [here](https://indictrans2-public.objectstore.e2enetworks.net/it2_preprint_ckpts/en-indic-preprint.zip) and extracted locally).
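
Before starting a long-running job, it can help to check that the model path points at a usable directory. The helper below is a small sketch that assumes a converted CTranslate2 model directory contains a `model.bin` file; the function name `looks_like_ct2_model` is hypothetical, and the demo uses a temporary directory rather than a real model.

```python
import tempfile
from pathlib import Path


def looks_like_ct2_model(model_dir: str) -> bool:
    """Return True when the directory exists and holds a model.bin file."""
    path = Path(model_dir)
    return path.is_dir() and (path / "model.bin").is_file()


# Demo on a temporary directory: empty first, then with a placeholder model.bin.
with tempfile.TemporaryDirectory() as tmp:
    empty_ok = looks_like_ct2_model(tmp)
    (Path(tmp) / "model.bin").touch()
    filled_ok = looks_like_ct2_model(tmp)
print(empty_ok, filled_ok)
```

Calling `looks_like_ct2_model(ct2_model_path)` before constructing `IndicTranslation` surfaces a wrong path immediately instead of on the first Dask worker.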

In [13]:
ct2_model_path = "/tmp/en-indic-preprint/ct2_fp16_model"
translator_model = IndicTranslation(
    ct2_model_path=ct2_model_path,
)
result_dataset = translator_model(dataset=input_dataset)
result_dataset.to_json(
    output_file_dir=output_data_dir, write_to_filename=write_to_filename
)

Translated samples :
['  क्वांटम कंप्यूटिंग क्रिप्टोग्राफी के क्षेत्र में क्रांति लाने के लिए तैयार है।', '  सूचकांक निधियों में निवेश करना दीर्घकालिक वित्तीय विकास के लिए एक लोकप्रिय रणनीति है।', '  जीन चिकित्सा में हाल की प्रगति आनुवंशिक विकारों के इलाज के लिए नई आशा प्रदान करती है।', '  ऑनलाइन शिक्षण मंचों ने छात्रों के शैक्षिक संसाधनों तक पहुँचने के तरीके को बदल दिया है।', '  ऑफ-सीजन के दौरान यूरोप की यात्रा करना एक अधिक बजट-अनुकूल विकल्प हो सकता है।', '  एथलीटों के लिए प्रशिक्षण नियम डेटा विश्लेषण के उपयोग के साथ अधिक परिष्कृत हो गए हैं।', '  स्ट्रीमिंग सेवाएँ लोगों के टेलीविजन और फिल्म सामग्री के उपभोग के तरीके को बदल रही हैं।', '  शाकाहारी व्यंजनों ने लोकप्रियता हासिल की है क्योंकि अधिक लोग पौधे आधारित आहार को अपनाते हैं।', '  टिकाऊ पर्यावरण नीतियों को विकसित करने के लिए जलवायु परिवर्तन अनुसंधान महत्वपूर्ण है।', '  टेलीमेडिसिन अपनी सुविधा और सुलभता के कारण तेजी से लोकप्रिय हो गया है।']


GPU: 0, Part: 0: 100%|██████████| 10/10 [00:00<00:00, 10.69it/s]


Writing to disk complete for 1 partitions


