# Distributed Data Classification with NeMo Curator's `MultilingualDomainClassifier`

This notebook demonstrates NeMo Curator's `MultilingualDomainClassifier`. The multilingual domain classifier classifies the domain of texts in any of 52 languages, including English. Its annotations are useful for data blending in foundation model training. For more information about the model, including its output labels, see the [NeMo Curator Multilingual Domain Classifier Hugging Face page](https://huggingface.co/nvidia/multilingual-domain-classifier).

Before running this notebook, see the [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) for instructions on installing NeMo Curator. Be sure to use an installation method that includes GPU dependencies.

In [None]:
# Reduce log noise: only show loguru messages at ERROR level or above
import os

os.environ["LOGURU_LEVEL"] = "ERROR"



The following imports are required for this tutorial:

In [None]:
import pandas as pd
from ray_curator.core.client import get_ray_client
from ray_curator.pipeline import Pipeline
from ray_curator.stages.text.classifiers import MultilingualDomainClassifier
from ray_curator.stages.text.io.reader.jsonl import JsonlReader
from ray_curator.stages.text.io.writer.jsonl import JsonlWriter

To run a pipeline in NeMo Curator, we must start a Ray cluster. This can be done manually (see the [Ray documentation](https://docs.ray.io/en/latest/ray-core/starting-ray.html)) or with Curator's `get_ray_client` function:

In [None]:
try:
    client = get_ray_client()
except Exception as e:
    msg = f"Error initializing Ray client: {e}"
    raise RuntimeError(msg) from e

cuDF Spilling is enabled


# Initialize Read, Classification, and Write Stages

Functions in NeMo Curator are called stages. This tutorial uses three stages: a JSONL file reader, a multilingual domain classification stage, and a JSONL file writer.

First, we create a sample JSONL file to use as input:

In [None]:
input_file_path = "./input_data_dir"

# Create sample dataset for the tutorial
text = [
    # Chinese
    "量子计算将彻底改变密码学领域。",
    # Spanish
    "Invertir en fondos indexados es una estrategia popular para el crecimiento financiero a largo plazo.",
    # English
    "Recent advancements in gene therapy offer new hope for treating genetic disorders.",
    # Hindi
    "ऑनलाइन शिक्षण प्लेटफार्मों ने छात्रों के शैक्षिक संसाधनों तक पहुंचने के तरीके को बदल दिया है।",
    # Bengali
    "অফ-সিজনে ইউরোপ ভ্রমণ করা আরও বাজেট-বান্ধব বিকল্প হতে পারে।",
    # Portuguese
    "Os regimes de treinamento para atletas se tornaram mais sofisticados com o uso de análise de dados.",
    # Russian
    "Стриминговые сервисы меняют способ потребления людьми телевизионного и киноконтента.",
    # Japanese
    "植物ベースの食生活を採用する人が増えるにつれて、ビーガンレシピの人気が高まっています。",
    # Vietnamese
    "Nghiên cứu về biến đổi khí hậu có vai trò quan trọng trong việc phát triển các chính sách môi trường bền vững.",
    # Marathi
    "टेलीमेडिसिन त्याच्या सोयी आणि सुलभतेमुळे अधिक लोकप्रिय झाले आहे.",
]
df = pd.DataFrame({"text": text})

try:
    os.makedirs(input_file_path, exist_ok=True)
    df.to_json(input_file_path + "/data.jsonl", orient="records", lines=True)
except Exception as e:
    msg = f"Error creating input file: {e}"
    raise RuntimeError(msg) from e
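
For reference, JSON Lines stores one JSON object per line. The following is a minimal standard-library sketch of that record-per-line layout, written to a temporary directory so it does not touch the tutorial's input. The helper names `write_jsonl` and `read_jsonl` are illustrative and are not part of NeMo Curator or pandas.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Parse each non-empty line of a JSONL file into a dict."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


sample_records = [
    {"text": "量子计算将彻底改变密码学领域。"},
    {"text": "Recent advancements in gene therapy offer new hope for treating genetic disorders."},
]

# Round-trip the records through a throwaway file
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "data.jsonl"
    write_jsonl(sample_records, sample_path)
    round_tripped = read_jsonl(sample_path)

print(len(round_tripped), "records, keys:", sorted(round_tripped[0]))
```

Each record only needs a `text` field here, matching the single-column DataFrame created above.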

We can define the reader stage with:

In [None]:
# Read existing directory of JSONL files
read_stage = JsonlReader(
    file_paths=input_file_path,
    files_per_partition=1,
    reader="pandas",
)

The classifier stage is broken down under the hood into a tokenizer stage and a model inference stage. Tokenization runs on the CPU, while model inference runs on the GPU. Optionally, the output may be filtered to keep only texts whose predicted domain is listed in `filter_by`.

In [None]:
# Initialize the multilingual domain classifier
classifier_stage = MultilingualDomainClassifier()

# If desired, you may filter your dataset with:
# classifier_stage = MultilingualDomainClassifier(filter_by=["Science", "Health"])
# See full list of domains here: https://huggingface.co/nvidia/multilingual-domain-classifier
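
Conceptually, `filter_by` acts as an allow-list over the predicted labels. The sketch below illustrates that idea in plain Python on hand-written records. It is not Curator's implementation, and the `predicted_domain` field name, the helper `filter_by_label`, and the example predictions are all assumptions made for illustration.

```python
def filter_by_label(records, label_field, allowed_labels):
    """Keep only the records whose predicted label is in allowed_labels."""
    allowed = set(allowed_labels)
    return [r for r in records if r.get(label_field) in allowed]


# Hypothetical classifier output; the field name and labels are illustrative only
predictions = [
    {"text": "Gene therapy offers new hope.", "predicted_domain": "Health"},
    {"text": "Index funds are popular.", "predicted_domain": "Finance"},
    {"text": "Quantum computing and cryptography.", "predicted_domain": "Science"},
]

kept = filter_by_label(predictions, "predicted_domain", ["Science", "Health"])
print([r["predicted_domain"] for r in kept])
```

Records keep their original order; only those with a label outside the allow-list are dropped.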

Finally, we can define a stage for writing the results:

In [None]:
# Write results to a directory
output_file_path = "./classifier_results"
write_stage = JsonlWriter(output_dir=output_file_path)

# Initialize Pipeline

In NeMo Curator, we use pipelines to run distributed data workflows using Ray. Pipelines take care of resource allocation and autoscaling to improve throughput and minimize GPU idle time.

The distributed data classifiers achieve speedups by running model inference in parallel across all available GPUs, while other stages such as I/O, tokenization, and filtering run across all available CPUs. This is possible because Curator pipelines are composable, which allows each stage in a pipeline to run independently and with its own specified hardware resources.

In [None]:
classifier_pipeline = Pipeline(name="classifier_pipeline", description="Run a classifier pipeline")

# Add stages to the pipeline
classifier_pipeline.add_stage(read_stage)
classifier_pipeline.add_stage(classifier_stage)
classifier_pipeline.add_stage(write_stage)

Composability is also what allows a classifier to sit between pre-processing and post-processing stages. Typical text pre-processing add-ons include text normalization (lowercasing, URL/email removal, Unicode cleanup) and language identification and filtering (to keep only target languages). A full pipeline may look something like:

```python
# Note: lang_id_stage is a placeholder for a stage that is not defined in this notebook
pipeline = Pipeline(name="full_pipeline")
pipeline.add_stage(read_stage)                # reader (JSONL/S3/etc.)
pipeline.add_stage(lang_id_stage)             # optional: language filter
pipeline.add_stage(classifier_stage)          # classifier
pipeline.add_stage(write_stage)               # writer (JSONL/Parquet)
```

# Run the Classifier

Let's run the full classifier pipeline:

In [None]:
%%time

# Run the pipeline
result = classifier_pipeline.run()

# Inspect the Output

The write stage returns a list of written files. We can read the first output file into a pandas DataFrame for inspection.

In [None]:
result_file = result[0].data[0]

result_df = pd.read_json(result_file, lines=True)
result_df.head()

We can see that the predictions were generated as expected.
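
To summarize predictions across every output file rather than just the first, one option is to tally the label column directly from the JSONL files. This is a minimal standard-library sketch: the helper `count_labels` is hypothetical, and `predicted_domain` is a placeholder field name, so check `result_df.columns` for the column the classifier actually writes.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_labels(output_dir, label_field):
    """Count values of label_field across every .jsonl file in output_dir."""
    counts = Counter()
    for path in sorted(Path(output_dir).glob("*.jsonl")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    counts[json.loads(line).get(label_field)] += 1
    return counts


# Demo on a throwaway directory; "predicted_domain" is a placeholder field name
with tempfile.TemporaryDirectory() as demo_dir:
    rows = [
        {"text": "a", "predicted_domain": "Science"},
        {"text": "b", "predicted_domain": "Health"},
        {"text": "c", "predicted_domain": "Science"},
    ]
    with open(Path(demo_dir) / "part-0.jsonl", "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    label_counts = count_labels(demo_dir, "predicted_domain")

print(label_counts.most_common())
```

On the tutorial's results, the same function could be pointed at `output_file_path` with the real prediction column name.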