# Distributed Data Classification with NeMo Curator's `DomainClassifier`

This notebook demonstrates the use of NeMo Curator's `DomainClassifier`. The [domain classifier](https://huggingface.co/nvidia/domain-classifier) predicts the domain of a text. Its labels help with data annotation, which is useful for data blending in foundation model training. For more information about the classifier, including its output labels, see the [NemoCurator Domain Classifier](https://huggingface.co/nvidia/domain-classifier) page on Hugging Face.

Before running this notebook, see the [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) for instructions on how to install NeMo Curator. Be sure to use an installation method that includes the GPU dependencies.

In [1]:
# Reduce log verbosity: only show Loguru messages at ERROR level or above
import os

os.environ["LOGURU_LEVEL"] = "ERROR"

The following imports are required for this tutorial:

In [None]:
import pandas as pd
from ray_curator.backends.xenna import XennaExecutor
from ray_curator.core.client import get_ray_client
from ray_curator.pipeline import Pipeline
from ray_curator.stages.text.classifiers import DomainClassifier
from ray_curator.stages.text.io.reader.jsonl import JsonlReader
from ray_curator.stages.text.io.writer.jsonl import JsonlWriter

To run a pipeline in NeMo Curator, we must start a Ray cluster. This can be done manually (see the [Ray documentation](https://docs.ray.io/en/latest/ray-core/starting-ray.html)) or with Curator's `get_ray_client` function:

In [None]:
client = get_ray_client()

# Initialize Read, Classification, and Write Stages

Processing steps in NeMo Curator are called stages. For this tutorial, we will initialize three stages: a JSONL file reader, a domain classification stage, and a JSONL file writer.

First, we create a small sample JSONL file to use as input:

In [4]:
input_file_path = "./input_data_dir"
os.makedirs(input_file_path, exist_ok=True)

# Create sample dataset for the tutorial
text = [
    "Quantum computing is set to revolutionize the field of cryptography.",
    "Investing in index funds is a popular strategy for long-term financial growth.",
    "Recent advancements in gene therapy offer new hope for treating genetic disorders.",
    "Online learning platforms have transformed the way students access educational resources.",
    "Traveling to Europe during the off-season can be a more budget-friendly option.",
    "Training regimens for athletes have become more sophisticated with the use of data analytics.",
    "Streaming services are changing the way people consume television and film content.",
    "Vegan recipes have gained popularity as more people adopt plant-based diets.",
    "Climate change research is critical for developing sustainable environmental policies.",
    "Telemedicine has become increasingly popular due to its convenience and accessibility.",
]
df = pd.DataFrame({"text": text})
df.to_json(input_file_path + "/data.jsonl", orient="records", lines=True)
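For reference, JSON Lines stores one JSON object per line. The following sketch uses only the standard library and a temporary directory to write two of the sample sentences and read them back, which shows the one-record-per-line layout that `df.to_json(..., orient="records", lines=True)` produces. The helpers `write_jsonl` and `read_jsonl` are hypothetical names used for illustration and are not part of NeMo Curator.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Two of the sample sentences from the cell above
sample_records = [
    {"text": "Quantum computing is set to revolutionize the field of cryptography."},
    {"text": "Investing in index funds is a popular strategy for long-term financial growth."},
]

# Use a temporary directory so the tutorial's input directory is left untouched
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "data.jsonl")
    write_jsonl(sample_path, sample_records)
    loaded_records = read_jsonl(sample_path)

print(loaded_records[0]["text"])
```

Each record carries a single `text` field, which is the column the rest of this tutorial classifies.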

We can define the reader stage with:

In [5]:
# Read existing directory of JSONL files
read_stage = JsonlReader(
    file_paths=input_file_path,
    files_per_partition=1,
    reader="pandas",
)
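As a rough mental model, `files_per_partition` controls how many input files are grouped into each unit of work. This is an assumption about the parameter's behavior, not a description of Curator's internals. The hypothetical helper `partition_files` below illustrates that kind of grouping in plain Python, using made-up file names.

```python
def partition_files(file_paths, files_per_partition):
    """Group a list of file paths into consecutive chunks of a fixed size."""
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be at least 1")
    return [
        file_paths[i : i + files_per_partition]
        for i in range(0, len(file_paths), files_per_partition)
    ]


# Hypothetical file names, used only for illustration
example_files = ["part_0.jsonl", "part_1.jsonl", "part_2.jsonl"]

partitions_of_one = partition_files(example_files, 1)
partitions_of_two = partition_files(example_files, 2)

print(partitions_of_one)
print(partitions_of_two)
```

Because this tutorial's input directory contains a single file, `data.jsonl`, any grouping of this kind yields one partition.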

Under the hood, the classifier stage is split into a tokenizer stage and a model inference stage. Tokenization runs on the CPU, while model inference runs on the GPU. Optionally, the results may be filtered so that only texts whose predicted domain is listed in `filter_by` are kept.

In [6]:
# Initialize the domain classifier
classifier_stage = DomainClassifier()

# If desired, you may filter your dataset with:
# classifier_stage = DomainClassifier(filter_by=["Computers_and_Electronics", "Health"])
# See full list of domains here: https://huggingface.co/nvidia/domain-classifier
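Conceptually, `filter_by` keeps only the rows whose predicted label appears in the given list. The sketch below mimics that behavior on plain dictionaries. The helper `filter_by_label` is hypothetical and is not how `DomainClassifier` is implemented, and the labels are copied by hand from the example output shown later in this notebook rather than produced by the model here.

```python
def filter_by_label(rows, allowed_labels, label_field="domain_pred"):
    """Keep only rows whose predicted label is in allowed_labels."""
    allowed = set(allowed_labels)
    return [row for row in rows if row[label_field] in allowed]


# Hand-written rows: sample texts paired with the labels from the example output
rows = [
    {
        "text": "Quantum computing is set to revolutionize the field of cryptography.",
        "domain_pred": "Computers_and_Electronics",
    },
    {
        "text": "Investing in index funds is a popular strategy for long-term financial growth.",
        "domain_pred": "Finance",
    },
    {
        "text": "Recent advancements in gene therapy offer new hope for treating genetic disorders.",
        "domain_pred": "Health",
    },
]

kept_rows = filter_by_label(rows, ["Computers_and_Electronics", "Health"])
for row in kept_rows:
    print(row["domain_pred"], "|", row["text"])
```

With this filter, the finance sentence is dropped and the other two rows pass through unchanged.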

Finally, we can define a stage for writing the results:

In [7]:
# Write results to a directory
output_file_path = "./classifier_results"
write_stage = JsonlWriter(output_dir=output_file_path)

# Initialize Executor and Pipeline

In NeMo Curator, a pipeline defines the sequence of stages, and an executor runs that pipeline as a distributed job on Ray. Together, they take care of resource allocation and autoscaling to improve performance and minimize GPU idle time.

The distributed data classifiers achieve speedups by running model inference in parallel across all available GPUs, while other stages such as I/O, tokenization, and filtering run across all available CPUs.

In [8]:
executor = XennaExecutor()
classifier_pipeline = Pipeline(name="classifier_pipeline", description="Run a classifier pipeline")

# Add stages to the pipeline
classifier_pipeline.add_stage(read_stage)
classifier_pipeline.add_stage(classifier_stage)
classifier_pipeline.add_stage(write_stage)

Pipeline(name='classifier_pipeline', stages=[jsonl_reader(JsonlReader), domain_classifier_classifier(DomainClassifier), jsonl_writer(JsonlWriter)])

# Run the Classifier

Let's run the full pipeline:

In [9]:
%%time

# Run the pipeline with the executor
result = classifier_pipeline.run(executor)

2025-08-19 10:31:07,518	INFO worker.py:1606 -- Using address 127.0.1.1:6390 set in the environment variable RAY_ADDRESS
2025-08-19 10:31:07,527	INFO worker.py:1747 -- Connecting to existing Ray cluster at address: 127.0.1.1:6390...
2025-08-19 10:31:07,547	INFO worker.py:1918 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8276
2025-08-19 10:31:08,620	INFO worker.py:1606 -- Using address 127.0.1.1:6390 set in the environment variable RAY_ADDRESS
2025-08-19 10:31:08,623	INFO worker.py:1747 -- Connecting to existing Ray cluster at address: 127.0.1.1:6390...
2025-08-19 10:31:08,624	INFO worker.py:1765 -- Calling ray.init() again after it has already been called.
Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 3169.58it/s]


CPU times: user 720 ms, sys: 308 ms, total: 1.03 s
Wall time: 11.9 s


Fetching 10 files: 100%|██████████| 10/10 [00:00<00:00, 46192.78it/s]


# Inspect the Output

The write stage returns a list of written files. We can read the first output file into a pandas DataFrame for inspection.

In [10]:
result_file = result[0].data[0]

result_df = pd.read_json(result_file, lines=True)
result_df.head()


| | text | domain_pred |
|---|---|---|
| 0 | Quantum computing is set to revolutionize the ... | Computers_and_Electronics |
| 1 | Investing in index funds is a popular strategy... | Finance |
| 2 | Recent advancements in gene therapy offer new ... | Health |
| 3 | Online learning platforms have transformed the... | Jobs_and_Education |
| 4 | Traveling to Europe during the off-season can ... | Travel_and_Transportation |


We can see that the predictions were generated as expected.
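For data blending, a natural next step is to look at how the predicted labels are distributed. The following is a minimal sketch, assuming the output records carry a `domain_pred` field as in the output above. To stay self-contained, it hard-codes the five predictions shown there instead of reading the output file, and `label_distribution` is a hypothetical helper.

```python
from collections import Counter


def label_distribution(labels):
    """Return the share of each label as a fraction of all labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# The five predictions shown in the output above, copied by hand
predicted_labels = [
    "Computers_and_Electronics",
    "Finance",
    "Health",
    "Jobs_and_Education",
    "Travel_and_Transportation",
]

distribution = label_distribution(predicted_labels)
for label, share in sorted(distribution.items()):
    print(f"{label}: {share:.0%}")
```

On the actual results, `result_df["domain_pred"].value_counts(normalize=True)` gives the same kind of summary directly in pandas.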