# Distributed Data Classification with NeMo Curator's `MultilingualDomainClassifier`

This notebook demonstrates NeMo Curator's `MultilingualDomainClassifier`. The [multilingual domain classifier](https://huggingface.co/nvidia/multilingual-domain-classifier) classifies the domain of text in any of 52 languages, including English. Its predictions help with data annotation, which is useful in data blending for foundation model training. For more information about the classifier, including its output labels, see the model's Hugging Face page linked above.

The multilingual domain classifier is accelerated using [CrossFit](https://github.com/rapidsai/crossfit), a library that leverages intelligent batching and RAPIDS to accelerate offline inference on large datasets.

Before running this notebook, see the [Getting Started](https://github.com/NVIDIA/NeMo-Curator?tab=readme-ov-file#get-started) page for instructions on installing NeMo Curator.

In [1]:
# Silence Warnings (HuggingFace internal warnings)

%env PYTHONWARNINGS=ignore
import warnings
warnings.filterwarnings("ignore")



In [2]:
from nemo_curator import get_client
from nemo_curator.classifiers import MultilingualDomainClassifier
from nemo_curator.datasets import DocumentDataset
import cudf
import dask_cudf

In [3]:
client = get_client(cluster_type="gpu")

cuDF Spilling is enabled


# Set Output File Path

Specify an empty directory below to store the output results.

In [None]:
output_file_path = "./multilingual_domain_results/"
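
One way to guard against writing into a directory that already holds earlier results is to create the directory if it is missing and check that it is empty. The following is a minimal sketch using only the Python standard library; the helper name `prepare_output_dir` is hypothetical and not part of NeMo Curator.

```python
import os


def prepare_output_dir(path: str) -> str:
    """Create `path` if needed and raise if it already contains entries."""
    os.makedirs(path, exist_ok=True)
    if os.listdir(path):
        # Refuse to mix new results with files from an earlier run.
        raise FileExistsError(f"Output directory is not empty: {path}")
    return path
```

With this helper, the assignment above could be written as `output_file_path = prepare_output_dir("./multilingual_domain_results/")`.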

# Prepare Text Data and Initialize Classifier

In [5]:
# Create sample DataFrame
text = [
    # Chinese
    "量子计算将彻底改变密码学领域。",
    # Spanish
    "Invertir en fondos indexados es una estrategia popular para el crecimiento financiero a largo plazo.",
    # English
    "Recent advancements in gene therapy offer new hope for treating genetic disorders.",
    # Hindi
    "ऑनलाइन शिक्षण प्लेटफार्मों ने छात्रों के शैक्षिक संसाधनों तक पहुंचने के तरीके को बदल दिया है।",
    # Bengali
    "অফ-সিজনে ইউরোপ ভ্রমণ করা আরও বাজেট-বান্ধব বিকল্প হতে পারে।",
    # Portuguese
    "Os regimes de treinamento para atletas se tornaram mais sofisticados com o uso de análise de dados.",
    # Russian
    "Стриминговые сервисы меняют способ потребления людьми телевизионного и киноконтента.",
    # Japanese
    "植物ベースの食生活を採用する人が増えるにつれて、ビーガンレシピの人気が高まっています。",
    # Vietnamese
    "Nghiên cứu về biến đổi khí hậu có vai trò quan trọng trong việc phát triển các chính sách môi trường bền vững.",
    # Marathi
    "टेलीमेडिसिन त्याच्या सोयी आणि सुलभतेमुळे अधिक लोकप्रिय झाले आहे.",
]
df = cudf.DataFrame({"text": text})
input_dataset = DocumentDataset(dask_cudf.from_cudf(df, npartitions=1))
write_to_filename = False

# Alternatively, read existing directory of JSONL files
# input_file_path="/input_data_dir/"
# input_dataset = DocumentDataset.read_json(
#     input_file_path, backend="cudf", add_filename=True
# )
# write_to_filename = True

In [6]:
classifier = MultilingualDomainClassifier(batch_size=1024)

# If desired, you may filter your dataset with:
# classifier = MultilingualDomainClassifier(batch_size=1024, filter_by=["Science", "Health"])
# See full list of domains here: https://huggingface.co/nvidia/multilingual-domain-classifier
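
Assuming `filter_by` keeps only the documents whose predicted domain appears in the given list, its effect can be illustrated on plain Python dictionaries. This sketch does not call NeMo Curator; the function `filter_by_domain` and the toy records are hypothetical, and the labels are chosen only for illustration.

```python
def filter_by_domain(records, allowed):
    """Keep only records whose predicted domain is in `allowed`."""
    allowed = set(allowed)
    return [r for r in records if r["domain_pred"] in allowed]


# Toy predictions standing in for classifier output.
records = [
    {"text": "doc A", "domain_pred": "Science"},
    {"text": "doc B", "domain_pred": "Finance"},
    {"text": "doc C", "domain_pred": "Health"},
]

kept = filter_by_domain(records, ["Science", "Health"])
print([r["text"] for r in kept])  # ['doc A', 'doc C']
```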

# Run the Classifier

Dask operations are lazy, so the classifier will not run until we call an eager operation like `to_json`, `compute`, or `persist`.
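
To make the idea of lazy evaluation concrete, here is a toy illustration in plain Python. It is not how Dask is implemented; the class `LazyPipeline` and the function `classify` are hypothetical stand-ins that only record work until `compute()` is called.

```python
class LazyPipeline:
    """Toy lazy collection: records steps and runs them only on compute()."""

    def __init__(self, data, steps=()):
        self.data = data
        self.steps = tuple(steps)

    def map(self, fn):
        # Return a new pipeline with the step recorded; nothing runs here.
        return LazyPipeline(self.data, self.steps + (fn,))

    def compute(self):
        # Eager operation: apply every recorded step to every item.
        out = list(self.data)
        for fn in self.steps:
            out = [fn(x) for x in out]
        return out


calls = []


def classify(text):
    calls.append(text)  # record that work actually happened
    return (text, len(text))


pipeline = LazyPipeline(["a", "bb"]).map(classify)
print(len(calls))  # 0: map only recorded the step
result = pipeline.compute()
print(len(calls))  # 2: compute triggered the work
```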

In [7]:
%%time

result_dataset = classifier(dataset=input_dataset)
result_dataset.to_json(output_path=output_file_path, write_to_filename=write_to_filename)

Starting multilingual domain classifier inference
Writing to disk complete for 1 partition(s)
CPU times: user 2.55 s, sys: 1.48 s, total: 4.02 s
Wall time: 18.2 s


# Inspect the Output

In [8]:
output_dataset = DocumentDataset.read_json(output_file_path, backend="cudf", add_filename=write_to_filename)
output_dataset.head()

Reading 1 files


| | domain_pred | text |
|---|---|---|
| 0 | Science | 量子计算将彻底改变密码学领域。 |
| 1 | Finance | Invertir en fondos indexados es una estrategia... |
| 2 | Health | Recent advancements in gene therapy offer new ... |
| 3 | Jobs_and_Education | ऑनलाइन शिक्षण प्लेटफार्मों ने छात्रों के शैक्ष... |
| 4 | Travel_and_Transportation | অফ-সিজনে ইউরোপ ভ্রমণ করা আরও বাজেট-বান্ধব বিকল... |
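
Beyond looking at the first rows, it can help to summarize how many documents fall into each domain. The sketch below counts labels using only the standard library. It assumes that every file in the output directory is in JSON Lines format (one JSON object per line) and that each record has a `domain_pred` field, as in the output above; the helper `count_domains` is hypothetical and not part of NeMo Curator.

```python
import json
import os
from collections import Counter


def count_domains(output_dir: str, label_field: str = "domain_pred") -> Counter:
    """Count predicted labels across all JSON Lines files in `output_dir`."""
    counts = Counter()
    for name in sorted(os.listdir(output_dir)):
        path = os.path.join(output_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    counts[json.loads(line)[label_field]] += 1
    return counts
```

For the run above, `count_domains(output_file_path).most_common()` would list each predicted domain with its document count.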
