# Distributed Data Classification with NeMo Curator's `QualityClassifier`

This notebook demonstrates NeMo Curator's `QualityClassifier`. The [quality classifier](https://huggingface.co/nvidia/quality-classifier-deberta) labels text as high, medium, or low quality. These labels support data annotation, which is useful when blending data for foundation model training. For more information about the classifier, including its output labels, see the NemoCurator Quality Classifier DeBERTa page on Hugging Face: https://huggingface.co/nvidia/quality-classifier-deberta.

The quality classifier is accelerated using [CrossFit](https://github.com/rapidsai/crossfit), a library that leverages intelligent batching and RAPIDS to accelerate offline inference on large datasets.

Before running this notebook, please see this [Getting Started](https://github.com/NVIDIA/NeMo-Curator?tab=readme-ov-file#get-started) page for instructions on how to install NeMo Curator.

In [1]:
# Silence Warnings (HuggingFace internal warnings)

%env PYTHONWARNINGS=ignore
import warnings
warnings.filterwarnings("ignore")



In [2]:
from nemo_curator import get_client
from nemo_curator.classifiers import QualityClassifier
from nemo_curator.datasets import DocumentDataset
import cudf
import dask_cudf

In [3]:
client = get_client(cluster_type="gpu")

cuDF Spilling is enabled


# Set Output File Path

Specify an empty directory below to store the output results.

In [None]:
output_file_path = "./quality_results/"

# Prepare Text Data and Initialize Classifier

In [5]:
low_quality_text = """
Volunteering

It's all about the warm, fuzzy feeling when you serve the community, without expectation of gain. Volunteering offers you the necessary experience and development skills to take forward with you, as you venture out to work with other people and apply what you learn, to achieve your career goals.

HOW IT WORKS

SEARCH

BOOK NOW

ENJOY THE SHOW

GET A FREE QUOTE

Planning your event ahead of time is the right move. Contact our experts and let us surprise you.
"""

In [6]:
medium_quality_text = "Traveling to Europe during the off-season can be a more budget-friendly option."

In [7]:
high_quality_text = """
Sharapova has been in New Zealand since well before the New Year, preparing for her 2011 start and requested the opening day match to test her form. "My last tournament was over two months ago and it will be really good to get back playing again."

"My priority since I have been here has been to adjust to time and conditions. I have had a couple of practices a day and think that has been really important."

The three-time Grand Slam champion who once stood number one next plays Voracova after winning their only previous match in 2003.
"""

In [8]:
# Create sample DataFrame
text = [low_quality_text, medium_quality_text, high_quality_text]
df = cudf.DataFrame({"text": text})
input_dataset = DocumentDataset(dask_cudf.from_cudf(df, npartitions=1))
write_to_filename = False

# Alternatively, read existing directory of JSONL files
# input_file_path="/input_data_dir/"
# input_dataset = DocumentDataset.read_json(
#     input_file_path, backend="cudf", add_filename=True
# )
# write_to_filename = True
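
The commented-out alternative above reads a directory of JSONL files. As a minimal sketch, assuming each line of such a file is a standalone JSON object with a `text` field (matching the column name used in the sample DataFrame), an input file could be produced with the standard library alone. The file name `sample.jsonl`, the temporary directory, and the second sample document are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample documents; only the "text" field name comes from the notebook.
documents = [
    {"text": "Traveling to Europe during the off-season can be a more budget-friendly option."},
    {"text": "A second sample document."},
]

# Write one JSON object per line (the JSONL convention).
input_dir = Path(tempfile.mkdtemp())
jsonl_path = input_dir / "sample.jsonl"
with jsonl_path.open("w", encoding="utf-8") as f:
    for doc in documents:
        f.write(json.dumps(doc) + "\n")

# Read the file back to confirm that each line parses as its own record.
with jsonl_path.open(encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(len(records), "records written to", jsonl_path)
```

The resulting directory could then be passed as `input_file_path` in the commented-out code above.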

In [9]:
classifier = QualityClassifier(batch_size=1024)

# If desired, you may filter your dataset with:
# classifier = QualityClassifier(batch_size=1024, filter_by=["High", "Medium"])
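
To illustrate what label-based filtering amounts to, here is a plain-Python sketch. It assumes that `filter_by` keeps only documents whose predicted label appears in the given list; that behavior is an assumption here, not something this notebook verifies. The records are hypothetical, and the `quality_pred` field name is taken from the output shown later in this notebook.

```python
# Hypothetical records mimicking the classifier's output columns.
records = [
    {"text": "doc A", "quality_pred": "Low"},
    {"text": "doc B", "quality_pred": "Medium"},
    {"text": "doc C", "quality_pred": "High"},
]

# Same labels as in the commented-out filter_by example.
keep_labels = {"High", "Medium"}

# Keep only records whose predicted label is in the allowed set.
filtered = [r for r in records if r["quality_pred"] in keep_labels]
print([r["quality_pred"] for r in filtered])
```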

# Run the Classifier

Dask operations are lazy, so the classifier will not run until we call an eager operation like `to_json`, `compute`, or `persist`.
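
As a rough analogy for lazy evaluation, the pure-Python sketch below defers work until a result is explicitly requested. It is not how Dask is implemented: `LazyValue` and the toy `classify` rule are hypothetical names introduced only for illustration.

```python
class LazyValue:
    """Hypothetical stand-in for a lazy task: stores a function, runs it on demand."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Work happens only here, when a result is explicitly requested.
        return self.func(*self.args)


calls = []


def classify(text):
    calls.append(text)  # record that the (pretend) classifier actually ran
    return "High" if len(text) > 20 else "Low"  # toy rule, not the real model


pending = LazyValue(classify, "A short text")
ran_before_compute = len(calls)  # nothing has executed yet
label = pending.compute()        # triggers execution
ran_after_compute = len(calls)   # the function has now run once
print(ran_before_compute, ran_after_compute, label)
```

In the same spirit, building the classification pipeline only describes the work, and the eager call is what triggers inference.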

In [10]:
%%time

result_dataset = classifier(dataset=input_dataset)
result_dataset.to_json(output_path=output_file_path, write_to_filename=write_to_filename)

Starting quality classifier inference
Writing to disk complete for 1 partition(s)
CPU times: user 2.84 s, sys: 1.2 s, total: 4.04 s
Wall time: 19.8 s


# Inspect the Output

In [11]:
output_dataset = DocumentDataset.read_json(output_file_path, backend="cudf", add_filename=write_to_filename)
output_dataset.head(3)

Reading 1 files


| | quality_pred | quality_prob | text |
|---|---|---|---|
| 0 | Low | [0.0006659966000000001, 0.037424959199999996, ...] | \nVolunteering\n\nIt's all about the warm, fuz... |
| 1 | Medium | [0.2652127147, 0.6983160973, 0.0364712216] | Traveling to Europe during the off-season can ... |
| 2 | High | [0.7135943174000001, 0.2841255367, 0.002280103...] | \nSharapova has been in New Zealand since well... |
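
The `quality_prob` column holds one probability per label, and `quality_pred` appears to be the label with the highest probability. The sketch below assumes the order High, Medium, Low, which is inferred from the sample rows above rather than confirmed here; the model's Hugging Face page is the authoritative source for the mapping. `LABELS` and `label_from_probs` are hypothetical names.

```python
# Assumed label order, inferred from the sample rows above (not confirmed by the source).
LABELS = ["High", "Medium", "Low"]


def label_from_probs(probs):
    """Return the label whose probability is highest."""
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return LABELS[best_index]


# Probabilities from the "Medium" row of the output above.
medium_row_probs = [0.2652127147, 0.6983160973, 0.0364712216]
print(label_from_probs(medium_row_probs))
```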
