# Distributed Data Classification with Domain and Quality Classifiers

This notebook demonstrates distributed data classification with two classifiers: a domain classifier and a quality classifier. The domain classifier predicts the domain of each document, and the quality classifier predicts its quality. The resulting annotations help with data blending for foundation model training.

The classifiers are accelerated using [CrossFit](https://github.com/rapidsai/crossfit), a library that leverages intelligent batching and RAPIDS to accelerate offline inference on large datasets.
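
To give a feel for why batching strategy matters, here is a minimal plain-Python sketch of one common technique: grouping texts of similar length so that less padding is wasted per batch. It is an illustration only. Whether CrossFit batches in exactly this way is an assumption, and `length_sorted_batches` and `padded_tokens` are hypothetical helpers, not part of any library.

```python
def length_sorted_batches(texts, batch_size):
    """Yield (indices, batch) pairs that group texts of similar length."""
    # Whitespace token count is a crude stand-in for real model tokens.
    order = sorted(range(len(texts)), key=lambda i: len(texts[i].split()))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield idx, [texts[i] for i in idx]


def padded_tokens(batch):
    # Each sequence is padded to the longest sequence in its batch.
    longest = max(len(t.split()) for t in batch)
    return longest * len(batch)


texts = ["a b c d e f g h", "a", "a b c d e f g", "a b"]

# Naive batching keeps the original order.
naive = [texts[0:2], texts[2:4]]
sorted_batches = [b for _, b in length_sorted_batches(texts, batch_size=2)]

naive_cost = sum(padded_tokens(b) for b in naive)
sorted_cost = sum(padded_tokens(b) for b in sorted_batches)
print(naive_cost, sorted_cost)
```

The returned indices let the caller restore the original document order after inference.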

In [1]:
# Silence Warnings (HuggingFace internal warnings)

%env PYTHONWARNINGS=ignore
%env DASK_DATAFRAME__QUERY_PLANNING=False
import warnings
warnings.filterwarnings("ignore")

env: DASK_DATAFRAME__QUERY_PLANNING=False


In [2]:
from nemo_curator import get_client
from nemo_curator.classifiers import DomainClassifier, QualityClassifier
from nemo_curator.datasets import DocumentDataset
import cudf
import dask_cudf

In [3]:
client = get_client(cluster_type="gpu")

# Set File Paths 

In [4]:
output_file_path = "output_data_dir/"
quality_model_path = "quality_model.pth"

# Create a Classifier

In [5]:
classifier_type = "DomainClassifier" # or "QualityClassifier"

In [6]:
# Create sample DataFrame
text = [
    "Quantum computing is set to revolutionize the field of cryptography.",
    "Investing in index funds is a popular strategy for long-term financial growth.",
    "Recent advancements in gene therapy offer new hope for treating genetic disorders.",
    "Online learning platforms have transformed the way students access educational resources.",
    "Traveling to Europe during the off-season can be a more budget-friendly option.",
    "Training regimens for athletes have become more sophisticated with the use of data analytics.",
    "Streaming services are changing the way people consume television and film content.",
    "Vegan recipes have gained popularity as more people adopt plant-based diets.",
    "Climate change research is critical for developing sustainable environmental policies.",
    "Telemedicine has become increasingly popular due to its convenience and accessibility.",
]
df = cudf.DataFrame({"text": text})
input_dataset = DocumentDataset(dask_cudf.from_cudf(df, npartitions=1))
write_to_filename = False

# Alternatively, read existing directory of JSONL files
# input_file_path="/input_data_dir/"
# input_dataset = DocumentDataset.read_json(
#     input_file_path, backend="cudf", add_filename=True
# )
# write_to_filename = True
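
For the JSONL alternative, the sketch below writes a small input file using only the standard library. It assumes one JSON object per line with a `text` field, matching the column name in the sample DataFrame. The temporary directory and the `sample.jsonl` file name are hypothetical.

```python
import json
import os
import tempfile

records = [
    {"text": "Quantum computing is set to revolutionize the field of cryptography."},
    {"text": "Investing in index funds is a popular strategy for long-term financial growth."},
]

# Hypothetical input directory; point input_file_path at a directory like this.
input_dir = tempfile.mkdtemp()
path = os.path.join(input_dir, "sample.jsonl")

# JSONL: one JSON object per line.
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back to confirm it round-trips.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```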

In [7]:
if classifier_type == "DomainClassifier":
    classifier = DomainClassifier(batch_size=1024)

elif classifier_type == "QualityClassifier":
    classifier = QualityClassifier(
        model_path=quality_model_path,
        batch_size=1024,
    )

else:
    raise ValueError("Invalid classifier type")

# Run the Classifier

Dask operations are lazy, so the classifier will not run until we call an eager operation such as `to_json`, `compute`, or `persist`.
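
The idea of lazy evaluation can be shown with a toy analogy in plain Python. This is not Dask's API: the built-in `map` stands in for building a lazy pipeline, `list` stands in for an eager call, and `classify` is a hypothetical placeholder for an expensive model call.

```python
calls = []


def classify(doc):
    # Hypothetical stand-in for an expensive per-document model call.
    calls.append(doc)
    return {"text": doc, "label": "long" if len(doc) > 20 else "short"}


docs = ["Quantum computing.", "Investing in index funds is popular."]

# Building the pipeline does no work: map() returns a lazy iterator.
lazy_result = map(classify, docs)
calls_before = len(calls)

# Consuming the iterator triggers the work, like an eager operation.
results = list(lazy_result)
calls_after = len(calls)

print(calls_before, calls_after)
```

In the same way, calling the classifier on a dataset only describes the work. Inference starts when the result is written or computed.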

In [8]:
%%time

result_dataset = classifier(dataset=input_dataset)
result_dataset.to_json(output_file_dir=output_file_path, write_to_filename=write_to_filename)

Starting domain classifier inference


GPU: 0, Part: 0: 100%|██████████| 10/10 [00:04<00:00,  2.12it/s]


Writing to disk complete for 1 partitions
CPU times: user 393 ms, sys: 244 ms, total: 638 ms
Wall time: 6.04 s


# Inspect the Output

In [9]:
output_dataset = DocumentDataset.read_json(output_file_path, backend="cudf", add_filename=write_to_filename)
output_dataset.df.head()

Reading 1 files


| | domain_pred | text |
|---|---|---|
| 0 | Computers_and_Electronics | Quantum computing is set to revolutionize the ... |
| 1 | Finance | Investing in index funds is a popular strategy... |
| 2 | Health | Recent advancements in gene therapy offer new ... |
| 3 | Jobs_and_Education | Online learning platforms have transformed the... |
| 4 | Travel_and_Transportation | Traveling to Europe during the off-season can ... |
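
Since these annotations are meant to inform data blending, a natural next step is to tally how many documents fall into each domain. The sketch below does this in plain Python, with the labels copied by hand from the five rows displayed above. On a real dataset the counts would come from the `domain_pred` column instead.

```python
from collections import Counter

# Labels copied from the five rows shown in the output above.
domain_pred = [
    "Computers_and_Electronics",
    "Finance",
    "Health",
    "Jobs_and_Education",
    "Travel_and_Transportation",
]

counts = Counter(domain_pred)
total = sum(counts.values())

# Share of each domain in the sample, usable as a starting point for blend weights.
shares = {domain: n / total for domain, n in counts.items()}

for domain, share in shares.items():
    print(f"{domain}: {share:.0%}")
```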
