# Distributed Data Classification with Quality and Domain Classifiers

This notebook demonstrates distributed data classification with two classifiers: a quality classifier, which labels the quality of each document, and a domain classifier, which labels its domain. These classifiers help with annotation, which in turn helps with data blending for foundation model training.

The classifiers are accelerated using [CrossFit](https://github.com/rapidsai/crossfit), a library that leverages intelligent batching and RAPIDS to accelerate offline inference on large datasets.
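To give a feel for why batching strategy matters for inference, here is a minimal sketch of one common approach: grouping sequences of similar length so that each batch needs less padding. It uses only the standard library and is not CrossFit's implementation; the helper names `length_sorted_batches` and `padded_tokens` and the sample lengths are assumptions made up for illustration.

```python
from typing import Iterator, List, Sequence


def length_sorted_batches(
    token_ids: Sequence[Sequence[int]], batch_size: int
) -> Iterator[List[int]]:
    """Yield lists of row indices, grouping rows of similar length together."""
    # Sort row indices by sequence length so neighbours need similar padding.
    order = sorted(range(len(token_ids)), key=lambda i: len(token_ids[i]))
    for start in range(0, len(order), batch_size):
        yield order[start : start + batch_size]


def padded_tokens(
    token_ids: Sequence[Sequence[int]], batches: Sequence[Sequence[int]]
) -> int:
    """Total tokens processed when each batch is padded to its longest row."""
    return sum(len(b) * max(len(token_ids[i]) for i in b) for b in batches)


# Hypothetical tokenized documents with very different lengths.
docs = [[1] * n for n in (3, 50, 4, 48, 5, 52)]

naive = [[0, 1], [2, 3], [4, 5]]  # batches formed in arrival order
smart = list(length_sorted_batches(docs, batch_size=2))

print(padded_tokens(docs, naive))  # 300
print(padded_tokens(docs, smart))  # 208
```

In this toy case the length-sorted batches process about a third fewer padded tokens than the arrival-order batches. The row indices are returned rather than the rows themselves so that predictions can be mapped back to the original order.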

In [1]:
#### Silence Warnings (HuggingFace internal warnings)

%env PYTHONWARNINGS=ignore
import warnings
warnings.filterwarnings("ignore")



In [2]:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from nemo_curator import DomainClassifier, QualityClassifier
from nemo_curator.datasets import DocumentDataset

In [3]:
cluster = LocalCUDACluster(rmm_async=True, rmm_pool_size="1GB")
client = Client(cluster)

# Define the data file paths 

In [11]:
input_file_path = "/input_data_dir/"
output_file_path = "output_data_dir/"
domain_model_path = "domain_model.pth"
quality_model_path = "quality_model.pth"

# Create a Classifier

In [5]:
classifier_type = "DomainClassifier"  # or "QualityClassifier"

In [6]:
%%time

input_dataset = DocumentDataset.read_json(
    input_file_path, backend="cudf", add_filename=True
)

if classifier_type == "DomainClassifier":
    domain_labels = [
        "Adult",
        "Arts_and_Entertainment",
        "Autos_and_Vehicles",
        "Beauty_and_Fitness",
        "Books_and_Literature",
        "Business_and_Industrial",
        "Computers_and_Electronics",
        "Finance",
        "Food_and_Drink",
        "Games",
        "Health",
        "Hobbies_and_Leisure",
        "Home_and_Garden",
        "Internet_and_Telecom",
        "Jobs_and_Education",
        "Law_and_Government",
        "News",
        "Online_Communities",
        "People_and_Society",
        "Pets_and_Animals",
        "Real_Estate",
        "Science",
        "Sensitive_Subjects",
        "Shopping",
        "Sports",
        "Travel_and_Transportation",
    ]
    classifier = DomainClassifier(
        model_path=domain_model_path,
        labels=domain_labels,
        batch_size=1024,
    )
elif classifier_type == "QualityClassifier":
    quality_labels = ["High", "Medium", "Low"]
    classifier = QualityClassifier(
        model_path=quality_model_path,
        labels=quality_labels,
        batch_size=1024,
    )
else:
    raise ValueError("Invalid classifier type")

Reading 16 files


CPU times: user 10.5 s, sys: 5.33 s, total: 15.8 s
Wall time: 11.4 s


# Run the Classifier

Dask operations are lazy, so the classifier will not run until we call an eager operation such as `to_json`, `compute`, or `persist`.
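The lazy-versus-eager distinction can be illustrated with a toy stand-in written with only the standard library. This is not Dask and does not use its API; `LazyPipeline` and `fake_classify` are hypothetical names invented for this sketch. The point is only that building the pipeline records work, while the eager call performs it.

```python
from typing import Callable, List


class LazyPipeline:
    """Toy lazy collection: records steps and runs them only on demand."""

    def __init__(self, data: List[str]) -> None:
        self._data = data
        self._steps: List[Callable[[str], str]] = []

    def map(self, fn: Callable[[str], str]) -> "LazyPipeline":
        # Only record the step; nothing is executed here.
        new = LazyPipeline(self._data)
        new._steps = self._steps + [fn]
        return new

    def compute(self) -> List[str]:
        # The eager call: apply every recorded step to every item.
        out = []
        for item in self._data:
            for fn in self._steps:
                item = fn(item)
            out.append(item)
        return out


calls: List[str] = []


def fake_classify(text: str) -> str:
    """Stand-in for an expensive model call; logs each invocation."""
    calls.append(text)
    return f"{text}:labelled"


pipeline = LazyPipeline(["doc-a", "doc-b"]).map(fake_classify)
print(len(calls))  # 0, because building the pipeline ran nothing

result = pipeline.compute()
print(len(calls))  # 2, because compute() triggered the work
print(result)      # ['doc-a:labelled', 'doc-b:labelled']
```

The same reasoning explains why the timing of the cell that constructs the classifier says little about inference cost: the expensive work is attributed to whichever cell triggers execution.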

In [8]:
%%time

result_dataset = classifier(dataset=input_dataset)
result_dataset.to_json(output_file_dir=output_file_path, write_to_filename=True)

Starting domain classifier inference


GPU: 0, Part: 1: 100%|██████████| 938/938 [00:09<00:00, 101.99it/s]
GPU: 0, Part: 3: 100%|██████████| 938/938 [00:10<00:00, 92.36it/s]
GPU: 0, Part: 0: 100%|██████████| 938/938 [00:10<00:00, 91.25it/s]
GPU: 0, Part: 5: 100%|██████████| 938/938 [00:10<00:00, 88.82it/s]
GPU: 0, Part: 14: 100%|██████████| 937/937 [00:10<00:00, 88.11it/s]
GPU: 0, Part: 8: 100%|██████████| 937/937 [00:10<00:00, 85.46it/s]
GPU: 0, Part: 9: 100%|██████████| 937/937 [00:10<00:00, 86.16it/s]
GPU: 0, Part: 4: 100%|██████████| 938/938 [00:10<00:00, 85.65it/s]
GPU: 0, Part: 11: 100%|██████████| 937/937 [00:11<00:00, 83.73it/s]
GPU: 0, Part: 6: 100%|██████████| 938/938 [00:11<00:00, 83.62it/s]
GPU: 0, Part: 10: 100%|██████████| 937/937 [00:11<00:00, 81.27it/s]
GPU: 0, Part: 2: 100%|██████████| 938/938 [00:12<00:00, 72.59it/s]
GPU: 0, Part: 7: 100%|██████████| 937/937 [00:13<00:00, 71.75it/s]
GPU: 0, Part: 12: 100%|██████████| 937/937 [00:13<00:00, 69.12it/s]
GPU: 0, Part: 15: 100%|██████████| 937/937 

Writing to disk complete for 16 partitions
CPU times: user 2.34 s, sys: 2.24 s, total: 4.58 s
Wall time: 17.2 s


#### Inspect the Output

In [9]:
output_dataset = DocumentDataset.read_json(output_file_path, backend="cudf", add_filename=True)
output_dataset.df.head(2)

Reading 16 files


| | adlr_id | domain_pred | filename | id | pred | source_id | split_id | text | url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | cc-2022-40-0431053204 | Online_Communities | 00.jsonl | a8083fe4-525d-4888-8513-b91f43bd8ee1 | Online_Communities | crawl-data-CC-MAIN-2022-40-segments-1664030336... | lambada-0003225258-0000 | Having been a community leader—and member—for ... | https://lisalarter.com/7-tips-for-building-ste... |
| 1 | cc-2022-40-0510168267 | Finance | 00.jsonl | 559febdc-cb7f-4217-897a-c8dac325123b | Finance | crawl-data-CC-MAIN-2022-40-segments-1664030337... | lambada-0003918122-0000 | Zelle is a way of sending money to almost anyo... | https://oregonmassageandwellnessclinic.com/app... |
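Beyond looking at the first rows, a quick sanity check is to tally how many documents received each label. The sketch below does this with only the standard library, assuming the output directory holds JSON Lines files with a `domain_pred` field as in the preview above. The helper `count_labels` is a hypothetical name, and the demo records are made up so the example runs without the real output.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_labels(output_dir: str, column: str = "domain_pred") -> Counter:
    """Count how often each predicted label appears across *.jsonl files."""
    counts: Counter = Counter()
    for path in sorted(Path(output_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    counts[json.loads(line)[column]] += 1
    return counts


# Demo on made-up records written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    rows = [
        {"text": "hypothetical doc 1", "domain_pred": "Finance"},
        {"text": "hypothetical doc 2", "domain_pred": "Finance"},
        {"text": "hypothetical doc 3", "domain_pred": "Online_Communities"},
    ]
    Path(tmp, "00.jsonl").write_text(
        "\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8"
    )
    demo_counts = count_labels(tmp)

print(demo_counts.most_common())
```

This reads files one line at a time on the CPU, so it suits spot checks on modest outputs; for large outputs, a group-by on the loaded dataframe would likely be the better fit.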


##### Clean up the output file

In [10]:
!rm -rf $output_file_path