# Distributed Data Classification with Multiple Quality Classifiers

This notebook demonstrates distributed data classification with five quality classifiers. Each classifier corresponds to one fold and was trained on 80% of the data, so their results can be ensembled into a single prediction. These classifiers help annotate data, which in turn supports data blending for foundation model training.

The classifiers are accelerated using [CrossFit](https://github.com/rapidsai/crossfit), a library that leverages intelligent batching and RAPIDS to accelerate offline inference on large datasets.
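
As a rough illustration of what length-aware batching can look like, the sketch below sorts texts by length before grouping them, so that the items in each batch need a similar amount of padding. This is not CrossFit's implementation: the `length_sorted_batches` helper and its parameters are hypothetical, and character count stands in for token count.

```python
from typing import Iterator, List, Tuple


def length_sorted_batches(
    texts: List[str], batch_size: int
) -> Iterator[List[Tuple[int, str]]]:
    """Yield batches of (original_index, text) pairs grouped by similar length."""
    # Sort indices by text length so neighbours in a batch need similar padding.
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    for start in range(0, len(order), batch_size):
        yield [(i, texts[i]) for i in order[start : start + batch_size]]


# Hypothetical sample inputs of varying length.
sample_texts = [
    "a short one",
    "tiny",
    "a considerably longer piece of text",
    "mid length",
]
batches = list(length_sorted_batches(sample_texts, batch_size=2))
```

Carrying the original index alongside each text makes it possible to restore the input order after inference.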

In [1]:
# Silence Warnings (HuggingFace internal warnings)

%env PYTHONWARNINGS=ignore
%env DASK_DATAFRAME__QUERY_PLANNING=False
import warnings
warnings.filterwarnings("ignore")

env: DASK_DATAFRAME__QUERY_PLANNING=False


In [2]:
from nemo_curator import QualityClassifier
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.distributed_utils import get_client
import cudf
import dask_cudf

In [3]:
client = get_client(cluster_type="gpu")

# Set File Paths 

In [4]:
# Placeholder paths: replace these with your own output directory and model checkpoints.
output_file_path = "output_data_dir/"
quality_model_paths = [
    "quality_model0.pth",
    "quality_model1.pth",
    "quality_model2.pth",
    "quality_model3.pth",
    "quality_model4.pth",
]

# Paths used for the run shown in this notebook; these assignments override the placeholders above.
output_file_path = "/home/nfs/syurick/NeMo-Curator/tutorials/distributed_data_classification/nb_q_result"
quality_model_paths = [
    "/home/nfs/syurick/LLM_quality_classifier_inference/ensemble/quality_surge_full_22828_1e5_5ep_bs64_1024_fold0_best.pth",
    "/home/nfs/syurick/LLM_quality_classifier_inference/ensemble/quality_surge_full_22828_1e5_5ep_bs64_1024_fold1_best.pth",
    "/home/nfs/syurick/LLM_quality_classifier_inference/ensemble/quality_surge_full_22828_1e5_5ep_bs64_1024_fold2_best.pth",
    "/home/nfs/syurick/LLM_quality_classifier_inference/ensemble/quality_surge_full_22828_1e5_5ep_bs64_1024_fold3_best.pth",
    "/home/nfs/syurick/LLM_quality_classifier_inference/ensemble/quality_surge_full_22828_1e5_5ep_bs64_1024_fold4_best.pth",
]
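
Before building the classifiers, it can help to confirm that every checkpoint path exists. The sketch below is a minimal check using only the standard library; the `find_missing_paths` helper is hypothetical, and it only inspects the filesystem of the machine running the notebook, not that of any remote workers.

```python
from pathlib import Path
from typing import Iterable, List


def find_missing_paths(paths: Iterable[str]) -> List[str]:
    """Return the entries of `paths` that do not exist on the local filesystem."""
    return [p for p in paths if not Path(p).exists()]


# Hypothetical usage with two of the short file names from the cell above.
missing = find_missing_paths(["quality_model0.pth", "quality_model1.pth"])
if missing:
    print(f"{len(missing)} checkpoint file(s) not found: {missing}")
```

Passing `quality_model_paths` instead of the literal list would check all five checkpoints at once.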

# Create and Run Classifiers

In [5]:
# Create sample DataFrame
text = [
    "Quantum computing is set to revolutionize the field of cryptography.",
    "Investing in index funds is a popular strategy for long-term financial growth.",
    "Recent advancements in gene therapy offer new hope for treating genetic disorders.",
    "Online learning platforms have transformed the way students access educational resources.",
    "Traveling to Europe during the off-season can be a more budget-friendly option.",
    "Training regimens for athletes have become more sophisticated with the use of data analytics.",
    "Streaming services are changing the way people consume television and film content.",
    "Vegan recipes have gained popularity as more people adopt plant-based diets.",
    "Climate change research is critical for developing sustainable environmental policies.",
    "Telemedicine has become increasingly popular due to its convenience and accessibility.",
]
df = cudf.DataFrame({"text": text})
dataset = DocumentDataset(dask_cudf.from_cudf(df, npartitions=1))
write_to_filename = False

# Alternatively, read an existing directory of JSONL files.
# The result is assigned to `dataset` so that the cells below work unchanged.
# input_file_path = "/input_data_dir/"
# dataset = DocumentDataset.read_json(
#     input_file_path, backend="cudf", add_filename=True
# )
# write_to_filename = True
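
To try the JSONL route without an existing corpus, a small input directory can be created first. The sketch below uses only the standard library; the `write_jsonl` helper, the `sample.jsonl` file name, and the temporary directory are assumptions for illustration. It writes one JSON object per line with a `text` field, matching the column name used in the sample DataFrame.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable


def write_jsonl(texts: Iterable[str], path: Path) -> int:
    """Write each text as a {"text": ...} object on its own line; return the count."""
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for t in texts:
            f.write(json.dumps({"text": t}) + "\n")
            count += 1
    return count


# Hypothetical stand-in for the /input_data_dir/ directory mentioned above.
input_dir = Path(tempfile.mkdtemp())
jsonl_path = input_dir / "sample.jsonl"
n_written = write_jsonl(
    ["Quantum computing is set to revolutionize the field of cryptography."],
    jsonl_path,
)
```

The resulting directory (`str(input_dir)`) could then be used as `input_file_path` in the commented-out cell.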

In [6]:
pred_columns = []
for fold, quality_model_path in enumerate(quality_model_paths):
    pred_column = f"quality_pred_{fold}"
    prob_column = f"quality_prob_{fold}"
    pred_columns.append(pred_column)

    # Each classifier adds its own prediction and probability columns to the dataset
    classifier = QualityClassifier(
        model_path=quality_model_path,
        batch_size=1024,
        pred_column=pred_column,
        prob_column=prob_column,
    )
    dataset = classifier(dataset=dataset)

Starting Quality classifier inference
Starting Quality classifier inference
Starting Quality classifier inference
Starting Quality classifier inference
Starting Quality classifier inference


In [7]:
%%time

dataset.df.compute()

GPU: 0, Part: 0: 100%|██████████| 10/10 [00:04<00:00,  2.44it/s]
GPU: 0, Part: 0: 100%|██████████| 10/10 [00:00<00:00, 30.69it/s]
GPU: 0, Part: 0: 100%|██████████| 10/10 [00:00<00:00, 29.82it/s]
GPU: 0, Part: 0: 100%|██████████| 10/10 [00:00<00:00, 31.67it/s]
GPU: 0, Part: 0: 100%|██████████| 10/10 [00:00<00:00, 82.84it/s]

CPU times: user 613 ms, sys: 361 ms, total: 974 ms
Wall time: 10.7 s


GPU: 0, Part: 0: 100%|██████████| 10/10 [00:00<00:00, 31.16it/s]


Unnamed: 0,text,quality_prob_0,quality_pred_0,quality_prob_1,quality_pred_1,quality_prob_2,quality_pred_2,quality_prob_3,quality_pred_3,quality_prob_4,quality_pred_4
0,Quantum computing is set to revolutionize the ...,"[0.3728572130203247, 0.499016135931015, 0.1281...",Medium,"[0.3728572130203247, 0.499016135931015, 0.1281...",Medium,"[0.3728572130203247, 0.499016135931015, 0.1281...",Medium,"[0.3728572130203247, 0.499016135931015, 0.1281...",Medium,"[0.3728572130203247, 0.499016135931015, 0.1281...",Medium
1,Investing in index funds is a popular strategy...,"[0.34133055806159973, 0.5345032215118408, 0.12...",Medium,"[0.34133055806159973, 0.5345032215118408, 0.12...",Medium,"[0.34133055806159973, 0.5345032215118408, 0.12...",Medium,"[0.34133055806159973, 0.5345032215118408, 0.12...",Medium,"[0.34133055806159973, 0.5345032215118408, 0.12...",Medium
2,Recent advancements in gene therapy offer new ...,"[0.3898108899593353, 0.4821754992008209, 0.128...",Medium,"[0.3898108899593353, 0.4821754992008209, 0.128...",Medium,"[0.3898108899593353, 0.4821754992008209, 0.128...",Medium,"[0.3898108899593353, 0.4821754992008209, 0.128...",Medium,"[0.3898108899593353, 0.4821754992008209, 0.128...",Medium
3,Online learning platforms have transformed the...,"[0.38701269030570984, 0.4876796007156372, 0.12...",Medium,"[0.38701269030570984, 0.4876796007156372, 0.12...",Medium,"[0.38701269030570984, 0.4876796007156372, 0.12...",Medium,"[0.38701269030570984, 0.4876796007156372, 0.12...",Medium,"[0.38701269030570984, 0.4876796007156372, 0.12...",Medium
4,Traveling to Europe during the off-season can ...,"[0.32102224230766296, 0.5830105543136597, 0.09...",Medium,"[0.32102224230766296, 0.5830105543136597, 0.09...",Medium,"[0.32102224230766296, 0.5830105543136597, 0.09...",Medium,"[0.32102224230766296, 0.5830105543136597, 0.09...",Medium,"[0.32102224230766296, 0.5830105543136597, 0.09...",Medium
5,Training regimens for athletes have become mor...,"[0.34178370237350464, 0.5548713207244873, 0.10...",Medium,"[0.34178370237350464, 0.5548713207244873, 0.10...",Medium,"[0.34178370237350464, 0.5548713207244873, 0.10...",Medium,"[0.34178370237350464, 0.5548713207244873, 0.10...",Medium,"[0.34178370237350464, 0.5548713207244873, 0.10...",Medium
6,Streaming services are changing the way people...,"[0.35998600721359253, 0.525088906288147, 0.114...",Medium,"[0.35998600721359253, 0.525088906288147, 0.114...",Medium,"[0.35998600721359253, 0.525088906288147, 0.114...",Medium,"[0.35998600721359253, 0.525088906288147, 0.114...",Medium,"[0.35998600721359253, 0.525088906288147, 0.114...",Medium
7,Vegan recipes have gained popularity as more p...,"[0.3145926594734192, 0.5717698335647583, 0.113...",Medium,"[0.3145926594734192, 0.5717698335647583, 0.113...",Medium,"[0.3145926594734192, 0.5717698335647583, 0.113...",Medium,"[0.3145926594734192, 0.5717698335647583, 0.113...",Medium,"[0.3145926594734192, 0.5717698335647583, 0.113...",Medium
8,Climate change research is critical for develo...,"[0.3767526149749756, 0.5015591979026794, 0.121...",Medium,"[0.3767526149749756, 0.5015591979026794, 0.121...",Medium,"[0.3767526149749756, 0.5015591979026794, 0.121...",Medium,"[0.3767526149749756, 0.5015591979026794, 0.121...",Medium,"[0.3767526149749756, 0.5015591979026794, 0.121...",Medium
9,Telemedicine has become increasingly popular d...,"[0.34938278794288635, 0.5144903659820557, 0.13...",Medium,"[0.34938278794288635, 0.5144903659820557, 0.13...",Medium,"[0.34938278794288635, 0.5144903659820557, 0.13...",Medium,"[0.34938278794288635, 0.5144903659820557, 0.13...",Medium,"[0.34938278794288635, 0.5144903659820557, 0.13...",Medium
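
The per-fold probability columns can be combined into a single ensembled prediction. One simple approach, sketched below in plain Python, is to average the probability vectors element-wise and take the most probable class. The probability values here are made up (the table above truncates the real ones), the `ensemble_mean` helper is hypothetical, and the label order is an assumption.

```python
from typing import List, Sequence

# Assumed label order; only "Medium" at index 1 is consistent with the output above.
LABELS = ["High", "Medium", "Low"]


def ensemble_mean(fold_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average the class-probability vectors from several folds element-wise."""
    n_folds = len(fold_probs)
    n_classes = len(fold_probs[0])
    return [sum(p[c] for p in fold_probs) / n_folds for c in range(n_classes)]


# Made-up probabilities for one document from three hypothetical folds.
fold_probs = [
    [0.40, 0.50, 0.10],
    [0.30, 0.55, 0.15],
    [0.35, 0.45, 0.20],
]
mean_probs = ensemble_mean(fold_probs)
best_index = max(range(len(mean_probs)), key=mean_probs.__getitem__)
ensemble_label = LABELS[best_index]
```

Averaging probabilities rather than taking a majority vote over the `quality_pred_*` labels keeps information about how confident each fold was.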


In [8]:
dataset.to_json(output_file_dir=output_file_path, write_to_filename=write_to_filename)

GPU: 0, Part: 0: 100%|██████████| 10/10 [00:04<00:00,  2.41it/s]
GPU: 0, Part: 0: 100%|██████████| 10/10 [00:00<00:00, 27.78it/s]
GPU: 0, Part: 0: 100%|██████████| 10/10 [00:00<00:00, 29.53it/s]
GPU: 0, Part: 0: 100%|██████████| 10/10 [00:00<00:00, 28.27it/s]
GPU: 0, Part: 0: 100%|██████████| 10/10 [00:00<00:00, 30.57it/s]


Writing to disk complete for 1 partitions


# Inspect the Output

In [9]:
output_dataset = DocumentDataset.read_json(output_file_path, backend="cudf", add_filename=write_to_filename)
output_dataset.df.head()

Reading 1 files


Unnamed: 0,quality_pred_0,quality_pred_1,quality_pred_2,quality_pred_3,quality_pred_4,quality_prob_0,quality_prob_1,quality_prob_2,quality_prob_3,quality_prob_4,text
0,Medium,Medium,Medium,Medium,Medium,"[0.3728572130000001, 0.4990161359000001, 0.128...","[0.3728572130000001, 0.4990161359000001, 0.128...","[0.3728572130000001, 0.4990161359000001, 0.128...","[0.3728572130000001, 0.4990161359000001, 0.128...","[0.3728572130000001, 0.4990161359000001, 0.128...",Quantum computing is set to revolutionize the ...
1,Medium,Medium,Medium,Medium,Medium,"[0.3413305581, 0.5345032215, 0.12416620550000003]","[0.3413305581, 0.5345032215, 0.12416620550000003]","[0.3413305581, 0.5345032215, 0.12416620550000003]","[0.3413305581, 0.5345032215, 0.12416620550000003]","[0.3413305581, 0.5345032215, 0.12416620550000003]",Investing in index funds is a popular strategy...
2,Medium,Medium,Medium,Medium,Medium,"[0.38981089000000013, 0.4821754992000001, 0.12...","[0.38981089000000013, 0.4821754992000001, 0.12...","[0.38981089000000013, 0.4821754992000001, 0.12...","[0.38981089000000013, 0.4821754992000001, 0.12...","[0.38981089000000013, 0.4821754992000001, 0.12...",Recent advancements in gene therapy offer new ...
3,Medium,Medium,Medium,Medium,Medium,"[0.38701269030000013, 0.48767960070000005, 0.1...","[0.38701269030000013, 0.48767960070000005, 0.1...","[0.38701269030000013, 0.48767960070000005, 0.1...","[0.38701269030000013, 0.48767960070000005, 0.1...","[0.38701269030000013, 0.48767960070000005, 0.1...",Online learning platforms have transformed the...
4,Medium,Medium,Medium,Medium,Medium,"[0.3210222423000001, 0.5830105542999999, 0.095...","[0.3210222423000001, 0.5830105542999999, 0.095...","[0.3210222423000001, 0.5830105542999999, 0.095...","[0.3210222423000001, 0.5830105542999999, 0.095...","[0.3210222423000001, 0.5830105542999999, 0.095...",Traveling to Europe during the off-season can ...
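
The probabilities read back from JSON differ slightly from the in-memory values printed earlier (for example, 0.34133055806159973 versus 0.3413305581), so an exact equality check between the two is likely to fail. A tolerance-based comparison, sketched here with values copied from the two tables, is one way to verify the round trip; the tolerance of `1e-9` is an assumption.

```python
import math

# First two probabilities for row 1, copied from the in-memory and re-read tables.
in_memory = [0.34133055806159973, 0.5345032215118408]
from_json = [0.3413305581, 0.5345032215]

# Exact comparison fails because the written values carry fewer digits.
exact_match = in_memory == from_json

# Comparison with an absolute tolerance treats the values as equal.
close_match = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(in_memory, from_json)
)
```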
