# Distributed Data Classification with Multiple Classifiers

Cross-validation is a machine learning technique in which multiple models are each trained on a different subset of the data and validated on the remaining portion. It reduces the risk of overfitting and gives a more reliable estimate of how a model will perform on unseen data. This is particularly valuable when data is limited, because every sample is used for both training and validation across the folds. Cross-validation typically produces one trained model per fold.
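To make the idea concrete, here is a minimal sketch of how k-fold splitting might look in plain Python. The helper name `k_fold_indices` is hypothetical and is not part of NeMo Curator; it only illustrates how 5 folds over 10 samples lead to 5 train/validation splits, and therefore 5 models.

```python
def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds (train, validation) index pairs.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    splits = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, validation))
        start += size
    return splits


splits = k_fold_indices(n_samples=10, n_folds=5)
for fold, (train, validation) in enumerate(splits):
    print(f"fold {fold}: train on {len(train)} samples, validate on {validation}")
```

Each sample appears in exactly one validation set, so one model trained per split sees every sample either during training or during validation.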

In this tutorial, we demonstrate how to use NeMo Curator's `PyTorchClassifier` class to load multiple pretrained models and perform batched inference with them. We assume the user has pretrained PTH model files, with [DeBERTaV3](https://huggingface.co/microsoft/deberta-v3-base) as the base model used for training. The classifiers are accelerated using [CrossFit](https://github.com/rapidsai/crossfit), a library that leverages intelligent batching and RAPIDS to accelerate offline inference on large datasets.
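As a rough illustration of why batching strategy matters, the sketch below groups texts of similar length so that each batch needs little padding. This is an assumption made for illustration: it is not CrossFit's actual implementation, the helper `length_sorted_batches` is hypothetical, and character length stands in for token length.

```python
def length_sorted_batches(texts, batch_size):
    """Yield (original_indices, batch_texts), grouping texts of similar length.

    Hypothetical helper: sorts by character length as a stand-in for token count.
    """
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    for start in range(0, len(order), batch_size):
        batch_indices = order[start:start + batch_size]
        yield batch_indices, [texts[i] for i in batch_indices]


sample_texts = ["a b c d e f", "a", "a b c", "a b", "a b c d e"]
batches = list(length_sorted_batches(sample_texts, batch_size=2))
for batch_indices, batch_texts in batches:
    print(batch_indices, batch_texts)
```

Keeping the original indices alongside each batch lets the predictions be restored to the input order afterwards.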

First, let's run some preliminary imports and set up our Dask client.

In [1]:
# Silence Warnings (HuggingFace internal warnings)
%env PYTHONWARNINGS=ignore
import warnings
warnings.filterwarnings("ignore")



In [2]:
from nemo_curator.classifiers import PyTorchClassifier
from nemo_curator.datasets import DocumentDataset
from nemo_curator import get_client
import cudf
import dask_cudf

In [3]:
client = get_client(cluster_type="gpu")

cuDF Spilling is enabled


# Prepare Dataset and Set File Paths

Next, we need to create or read the dataset on which we want to run inference. In this notebook, we provide a sample dataset of 10 sentences to evaluate. Alternatively, the user may read in their own existing data (e.g., JSON or Parquet files); the commented code demonstrates this for a directory of JSONL files.

In [4]:
# Create sample DataFrame
text = [
    "Quantum computing is set to revolutionize the field of cryptography.",
    "Investing in index funds is a popular strategy for long-term financial growth.",
    "Recent advancements in gene therapy offer new hope for treating genetic disorders.",
    "Online learning platforms have transformed the way students access educational resources.",
    "Traveling to Europe during the off-season can be a more budget-friendly option.",
    "Training regimens for athletes have become more sophisticated with the use of data analytics.",
    "Streaming services are changing the way people consume television and film content.",
    "Vegan recipes have gained popularity as more people adopt plant-based diets.",
    "Climate change research is critical for developing sustainable environmental policies.",
    "Telemedicine has become increasingly popular due to its convenience and accessibility.",
]
df = cudf.DataFrame({"text": text})
dataset = DocumentDataset(dask_cudf.from_cudf(df, npartitions=1))
write_to_filename = False

# Alternatively, read an existing directory of JSONL files
# input_file_path = "/input_data_dir/"
# dataset = DocumentDataset.read_json(
#     input_file_path, backend="cudf", add_filename=True
# )
# write_to_filename = True
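For readers preparing their own input, a JSONL file stores one JSON object per line. The following is a minimal sketch using only the standard library, assuming each record carries a `text` field to match the `text_field="text"` argument used later. The helpers `write_jsonl` and `read_jsonl` and the file name `sample.jsonl` are hypothetical.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write two of the tutorial's sample sentences to a temporary directory.
example_dir = tempfile.mkdtemp()
example_path = os.path.join(example_dir, "sample.jsonl")
write_jsonl(
    [
        {"text": "Quantum computing is set to revolutionize the field of cryptography."},
        {"text": "Vegan recipes have gained popularity as more people adopt plant-based diets."},
    ],
    example_path,
)
records = read_jsonl(example_path)
print(len(records), "records; first text:", records[0]["text"])
```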

The user should also specify where to write the results, as well as the local file paths to the pretrained PyTorch classifiers. Finally, the user should include the labels the classifier is expected to produce.

In [None]:
output_file_path = "output_data_dir/"
model_paths = [
    "model0.pth",
    "model1.pth",
    "model2.pth",
    "model3.pth",
    "model4.pth",
]
labels = ["label_a", "label_b", "label_c"]
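Before launching inference, it can be worth confirming that every model file actually exists. The check below is a small sketch using only the standard library; `find_missing_files` is a hypothetical helper, not a NeMo Curator function.

```python
from pathlib import Path


def find_missing_files(paths):
    """Return the subset of paths that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


# Same placeholder file names as in the cell above.
candidate_paths = ["model0.pth", "model1.pth", "model2.pth", "model3.pth", "model4.pth"]
missing = find_missing_files(candidate_paths)
if missing:
    print("Missing model files:", missing)
else:
    print("All model files found.")
```

Failing early here is cheaper than discovering a bad path after the Dask cluster has already started work.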

# Run Classification with Multiple Models

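Now we can use the `PyTorchClassifier` class to load each of our PyTorch models and run inference. Each model writes its predictions and probabilities to its own pair of columns (`pred_0`/`prob_0`, `pred_1`/`prob_1`, and so on). We then write the results to the output directory with `to_json`. As the progress bars in the output show, the inference itself runs when the results are written, not when each classifier is called.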

In [6]:
pred_columns = []
for fold, model_path in enumerate(model_paths):
    # Give each model its own output columns so results are not overwritten
    pred_column = f"pred_{fold}"
    prob_column = f"prob_{fold}"
    pred_columns.append(pred_column)

    classifier = PyTorchClassifier(
        pretrained_model_name_or_path=model_path,
        labels=labels,
        batch_size=1024,
        text_field="text",
        pred_column=pred_column,
        prob_column=prob_column,
    )
    # Each call adds this model's prediction and probability columns
    dataset = classifier(dataset=dataset)

Starting PyTorch classifier inference
Starting PyTorch classifier inference
Starting PyTorch classifier inference
Starting PyTorch classifier inference
Starting PyTorch classifier inference
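Each `prob_i` column holds one probability per label, and each `pred_i` column holds a single label. The sketch below assumes that the predicted label is the one with the highest probability and that probabilities are ordered like `labels`. The source does not state this explicitly, although the sample output at the end of this tutorial is consistent with it. The helper `label_from_probabilities` is hypothetical.

```python
labels = ["label_a", "label_b", "label_c"]


def label_from_probabilities(probabilities, labels):
    """Return the label with the highest probability (assumed argmax rule)."""
    if len(probabilities) != len(labels):
        raise ValueError("Expected one probability per label")
    best_index = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best_index]


# A prob_0 vector taken from this tutorial's sample output.
example_probabilities = [0.34135937690000007, 0.5343321562, 0.1243084297]
predicted = label_from_probabilities(example_probabilities, labels)
print(predicted)  # label_b
```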


In [7]:
%%time

dataset.to_json(output_file_dir=output_file_path, write_to_filename=write_to_filename)

GPU: tcp://127.0.0.1:34075, Part: 0: 100%|██████████| 10/10 [00:08<00:00,  1.23it/s]
GPU: tcp://127.0.0.1:34075, Part: 0: 100%|██████████| 10/10 [00:05<00:00,  1.83it/s]
GPU: tcp://127.0.0.1:34075, Part: 0: 100%|██████████| 10/10 [00:05<00:00,  1.81it/s]
GPU: tcp://127.0.0.1:34075, Part: 0: 100%|██████████| 10/10 [00:05<00:00,  1.80it/s]
GPU: tcp://127.0.0.1:34075, Part: 0: 100%|██████████| 10/10 [00:04<00:00,  2.02it/s]

Writing to disk complete for 1 partitions
CPU times: user 5.39 s, sys: 3.1 s, total: 8.49 s
Wall time: 48.8 s




# Inspect the Output

Finally, let's verify that everything worked as expected.

In [8]:
output_dataset = DocumentDataset.read_json(output_file_path, backend="cudf", add_filename=write_to_filename)
output_dataset.df.head()

Reading 1 files


| | pred_0 | pred_1 | pred_2 | pred_3 | pred_4 | prob_0 | prob_1 | prob_2 | prob_3 | prob_4 | text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | label_b | label_b | label_b | label_b | label_b | [0.37283509970000006, 0.49910834430000006, 0.1... | [0.3027972281, 0.5215288401, 0.1756739765] | [0.41288739440000005, 0.5265461801999999, 0.06... | [0.32485893370000013, 0.46514019370000004, 0.2... | [0.3685780168000001, 0.5256645678999999, 0.105... | Quantum computing is set to revolutionize the ... |
| 1 | label_b | label_b | label_b | label_b | label_b | [0.34135937690000007, 0.5343321562, 0.1243084297] | [0.34347015620000004, 0.5304207801999999, 0.12... | [0.4346009791000001, 0.5130862594, 0.052312787... | [0.3181181848000001, 0.4944583774000001, 0.187... | [0.39643365140000003, 0.5143401027, 0.08922628... | Investing in index funds is a popular strategy... |
| 2 | label_b | label_b | label_b | label_b | label_b | [0.38975748420000006, 0.48216831680000005, 0.1... | [0.33265304570000004, 0.5090963244, 0.1582506448] | [0.44722059370000006, 0.4945448935000001, 0.05... | [0.3444236219000001, 0.45550799370000006, 0.20... | [0.3919632137000001, 0.5084934831, 0.099543325... | Recent advancements in gene therapy offer new ... |
| 3 | label_b | label_b | label_b | label_b | label_b | [0.38686266540000014, 0.48784771560000006, 0.1... | [0.3482291102, 0.5138959289, 0.13787493110000001] | [0.4499093592, 0.49849084020000006, 0.05159985... | [0.3489176929000001, 0.45996120570000004, 0.19... | [0.38338246940000015, 0.5131927133, 0.10342480... | Online learning platforms have transformed the... |
| 4 | label_b | label_b | label_b | label_b | label_b | [0.3207181096000001, 0.5833522080999999, 0.095... | [0.3277938664, 0.5600519180000001, 0.112154245... | [0.39969193940000003, 0.5546463728000001, 0.04... | [0.3249147236000001, 0.5021025537999999, 0.172... | [0.35228130220000003, 0.5585800408999999, 0.08... | Traveling to Europe during the off-season can ... |
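The tutorial leaves the five models' outputs as separate columns. If a single label per document is wanted, two common options are a majority vote over the `pred_*` columns and an argmax over the averaged `prob_*` vectors. The sketch below shows both in plain Python for one row. This is not something the notebook itself does: the helper names are hypothetical, and the probability values are illustrative (rounded, with the truncated entries invented), not the actual output.

```python
from collections import Counter

labels = ["label_a", "label_b", "label_c"]


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]


def mean_probability_label(probability_vectors, labels):
    """Average per-model probabilities and return (best_label, mean_vector)."""
    n_models = len(probability_vectors)
    mean = [
        sum(vector[i] for vector in probability_vectors) / n_models
        for i in range(len(labels))
    ]
    best_index = max(range(len(labels)), key=lambda i: mean[i])
    return labels[best_index], mean


# One row's predictions from the five models.
row_predictions = ["label_b", "label_b", "label_b", "label_b", "label_b"]
# Hypothetical, rounded probability vectors for the same row.
row_probabilities = [
    [0.37, 0.50, 0.13],
    [0.30, 0.52, 0.18],
    [0.41, 0.53, 0.06],
    [0.32, 0.47, 0.21],
    [0.37, 0.53, 0.10],
]

voted_label = majority_vote(row_predictions)
ensemble_label, mean_vector = mean_probability_label(row_probabilities, labels)
print(voted_label, ensemble_label)
```

Averaging probabilities uses each model's confidence, whereas voting only uses its top choice, so the two can disagree when the models are split.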


Thank you for reading! In this tutorial, we demonstrated how to use the `PyTorchClassifier` to load locally stored PyTorch models and run inference on our dataset.

For more information about NeMo Curator's `DistributedDataClassifier`, see the [documentation page](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/distributeddataclassification.html). For an example of how to run NeMo Curator's `DomainClassifier` and `QualityClassifier`, see [this sample notebook](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/distributed_data_classification/distributed_data_classification.ipynb).