# Distributed Data Classification with NeMo Curator's `FineWebNemotronEduClassifier`

This notebook demonstrates NeMo Curator's `FineWebNemotronEduClassifier`. The [FineWeb Nemotron-4 Edu classifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier) scores the educational value of a text on a scale from 0 (low) to 5 (high). These scores support data annotation, which is useful in data blending for foundation model training. See the Hugging Face page linked above for more information about the classifier, including its output labels.

The FineWeb Nemotron-4 Edu classifier is accelerated using [CrossFit](https://github.com/rapidsai/crossfit), a library that leverages intelligent batching and RAPIDS to accelerate offline inference on large datasets.

Before running this notebook, please see this [Getting Started](https://github.com/NVIDIA/NeMo-Curator?tab=readme-ov-file#get-started) page for instructions on how to install NeMo Curator.

In [1]:
# Silence Warnings (HuggingFace internal warnings)

%env PYTHONWARNINGS=ignore
import warnings
warnings.filterwarnings("ignore")



In [None]:
from nemo_curator import get_client
from nemo_curator.classifiers import FineWebNemotronEduClassifier
from nemo_curator.datasets import DocumentDataset
import cudf
import dask_cudf

In [3]:
client = get_client(cluster_type="gpu")

cuDF Spilling is enabled


# Set Output File Path

Specify an empty directory below for storing the output results.

In [None]:
output_file_path = "./fineweb_nemotron_edu_results/"
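
Since the output directory is expected to be empty, it can help to check this before running inference. The following is a minimal sketch using only the Python standard library; `prepare_output_dir` is a hypothetical helper introduced here for illustration and is not part of NeMo Curator.

```python
import tempfile
from pathlib import Path


def prepare_output_dir(path) -> Path:
    """Create `path` if it does not exist and raise if it already has contents."""
    out_dir = Path(path)
    out_dir.mkdir(parents=True, exist_ok=True)
    if any(out_dir.iterdir()):
        raise FileExistsError(f"{out_dir} is not empty; choose another directory")
    return out_dir


# Demonstration on a throwaway directory so nothing on disk is left behind.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = prepare_output_dir(Path(tmp) / "fineweb_nemotron_edu_results")
    is_empty = not any(out_dir.iterdir())
    print(f"Created {out_dir.name}, empty: {is_empty}")
```

In this notebook, one could call `prepare_output_dir(output_file_path)` right after setting the path, so that a leftover run is caught before any results are written.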

# Prepare Text Data and Initialize Classifier

In [5]:
# Create sample DataFrame
text = [
    "Quantum computing is set to revolutionize the field of cryptography.",
    "Investing in index funds is a popular strategy for long-term financial growth.",
    "Recent advancements in gene therapy offer new hope for treating genetic disorders.",
    "Online learning platforms have transformed the way students access educational resources.",
    "Traveling to Europe during the off-season can be a more budget-friendly option.",
    "Training regimens for athletes have become more sophisticated with the use of data analytics.",
    "Streaming services are changing the way people consume television and film content.",
    "Vegan recipes have gained popularity as more people adopt plant-based diets.",
    "Climate change research is critical for developing sustainable environmental policies.",
    "Telemedicine has become increasingly popular due to its convenience and accessibility.",
]
df = cudf.DataFrame({"text": text})
input_dataset = DocumentDataset(dask_cudf.from_cudf(df, npartitions=1))
write_to_filename = False

# Alternatively, read existing directory of JSONL files
# input_file_path="/input_data_dir/"
# input_dataset = DocumentDataset.read_json(
#     input_file_path, backend="cudf", add_filename=True
# )
# write_to_filename = True

In [None]:
classifier = FineWebNemotronEduClassifier(batch_size=1024)

# Run the Classifier

Dask operations are lazy, so the classifier will not run until we call an eager operation like `to_json`, `compute`, or `persist`.

In [7]:
%%time

result_dataset = classifier(dataset=input_dataset)
result_dataset.to_json(output_path=output_file_path, write_to_filename=write_to_filename)

Starting FineWeb Nemotron-4 Edu Classifier inference


GPU: tcp://127.0.0.1:33569, Part: 0: 100%|██████████| 10/10 [00:01<00:00,  5.04it/s]

Writing to disk complete for 1 partition(s)
CPU times: user 1.35 s, sys: 172 ms, total: 1.52 s
Wall time: 14.8 s


GPU: tcp://127.0.0.1:33569, Part: 0: 100%|██████████| 10/10 [00:02<00:00,  3.73it/s]


# Inspect the Output

In [8]:
output_dataset = DocumentDataset.read_json(output_file_path, backend="cudf", add_filename=write_to_filename)
output_dataset.head()

Reading 1 files with blocksize='1gb' / files_per_partition=None


| | fineweb-nemotron-edu-score | fineweb-nemotron-edu-score-int | fineweb-nemotron-edu-score-label | text |
|---|---|---|---|---|
| 0 | 1.392578 | 1 | low_quality | Quantum computing is set to revolutionize the ... |
| 1 | 0.889648 | 1 | low_quality | Investing in index funds is a popular strategy... |
| 2 | 1.34375 | 1 | low_quality | Recent advancements in gene therapy offer new ... |
| 3 | 1.731445 | 2 | low_quality | Online learning platforms have transformed the... |
| 4 | 0.248535 | 0 | low_quality | Traveling to Europe during the off-season can ... |
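
The relationship between the three score columns can be sketched in plain Python. This is an illustration, not the classifier's actual implementation: it assumes the integer score is the float score rounded and clipped to the 0-5 range (which matches the five rows above), and that the label is `high_quality` when the float score is at least 2.5, a threshold taken as an assumption from the model card. The function names are hypothetical.

```python
def score_to_int(score: float) -> int:
    """Round to the nearest integer and clip to the 0-5 range (assumed rule)."""
    # Note: Python's round() uses banker's rounding at exact .5 values,
    # which may differ from the rounding used inside the library.
    return int(min(5, max(0, round(score))))


def score_to_label(score: float, threshold: float = 2.5) -> str:
    """Map a float score to a quality label (the 2.5 threshold is an assumption)."""
    return "high_quality" if score >= threshold else "low_quality"


# Float scores copied from the output shown above.
sample_scores = [1.392578, 0.889648, 1.34375, 1.731445, 0.248535]
int_scores = [score_to_int(s) for s in sample_scores]
labels = [score_to_label(s) for s in sample_scores]

for s, i, label in zip(sample_scores, int_scores, labels):
    print(f"{s:.6f} -> {i} ({label})")
```

Under these assumptions the sketch reproduces the integer scores and labels of the five sample rows, all of which fall below the threshold.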
