# Distributed Data Classification with NeMo Curator's `FineWebEduClassifier`

This notebook demonstrates NeMo Curator's `FineWebEduClassifier`. The [FineWeb-Edu classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) scores the educational value of web pages. These scores support data annotation, which is useful for data blending in foundation model training. See the linked Hugging Face page for more information about the classifier.

This tutorial requires at least 1 NVIDIA GPU with:
  - Volta™ or higher (compute capability 7.0+)
  - CUDA 12.x

Before running this notebook, please see the [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) for instructions on how to install NeMo Curator. Be sure to use an installation method that includes GPU dependencies.

In [None]:
# Silence Curator logs via Loguru
import os

os.environ["LOGURU_LEVEL"] = "ERROR"

The following imports are required for this tutorial:

In [None]:
import shutil

import pandas as pd

from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.classifiers import FineWebEduClassifier
from nemo_curator.stages.text.io.reader.jsonl import JsonlReader
from nemo_curator.stages.text.io.writer.jsonl import JsonlWriter

To run a pipeline in NeMo Curator, we must start a Ray cluster. This can be done manually (see the [Ray documentation](https://docs.ray.io/en/latest/ray-core/starting-ray.html)) or with Curator's `RayClient`:

In [3]:
try:
    ray_client = RayClient()
    ray_client.start()
except Exception as e:
    msg = f"Error initializing Ray client: {e}"
    raise RuntimeError(msg) from e

# Initialize Read, Classification, and Write Stages

Processing steps in NeMo Curator are called stages. In this tutorial, we initialize three stages: a JSONL file reader, a FineWeb-Edu classification stage, and a JSONL file writer.

First, we create a sample JSONL file to use as input:

In [4]:
input_file_path = "./input_data_dir"

# Create sample dataset for the tutorial
text = [
    "Quantum computing is set to revolutionize the field of cryptography.",
    "Investing in index funds is a popular strategy for long-term financial growth.",
    "Recent advancements in gene therapy offer new hope for treating genetic disorders.",
    "Online learning platforms have transformed the way students access educational resources.",
    "Traveling to Europe during the off-season can be a more budget-friendly option.",
    "Training regimens for athletes have become more sophisticated with the use of data analytics.",
    "Streaming services are changing the way people consume television and film content.",
    "Vegan recipes have gained popularity as more people adopt plant-based diets.",
    "Climate change research is critical for developing sustainable environmental policies.",
    "Telemedicine has become increasingly popular due to its convenience and accessibility.",
]
df = pd.DataFrame({"text": text})

try:
    os.makedirs(input_file_path, exist_ok=True)
    df.to_json(os.path.join(input_file_path, "data.jsonl"), orient="records", lines=True)
except Exception as e:
    msg = f"Error creating input file: {e}"
    raise RuntimeError(msg) from e

We can define the reader stage with:

In [5]:
# Read existing directory of JSONL files
read_stage = JsonlReader(input_file_path, files_per_partition=1)

Under the hood, the `FineWebEduClassifier` stage is broken down into two stages: a tokenizer stage, which runs on the CPU, and a model inference stage, which runs on the GPU. A simplified view is shown below (some parameters and details are omitted for brevity; please refer to the documentation for more details):

```python
# Simplified sketch of the two internal stages; constructor parameters are omitted
class TokenizerStage:
    def __init__(self):
        self._resources = Resources(cpus=1)  # CPU only
        self.model_identifier = "HuggingFaceFW/fineweb-edu-classifier"
        self.text_field = "text"
        self.padding_side = "right"
        ...


class ModelStage:
    def __init__(self):
        self._resources = Resources(cpus=1, gpus=1)  # 1 CPU and 1 GPU
        self.model_identifier = "HuggingFaceFW/fineweb-edu-classifier"
        self.model_inference_batch_size = 256
        ...
```

Optionally, the output may be filtered to keep only texts whose predicted label is listed in `filter_by`. If the `filter_by` parameter is set, a third stage is added:

```python
def filter_by_category(self, value: str) -> bool:
    return value in self.filter_by

...

if self.filter_by is not None and len(self.filter_by) > 0:
    self.stages.append(Filter(filter_fn=self.filter_by_category, filter_field=...))
```

The FineWeb-Edu classifier outputs a floating-point score ranging from 0 to 5. Curator labels samples with a score of 2.5 or higher as `high_quality` and samples with a score below 2.5 as `low_quality`.
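
The labeling rule can be written out in a few lines. This is a minimal sketch of the threshold described above, not Curator's implementation; the `label_from_score` helper and its `threshold` parameter are hypothetical names introduced for illustration, and the scores are made up.

```python
def label_from_score(score: float, threshold: float = 2.5) -> str:
    """Map a float score to a quality label (hypothetical helper)."""
    # Scores at or above the threshold are high quality; everything else is low quality
    return "high_quality" if score >= threshold else "low_quality"


# Made-up scores chosen to show the behavior on both sides of the threshold
for score in [0.48, 2.49, 2.5, 3.7]:
    print(score, label_from_score(score))
```

Note that a score of exactly 2.5 falls on the `high_quality` side.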

In [None]:
# Initialize the FineWeb-Edu classifier
classifier_stage = FineWebEduClassifier()

# If desired, you may filter your dataset with:
# classifier_stage = FineWebEduClassifier(filter_by=["high_quality"])  # noqa: ERA001
# or
# classifier_stage = FineWebEduClassifier(filter_by=["low_quality"])  # noqa: ERA001
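
The effect of the `filter_by` options can be illustrated with plain pandas. This is a rough sketch of the filtering logic only, not Curator's `Filter` stage; the `keep_labels` helper and the toy predictions are hypothetical, and the label column name is taken from the output shown at the end of this tutorial.

```python
import pandas as pd


def keep_labels(df: pd.DataFrame, filter_by: list[str], label_field: str) -> pd.DataFrame:
    """Keep only rows whose label is listed in filter_by (hypothetical helper)."""
    if not filter_by:
        # No filter requested: return the data unchanged
        return df
    return df[df[label_field].isin(filter_by)].reset_index(drop=True)


# Toy predictions, made up for illustration
preds = pd.DataFrame(
    {
        "text": ["doc a", "doc b", "doc c"],
        "fineweb-edu-score-label": ["low_quality", "high_quality", "low_quality"],
    }
)

high_only = keep_labels(preds, ["high_quality"], "fineweb-edu-score-label")
print(high_only["text"].tolist())
```

Rows with any other label are dropped, so a filtered run writes fewer records than it reads.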

Finally, we can define a stage for writing the results:

In [None]:
# Write results to a directory
output_file_path = "./fineweb_edu_classifier_results"

# Remove the output directory if it already exists, to ensure the correct output is written
if os.path.exists(output_file_path):
    shutil.rmtree(output_file_path)

write_stage = JsonlWriter(output_file_path)

# Initialize Pipeline

In NeMo Curator, pipelines run distributed data workflows on Ray. Pipelines take care of resource allocation and autoscaling to improve throughput and minimize GPU idleness.

For the distributed data classifiers, we achieve speedups by running model inference in parallel across all available GPUs, while other stages such as I/O, tokenization, and filtering run across all available CPUs. This is possible because Curator pipelines are composable, which allows each stage in a pipeline to run independently and with its own specified hardware resources.

In [8]:
classifier_pipeline = Pipeline(name="classifier_pipeline", description="Run a classifier pipeline")

# Add stages to the pipeline
classifier_pipeline.add_stage(read_stage)
classifier_pipeline.add_stage(classifier_stage)
classifier_pipeline.add_stage(write_stage)

Pipeline(name='classifier_pipeline', stages=[jsonl_reader(JsonlReader), fineweb_edu_classifier_classifier(FineWebEduClassifier), jsonl_writer(JsonlWriter)])

Composability is also what allows a classifier to sit between pre-processing and post-processing stages. Typical text pre-processing add-ons include text normalization (lowercasing, URL/email removal, Unicode cleanup) and language identification and filtering (to keep only target languages). A full pipeline may look something like:

```python
pipeline = Pipeline(name="full_pipeline")
pipeline.add_stage(read_stage)                # reader (JSONL/S3/etc.)
pipeline.add_stage(lang_id_stage)             # optional: language filter
pipeline.add_stage(classifier_stage)          # classifier
pipeline.add_stage(write_stage)               # writer (JSONL/Parquet)
```

# Run the Classifier

Let's run the full classifier pipeline:

In [None]:
# Run the pipeline
result = classifier_pipeline.run()

Since the pipeline ran to completion and the result was written to a JSONL file, we can shut down the Ray cluster with:

In [None]:
try:
    ray_client.stop()
except Exception as e:  # noqa: BLE001
    print(f"Error stopping Ray client: {e}")

# Inspect the Output

The write stage returns a list of written files. We can read the output file as a Pandas DataFrame for inspection.

In [None]:
# Retrieval indices here assume that the return shape matches the example data provided by this tutorial
# If you are using your own input dataset, you should inspect the result object yourself
result_file = result[0].data[0]

result_df = pd.read_json(result_file, lines=True)
result_df.head()

| | text | fineweb-edu-score-float | fineweb-edu-score-int | fineweb-edu-score-label |
|---|---|---|---|---|
| 0 | Quantum computing is set to revolutionize the ... | 1.466797 | 1 | low_quality |
| 1 | Investing in index funds is a popular strategy... | 0.481689 | 0 | low_quality |
| 2 | Recent advancements in gene therapy offer new ... | 1.375 | 1 | low_quality |
| 3 | Online learning platforms have transformed the... | 1.234375 | 1 | low_quality |
| 4 | Traveling to Europe during the off-season can ... | 0.135498 | 0 | low_quality |


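We can see that the predictions were generated as expected: each row keeps its original text and gains a float score, an integer score, and a quality label. All five samples shown score below 2.5 and are therefore labeled `low_quality`.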
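
If you are using your own dataset, the writer may produce more than one output file. The following is a minimal sketch for loading a whole results directory at once, assuming the output files end in `.jsonl`; the `read_jsonl_dir` helper is hypothetical and the demo records are made up.

```python
import glob
import os
import tempfile

import pandas as pd


def read_jsonl_dir(directory: str) -> pd.DataFrame:
    """Concatenate every *.jsonl file in a directory into one DataFrame (hypothetical helper)."""
    files = sorted(glob.glob(os.path.join(directory, "*.jsonl")))
    if not files:
        msg = f"No .jsonl files found in {directory}"
        raise FileNotFoundError(msg)
    frames = [pd.read_json(path, lines=True) for path in files]
    return pd.concat(frames, ignore_index=True)


# Self-contained demo on a temporary directory with made-up records
label_field = "fineweb-edu-score-label"
with tempfile.TemporaryDirectory() as tmp_dir:
    pd.DataFrame({"text": ["a", "b"], label_field: ["low_quality", "high_quality"]}).to_json(
        os.path.join(tmp_dir, "part_0.jsonl"), orient="records", lines=True
    )
    pd.DataFrame({"text": ["c"], label_field: ["low_quality"]}).to_json(
        os.path.join(tmp_dir, "part_1.jsonl"), orient="records", lines=True
    )
    combined = read_jsonl_dir(tmp_dir)

# Count how many samples received each label
print(combined[label_field].value_counts())
```

With the results from this tutorial, the equivalent call would be `read_jsonl_dir(output_file_path)`.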