
Add metrics for ScoreFilter benchmarks #1385

Merged
sarahyurick merged 4 commits into NVIDIA-NeMo:main from sarahyurick:score_filter_benchmark_metrics
Jan 16, 2026
Conversation

@sarahyurick (Contributor)

Closes #1381.

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Comment on lines 255 to 263
```yaml
requirements:
  # ensure the total number of documents processed is correct
  - metric: num_documents_processed
    min_value: 2119489
    max_value: 2119489
  # account for stochastic filters
  - metric: num_kept_documents
    min_value: 2090470
    max_value: 2090490
```
@sarahyurick (Contributor Author)
This corresponds to the dataset at /raid/prospector-lm/clean/tinystories_train_parquet.

greptile-apps bot commented Jan 16, 2026

Greptile Summary

This PR adds metrics tracking for ScoreFilter benchmarks to address issue #1381. The changes track both the number of documents processed and the number of documents kept after filtering.

Key changes:

  • Modified score_filter_benchmark.py to extract num_kept_documents from the final stage performance stats (_stage_perf[-1].num_items_processed)
  • Updated nightly-benchmark.yaml to include num_kept_documents and throughput_docs_per_sec in Slack sink metrics
  • Added validation requirements to ensure benchmarks track expected document counts (2,119,489 processed, ~2,090,480 kept)
  • Added throughput requirements to validate performance (minimums of 19,000 docs/sec for ray_data and 8,500 docs/sec for xenna)

The implementation correctly leverages the stage performance tracking system where _stage_perf[1] captures documents read by the ParquetReaderStage and _stage_perf[-1] captures documents written by the JsonlWriter after all filters have been applied.
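
As a rough sketch of that extraction path (variable names like output_tasks and run_time_taken are assumptions based on this summary, not the exact code in score_filter_benchmark.py):

```python
# Sketch only: `output_tasks`, `run_time_taken`, and the `_stage_perf` layout
# are assumed from the summary above, not copied from the benchmark script.
num_documents_processed = sum(
    task._stage_perf[1].num_items_processed  # docs read by ParquetReaderStage
    for task in output_tasks
)
num_kept_documents = sum(
    task._stage_perf[-1].num_items_processed  # docs written by JsonlWriter
    for task in output_tasks
)
throughput_docs_per_sec = num_documents_processed / run_time_taken
```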

Confidence Score: 5/5

  • This PR is safe to merge with no issues found
  • The changes are straightforward and correctly implement metrics tracking using existing stage performance infrastructure. The code properly accesses _stage_perf[1] for initial document count and _stage_perf[-1] for filtered document count. YAML configuration adds appropriate validation requirements with reasonable value ranges to account for stochastic filters. No bugs or security issues identified.
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| benchmarking/scripts/score_filter_benchmark.py | Added the num_kept_documents metric to track documents that passed all filters, using _stage_perf[-1].num_items_processed for the final count. |
| benchmarking/nightly-benchmark.yaml | Added metrics tracking (num_kept_documents, throughput_docs_per_sec) to the Slack sink and validation requirements for both benchmark entries. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Benchmark as Benchmark Script
    participant Pipeline as ScoreFilter Pipeline
    participant Reader as File Reader Stage
    participant Filters as ScoreFilter Stages (x29)
    participant Writer as JsonlWriter Stage
    participant Metrics as Metrics Collection

    Benchmark->>Pipeline: run(executor)
    Pipeline->>Reader: process files (_stage_perf[1])
    Reader->>Reader: Read 2,119,489 documents
    Reader-->>Filters: Pass documents

    loop For each ScoreFilter stage
        Filters->>Filters: Apply heuristic filter
        Filters->>Filters: Filter out documents that fail
    end

    Filters-->>Writer: Pass ~2,090,480 documents (_stage_perf[-1])
    Writer->>Writer: Write filtered documents
    Writer-->>Pipeline: Return output_tasks

    Pipeline-->>Benchmark: output_tasks with _stage_perf

    Benchmark->>Metrics: Extract num_documents_processed
    Note over Metrics: sum(task._stage_perf[1].num_items_processed)
    
    Benchmark->>Metrics: Extract num_kept_documents
    Note over Metrics: sum(task._stage_perf[-1].num_items_processed)

    Benchmark->>Metrics: Calculate throughput
    Note over Metrics: num_documents_processed / run_time_taken

    Metrics-->>Benchmark: Return comprehensive metrics
    Benchmark->>Benchmark: Validate against requirements in YAML
```

@greptile-apps greptile-apps bot left a comment
Additional Comments (1)

  1. benchmarking/scripts/score_filter_benchmark.py, line 114

    logic: num_kept_documents is not initialized in the exception handler but is used on line 128 (see the sketch below)
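
A minimal sketch of the fix pattern the bot is pointing at (the surrounding try/except structure is assumed, not copied from the script): initialize the metric before the try so the except path cannot leave it undefined.

```python
# Assumed structure, for illustration; `pipeline` and `executor` stand in for
# the real objects in score_filter_benchmark.py.
import logging

logger = logging.getLogger(__name__)

num_kept_documents = 0  # default, so the except path can't leave it undefined
try:
    output_tasks = pipeline.run(executor)
    num_kept_documents = sum(
        task._stage_perf[-1].num_items_processed for task in output_tasks
    )
except Exception:
    logger.exception("ScoreFilter benchmark run failed")
# metrics reporting below can use num_kept_documents on either path
```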

2 files reviewed, 1 comment


```yaml
- metric: num_kept_documents
  min_value: 2090470
  max_value: 2090490
```

Contributor
Another requirement to add here, to catch perf regressions with ray_data / xenna, would be throughput_docs_per_sec, which for ray_data is 19,808 and for xenna 8,716. You can multiply that by 0.95 to get a 5% buffer, so I'd say:

Suggested change

```yaml
requirements:
  # Observed throughput of 19,800 docs/sec so we allow a 5% buffer to account for variability
  - metric: throughput_docs_per_sec
    min_value: 18810
```
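
For reference, applying the same 5% buffer to both observed rates quoted above (a quick sketch using this comment's numbers):

```python
# Derive regression thresholds from the observed throughputs in this comment.
observed_docs_per_sec = {"ray_data": 19_808, "xenna": 8_716}
thresholds = {name: int(rate * 0.95) for name, rate in observed_docs_per_sec.items()}
print(thresholds)  # {'ray_data': 18817, 'xenna': 8280}
```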

@sarahyurick (Contributor Author) commented Jan 16, 2026
Sure. Going to push slightly different numbers than what you listed. The numbers I was actually seeing were ~20k and ~9k. LMK what you think.

sarahyurick merged commit 2ac59c8 into NVIDIA-NeMo:main Jan 16, 2026
18 checks passed


Development

Successfully merging this pull request may close these issues.

Track number of filtered/kept documents per DocumentFilter
