
Add metrics for ScoreFilter benchmarks #1385

Merged
sarahyurick merged 4 commits into NVIDIA-NeMo:main from sarahyurick:score_filter_benchmark_metrics
Jan 16, 2026
Conversation

@sarahyurick (Contributor)

Closes #1381.

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>
Comment on lines 255 to 263
```yaml
requirements:
  # ensure the total number of documents processed is correct
  - metric: num_documents_processed
    min_value: 2119489
    max_value: 2119489
  # account for stochastic filters
  - metric: num_kept_documents
    min_value: 2090470
    max_value: 2090490
```
@sarahyurick (Contributor Author)
This corresponds to the dataset at /raid/prospector-lm/clean/tinystories_train_parquet.

greptile-apps bot commented Jan 16, 2026

Greptile Summary

This PR adds metrics tracking for ScoreFilter benchmarks to address issue #1381. The changes track both the number of documents processed and the number of documents kept after filtering.

Key changes:

  • Modified score_filter_benchmark.py to extract num_kept_documents from the final stage performance stats (_stage_perf[-1].num_items_processed)
  • Updated nightly-benchmark.yaml to include num_kept_documents and throughput_docs_per_sec in Slack sink metrics
  • Added validation requirements to ensure benchmarks track expected document counts (2,119,489 processed, ~2,090,480 kept)
  • Added throughput requirements to validate performance (minimums of 19,000 docs/sec for ray_data and 8,500 docs/sec for xenna)

The implementation correctly leverages the stage performance tracking system where _stage_perf[1] captures documents read by the ParquetReaderStage and _stage_perf[-1] captures documents written by the JsonlWriter after all filters have been applied.
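
As a rough sketch of that extraction path (variable names like output_tasks and run_time_taken are assumptions based on this summary, not the exact code in score_filter_benchmark.py):

```python
# Sketch only: `output_tasks`, `run_time_taken`, and the `_stage_perf` layout
# are assumed from the summary above, not copied from the benchmark script.
num_documents_processed = sum(
    task._stage_perf[1].num_items_processed  # docs read by ParquetReaderStage
    for task in output_tasks
)
num_kept_documents = sum(
    task._stage_perf[-1].num_items_processed  # docs written by JsonlWriter
    for task in output_tasks
)
throughput_docs_per_sec = num_documents_processed / run_time_taken
```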

Confidence Score: 5/5

  • This PR is safe to merge with no issues found
  • The changes are straightforward and correctly implement metrics tracking using existing stage performance infrastructure. The code properly accesses _stage_perf[1] for initial document count and _stage_perf[-1] for filtered document count. YAML configuration adds appropriate validation requirements with reasonable value ranges to account for stochastic filters. No bugs or security issues identified.
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| benchmarking/scripts/score_filter_benchmark.py | Added the num_kept_documents metric to track documents that passed all filters, using _stage_perf[-1].num_items_processed for the final count. |
| benchmarking/nightly-benchmark.yaml | Added metrics tracking (num_kept_documents, throughput_docs_per_sec) to the Slack sink and validation requirements for both benchmark entries. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Benchmark as Benchmark Script
    participant Pipeline as ScoreFilter Pipeline
    participant Reader as File Reader Stage
    participant Filters as ScoreFilter Stages (x29)
    participant Writer as JsonlWriter Stage
    participant Metrics as Metrics Collection

    Benchmark->>Pipeline: run(executor)
    Pipeline->>Reader: process files (_stage_perf[1])
    Reader->>Reader: Read 2,119,489 documents
    Reader-->>Filters: Pass documents

    loop For each ScoreFilter stage
        Filters->>Filters: Apply heuristic filter
        Filters->>Filters: Filter out documents that fail
    end

    Filters-->>Writer: Pass ~2,090,480 documents (_stage_perf[-1])
    Writer->>Writer: Write filtered documents
    Writer-->>Pipeline: Return output_tasks

    Pipeline-->>Benchmark: output_tasks with _stage_perf

    Benchmark->>Metrics: Extract num_documents_processed
    Note over Metrics: sum(task._stage_perf[1].num_items_processed)
    
    Benchmark->>Metrics: Extract num_kept_documents
    Note over Metrics: sum(task._stage_perf[-1].num_items_processed)

    Benchmark->>Metrics: Calculate throughput
    Note over Metrics: num_documents_processed / run_time_taken

    Metrics-->>Benchmark: Return comprehensive metrics
    Benchmark->>Benchmark: Validate against requirements in YAML
```

@greptile-apps greptile-apps bot left a comment
Additional Comments (1)

  1. benchmarking/scripts/score_filter_benchmark.py, line 114

    logic: num_kept_documents is not initialized in the exception handler but is used on line 128 (see the sketch below)
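
A minimal sketch of the fix pattern the bot is pointing at (the surrounding try/except structure is assumed, not copied from the script): initialize the metric before the try so the except path cannot leave it undefined.

```python
# Assumed structure, for illustration; `pipeline` and `executor` stand in for
# the real objects in score_filter_benchmark.py.
import logging

logger = logging.getLogger(__name__)

num_kept_documents = 0  # default, so the except path can't leave it undefined
try:
    output_tasks = pipeline.run(executor)
    num_kept_documents = sum(
        task._stage_perf[-1].num_items_processed for task in output_tasks
    )
except Exception:
    logger.exception("ScoreFilter benchmark run failed")
# metrics reporting below can use num_kept_documents on either path
```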

2 files reviewed, 1 comment


```yaml
- metric: num_kept_documents
  min_value: 2090470
  max_value: 2090490
```

Contributor
Another requirement to add here, to catch perf regressions with ray_data / xenna, would be throughput_docs_per_sec, which for ray_data is 19,808 and for xenna 8,716. You can multiply that by 0.95 to get a 5% buffer, so I'd say:

Suggested change

```yaml
requirements:
  # Observed throughput of 19,800 docs/sec so we allow a 5% buffer to account for variability
  - metric: throughput_docs_per_sec
    min_value: 18810
```
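
For reference, applying the same 5% buffer to both observed rates quoted above (a quick sketch using this comment's numbers):

```python
# Derive regression thresholds from the observed throughputs in this comment.
observed_docs_per_sec = {"ray_data": 19_808, "xenna": 8_716}
thresholds = {name: int(rate * 0.95) for name, rate in observed_docs_per_sec.items()}
print(thresholds)  # {'ray_data': 18817, 'xenna': 8280}
```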

@sarahyurick (Contributor Author) commented Jan 16, 2026
Sure. Going to push slightly different numbers than what you listed. The numbers I was actually seeing were ~20k and ~9k. LMK what you think.

sarahyurick merged commit 2ac59c8 into NVIDIA-NeMo:main Jan 16, 2026
18 checks passed


Development

Successfully merging this pull request may close these issues.

Track number of filtered/kept documents per DocumentFilter
