Add metrics for ScoreFilter benchmarks #1385
Conversation
```yaml
requirements:
  # ensure the total number of documents processed is correct
  - metric: num_documents_processed
    min_value: 2119489
    max_value: 2119489
  # account for stochastic filters
  - metric: num_kept_documents
    min_value: 2090470
    max_value: 2090490
```
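For context, a requirements block like this is enforced by comparing each computed metric against its bounds. Below is a minimal sketch of that check, assuming PyYAML is available and the metrics arrive as a flat dict; `validate_requirements` is a hypothetical helper for illustration, not the repo's actual API.

```python
import yaml

def validate_requirements(metrics: dict[str, float], requirements_yaml: str) -> list[str]:
    """Compare computed metrics against min/max bounds from a requirements block."""
    failures = []
    for req in yaml.safe_load(requirements_yaml)["requirements"]:
        name = req["metric"]
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: metric was never recorded")
            continue
        if "min_value" in req and value < req["min_value"]:
            failures.append(f"{name}: {value} < min {req['min_value']}")
        if "max_value" in req and value > req["max_value"]:
            failures.append(f"{name}: {value} > max {req['max_value']}")
    return failures
```

An empty return list means the benchmark run satisfied every bound; anything else is a regression report.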
This corresponds with the dataset at /raid/prospector-lm/clean/tinystories_train_parquet.
**Greptile Summary**

This PR adds metrics tracking for the ScoreFilter benchmarks.

Key changes:

- Extracts `num_documents_processed` and `num_kept_documents` from per-task `_stage_perf` data
- Adds min/max `requirements` bounds to the benchmark YAML
- Computes `throughput_docs_per_sec` from documents processed and run time

The implementation correctly leverages the stage performance tracking system, where each task records `num_items_processed` for every stage (see the extraction sketch after the sequence diagram below).

**Confidence Score:** 5/5
**Important Files Changed**
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant Benchmark as Benchmark Script
    participant Pipeline as ScoreFilter Pipeline
    participant Reader as File Reader Stage
    participant Filters as ScoreFilter Stages (x29)
    participant Writer as JsonlWriter Stage
    participant Metrics as Metrics Collection

    Benchmark->>Pipeline: run(executor)
    Pipeline->>Reader: process files (_stage_perf[1])
    Reader->>Reader: Read 2,119,489 documents
    Reader-->>Filters: Pass documents
    loop For each ScoreFilter stage
        Filters->>Filters: Apply heuristic filter
        Filters->>Filters: Filter out documents that fail
    end
    Filters-->>Writer: Pass ~2,090,480 documents (_stage_perf[-1])
    Writer->>Writer: Write filtered documents
    Writer-->>Pipeline: Return output_tasks
    Pipeline-->>Benchmark: output_tasks with _stage_perf
    Benchmark->>Metrics: Extract num_documents_processed
    Note over Metrics: sum(task._stage_perf[1].num_items_processed)
    Benchmark->>Metrics: Extract num_kept_documents
    Note over Metrics: sum(task._stage_perf[-1].num_items_processed)
    Benchmark->>Metrics: Calculate throughput
    Note over Metrics: num_documents_processed / run_time_taken
    Metrics-->>Benchmark: Return comprehensive metrics
    Benchmark->>Benchmark: Validate against requirements in YAML
```
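A minimal sketch of the metric extraction the diagram describes, assuming each output task carries a `_stage_perf` list whose entries expose `num_items_processed` (the attribute names come from the diagram notes above, not from a verified API; `collect_metrics` is a hypothetical helper):

```python
def collect_metrics(output_tasks, run_time_taken: float) -> dict[str, float]:
    """Aggregate per-task stage counters into benchmark-level metrics."""
    # _stage_perf[1] is the file-reader stage; _stage_perf[-1] is the final (writer) stage.
    num_documents_processed = sum(t._stage_perf[1].num_items_processed for t in output_tasks)
    num_kept_documents = sum(t._stage_perf[-1].num_items_processed for t in output_tasks)
    return {
        "num_documents_processed": num_documents_processed,
        "num_kept_documents": num_kept_documents,
        "throughput_docs_per_sec": num_documents_processed / run_time_taken,
    }
```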
**Additional Comments (1)**
- `benchmarking/scripts/score_filter_benchmark.py`, line 114

  **logic:** `num_kept_documents` is not initialized in the exception handler but is used on line 128
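One way to address this (a sketch only, since the surrounding script isn't shown in this thread; the `pipeline.run(executor)` call is taken from the sequence diagram) is to give `num_kept_documents` a default before the `try` block, so the reporting code on line 128 always has a value:

```python
num_kept_documents = 0  # default so the metric exists even if the run raises

try:
    output_tasks = pipeline.run(executor)
    num_kept_documents = sum(t._stage_perf[-1].num_items_processed for t in output_tasks)
except Exception as e:
    # Without the default above, referencing num_kept_documents after this
    # handler would raise an UnboundLocalError.
    print(f"Benchmark run failed: {e}")
```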
2 files reviewed, 1 comment
```yaml
- metric: num_kept_documents
  min_value: 2090470
  max_value: 2090490
```
Another requirement to add here, to catch perf regressions with Ray Data / Xenna, would be `throughput_docs_per_sec`, which for Ray Data is 19,808 docs/sec and for Xenna 8,716. You can multiply that by 0.95 to get a 5% buffer, so I'd say:
```yaml
requirements:
  # Observed throughput of 19,800 docs/sec, so we allow a 5% buffer to account for variability
  - metric: throughput_docs_per_sec
    min_value: 18810
```
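For reference, the buffer arithmetic behind that suggested floor (values taken from the comment above):

```python
observed = 19_800                  # observed Ray Data throughput, docs/sec
min_value = int(observed * 0.95)   # 5% buffer -> 18810
```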
Sure. Going to push slightly different numbers than what you listed. The numbers I was actually seeing were ~20k and ~9k. LMK what you think.
Closes #1381.