[benchmark] Add FastText filter benchmarking script (#1411) #1452
Conversation
Greptile Summary: This PR adds a new FastText filter benchmarking script (`fasttext_filter_benchmark.py`).
Confidence Score: 4/5
Flowchart

```mermaid
flowchart TD
    A[CLI Arguments] --> B[main]
    B --> C[run_fasttext_filter_benchmark]
    C --> D[setup_executor]
    C --> E[Build Hydra overrides list]
    E --> F[load_hydra_yaml]
    F --> G[compose DictConfig]
    G --> H[create_pipeline_from_yaml]
    H --> I["Pipeline with stages:\n0: ParquetReader\n1: ScoreFilter(FastTextLangId)\n2: ScoreFilter(FastTextQualityFilter)\n3: JsonlWriter"]
    I --> J[pipeline.run executor]
    J --> K{Success?}
    K -->|Yes| L[Collect metrics from _stage_perf]
    K -->|No| M[Set failure metrics]
    L --> N[write_benchmark_results]
    M --> N
    N --> O[Return exit code]
```
Last reviewed commit: 2eea7f8
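For orientation, the flowchart corresponds to roughly the control flow sketched below. This is a hypothetical sketch: the helper names (`setup_executor`, `load_hydra_yaml`, `create_pipeline_from_yaml`, `write_benchmark_results`) come from the flowchart itself, but their signatures and the metrics plumbing are assumptions, not the actual Curator API.

```python
import time

# Hypothetical sketch of the benchmark flow shown in the flowchart above.
# The helpers are named in the flowchart; their signatures here are
# assumptions, not Curator's real API.


def run_fasttext_filter_benchmark(args) -> int:
    executor = setup_executor(args.executor)  # "ray_data" or "xenna"

    # Hydra overrides: model paths from CLI plus any user-supplied overrides.
    overrides = [
        f"fasttext_langid_model_path={args.fasttext_langid_model_path}",
        f"fasttext_quality_model_path={args.fasttext_quality_model_path}",
        *args.overrides,
    ]
    cfg = load_hydra_yaml(args.yaml_config, overrides)  # composes a DictConfig

    # Stages per the pipeline YAML: ParquetReader -> ScoreFilter(FastTextLangId)
    # -> ScoreFilter(FastTextQualityFilter) -> JsonlWriter
    pipeline = create_pipeline_from_yaml(cfg)

    start = time.perf_counter()
    try:
        results = pipeline.run(executor)
        # Per the flowchart, per-stage timings are collected from _stage_perf.
        metrics = {"success": True, "stage_perf": getattr(results, "_stage_perf", None)}
    except Exception:
        metrics = {"success": False}
    metrics["elapsed_s"] = time.perf_counter() - start

    write_benchmark_results(metrics, args.benchmark_results_path)
    return 0 if metrics["success"] else 1
```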
- Add fasttext_filter_benchmark.py script following the pattern from score_filter_benchmark.py
- Add fasttext_filter_raydata and fasttext_filter_xenna entries to nightly-benchmark.yaml
- Supports FastText language ID and quality filters with model setup requirements

Fixes NVIDIA-NeMo#1411
Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
Force-pushed from 2b52542 to c2ba0da
…onfig (NVIDIA-NeMo#1411)

- Add separate dataset entries for FastText langid and quality models
- Pass FastText model paths as explicit CLI arguments to benchmarks
- Remove hardcoded model paths from Hydra overrides
- Update FastText filter benchmarks to use model_weights_path
- Align arxiv E2E benchmark arg naming with FastText langid usage

Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
Force-pushed from c2ba0da to de0cec9
benchmarking/nightly-benchmark.yaml (Outdated)
```yaml
- name: fasttext_filter_raydata
  enabled: true
  script: fasttext_filter_benchmark.py
  args: >-
    --benchmark-results-path={session_entry_dir}
    --output-path={session_entry_dir}/scratch/output
    --executor=ray_data
    --input-path={dataset:tinystories,parquet}
    --yaml-config={curator_repo_dir}/nemo_curator/config/text/fasttext_filter_pipeline.yaml
    --fasttext-langid-model-path={dataset:fasttext_langid_model,bin}
    --fasttext-quality-model-path={dataset:fasttext_quality_model,bin}
    --overrides="stages.0._target_=nemo_curator.stages.text.io.reader.ParquetReader"
  timeout_s: 400
  sink_data:
    - name: slack
      additional_metrics:
        - num_kept_documents
        - throughput_docs_per_sec
  ray:
    num_cpus: 64
    num_gpus: 0
    enable_object_spilling: false

- name: fasttext_filter_xenna
  enabled: true
  script: fasttext_filter_benchmark.py
  args: >-
    --benchmark-results-path={session_entry_dir}
    --output-path={session_entry_dir}/scratch/output
    --executor=xenna
    --input-path={dataset:tinystories,parquet}
    --yaml-config={curator_repo_dir}/nemo_curator/config/text/fasttext_filter_pipeline.yaml
    --fasttext-langid-model-path={dataset:fasttext_langid_model,bin}
    --fasttext-quality-model-path={dataset:fasttext_quality_model,bin}
    --overrides="stages.0._target_=nemo_curator.stages.text.io.reader.ParquetReader"
```
[P1] New FastText benchmark entries lack requirements, so regressions won’t be caught by nightly.
Most existing entries define a requirements: section to enforce throughput and/or data-integrity expectations. fasttext_filter_raydata and fasttext_filter_xenna currently only report metrics to Slack, so they’ll run but won’t fail the nightly job on major performance or correctness changes. If baseline metrics are known (or can be captured), adding minimal requirements (e.g., exact num_documents_processed and a conservative min throughput) would make these benchmarks actionable.
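For illustration, a minimal `requirements:` block in the spirit of this suggestion might look like the sketch below. The key names mirror the metrics mentioned above, but the schema is an assumption about the benchmarking framework, and the values are placeholders to be filled in from captured baselines:

```yaml
# Hypothetical sketch: key names and schema are assumptions, and the
# values are placeholders, not measured baselines.
requirements:
  num_documents_processed: <exact baseline document count>
  min_throughput_docs_per_sec: <conservative fraction of baseline throughput>
```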
sarahyurick left a comment:
Added a few minor requests. Testing right now with:
```yaml
results_path: /path/to/fasttext-results
datasets_path: /path/to/datasets
model_weights_path: /path/to/model_weights
datasets:
  - name: "tinystories"
    formats:
      - type: "parquet"
        path: "{datasets_path}/tinystories_train_parquet"
  - name: "fasttext_langid_model"
    formats:
      - type: "bin"
        path: "{model_weights_path}/fasttext/lid.176.bin"
      - type: "ftz"
        path: "{model_weights_path}/fasttext/lid.176.ftz"
  - name: "fasttext_quality_model"
    formats:
      - type: "bin"
        path: "{model_weights_path}/fasttext/model.bin"
default_timeout_s: 7200
delete_scratch: true
entries:
  - name: fasttext_filter_raydata
    enabled: true
    script: fasttext_filter_benchmark.py
    args: >-
      --benchmark-results-path={session_entry_dir}
      --output-path={session_entry_dir}/scratch/output
      --executor=ray_data
      --input-path={dataset:tinystories,parquet}
      --yaml-config={curator_repo_dir}/nemo_curator/config/text/fasttext_filter_pipeline.yaml
      --fasttext-langid-model-path={dataset:fasttext_langid_model,bin}
      --fasttext-quality-model-path={dataset:fasttext_quality_model,bin}
      --overrides="stages.0._target_=nemo_curator.stages.text.io.reader.ParquetReader"
    timeout_s: 400
    sink_data:
      - name: slack
        additional_metrics:
          - num_kept_documents
          - throughput_docs_per_sec
    ray:
      num_cpus: 64
      num_gpus: 0
      enable_object_spilling: false
  - name: fasttext_filter_xenna
    enabled: true
    script: fasttext_filter_benchmark.py
    args: >-
      --benchmark-results-path={session_entry_dir}
      --output-path={session_entry_dir}/scratch/output
      --executor=xenna
      --input-path={dataset:tinystories,parquet}
      --yaml-config={curator_repo_dir}/nemo_curator/config/text/fasttext_filter_pipeline.yaml
      --fasttext-langid-model-path={dataset:fasttext_langid_model,bin}
      --fasttext-quality-model-path={dataset:fasttext_quality_model,bin}
      --overrides="stages.0._target_=nemo_curator.stages.text.io.reader.ParquetReader"
    timeout_s: 600
    sink_data:
      - name: slack
        additional_metrics:
          - num_kept_documents
          - throughput_docs_per_sec
    ray:
      num_cpus: 64
      num_gpus: 0
      enable_object_spilling: false
```
…htly-benchmark.yaml basis Sarah Yurick's test run (NVIDIA-NeMo#1411)
Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
Force-pushed from d2b1f3a to 5a0a25a
…ly-benchmark.yaml basis Sarah Yurick's test run (NVIDIA-NeMo#1411)
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>

…el.bin in benchmarking/nightly-benchmark.yaml (NVIDIA-NeMo#1411)
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>

…nchmarking/nightly-benchmark.yaml (NVIDIA-NeMo#1411)
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>

… after ScoreFilter benchmarks in benchmarking/nightly-benchmark.yaml (NVIDIA-NeMo#1411)
Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
Signed-off-by: Kunal Sachdev <kunalmgsachdev@gmail.com>
Description
This PR adds a benchmarking script for FastText-based document filters (language ID and quality) to the NeMo Curator benchmarking framework. The implementation follows the same pattern as the existing `score_filter_benchmark.py` script.

Changes:

- Add `fasttext_filter_benchmark.py` script that benchmarks FastText filters using a Hydra-configured pipeline
- Add `fasttext_filter_raydata` and `fasttext_filter_xenna` entries to `nightly-benchmark.yaml` for both executors

The script handles FastText filters that require `setup()` for model loading, which differentiates them from heuristic filters; a sketch of that pattern follows below.

Related Issue: #1411
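To make the `setup()` distinction concrete, the sketch below shows the general shape of a filter that defers model loading to `setup()`. The class and method names are illustrative, not Curator's actual `FastTextLangId` implementation; only the `fasttext` library calls (`load_model`, `predict`) are the real upstream API.

```python
import fasttext  # pip install fasttext


class FastTextLangIdSketch:
    """Illustrative only: defers the heavyweight model load to setup(),
    so each worker loads the .bin once instead of at construction time."""

    def __init__(self, model_path: str, target_lang: str = "en"):
        self.model_path = model_path
        self.target_lang = target_lang
        self.model = None  # populated by setup() on the worker

    def setup(self) -> None:
        # Called once per worker before any documents are scored.
        self.model = fasttext.load_model(self.model_path)

    def score_document(self, text: str) -> float:
        # fastText predicts on single lines, so flatten newlines first.
        labels, probs = self.model.predict(text.replace("\n", " "), k=1)
        # Labels look like "__label__en"; return confidence only when the
        # top label matches the target language.
        if labels[0] == f"__label__{self.target_lang}":
            return float(probs[0])
        return 0.0
```

Heuristic filters skip `setup()` entirely because they carry no model state, which is why the benchmark script has to handle this extra step.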
Questions for Discussion
I have a couple of questions posted in the issue comments (#1411) regarding:
- Whether to add `requirements` sections for the FastText benchmarks now, or in a follow-up PR after establishing baseline metrics
- Whether the model paths (`{datasets_path}/models/fasttext/lid.176.bin` and `{datasets_path}/models/fasttext/quality.bin`) are acceptable

Usage
The benchmark can be run via the benchmarking framework, or directly:
```bash
python benchmarking/scripts/fasttext_filter_benchmark.py \
  --benchmark-results-path /path/to/results \
  --input-path /path/to/input \
  --yaml-config nemo_curator/config/text/fasttext_filter_pipeline.yaml \
  --executor ray_data \
  --overrides "fasttext_langid_model_path=/path/to/lid.176.bin, fasttext_quality_model_path=/path/to/quality.bin"
```