Added a datatrove based pipeline for filtering tokenized data using scores. #235
base: master
Conversation
BlueCrescent
commented
Jul 25, 2025
- Included an example configuration file.
- Added datatrove and pydantic-settings to requirements.
- Note that modalities is also required for the pipeline to work, but it is not included in the requirements file.
Pull Request Overview
This PR implements a data filtering pipeline using datatrove for filtering tokenized data based on scores. The pipeline processes JSONL files containing scores for data samples and filters corresponding tokenized datasets based on configurable thresholds.
- Adds a complete datatrove-based filtering pipeline with score parsing and data filtering components
- Introduces configuration management using pydantic-settings for both local and Slurm execution environments
- Updates dependencies to include datatrove and pydantic-settings
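The filtering idea described in the overview can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the PR's `ScoresParser` or `DataFiltering` implementation: the function names, the score keys (`quality`, `fluency`), the "keep if score >= threshold" rule, and the one-score-line-per-sample alignment are all assumptions made for the example.

```python
import json
from typing import Iterable, Iterator, Sequence


def parse_scores(jsonl_lines: Iterable[str]) -> Iterator[dict]:
    """Yield one score dict per non-empty JSONL line (hypothetical helper)."""
    for line in jsonl_lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def keep_indices(scores: Iterable[dict], thresholds: dict) -> list:
    """Return positions of samples whose scores meet every threshold.

    Assumption of this sketch: a sample that lacks a thresholded score is dropped.
    """
    kept = []
    for index, sample_scores in enumerate(scores):
        passes = all(
            name in sample_scores and sample_scores[name] >= minimum
            for name, minimum in thresholds.items()
        )
        if passes:
            kept.append(index)
    return kept


def filter_tokenized(samples: Sequence, kept: Iterable[int]) -> list:
    """Select the tokenized samples at the kept positions, preserving order."""
    return [samples[i] for i in kept]


if __name__ == "__main__":
    score_lines = [
        '{"quality": 0.9, "fluency": 0.7}',
        '{"quality": 0.2, "fluency": 0.9}',
        '{"quality": 0.8}',
    ]
    tokenized = [[1, 2, 3], [4, 5], [6]]
    kept = keep_indices(parse_scores(score_lines), {"quality": 0.5})
    print(filter_tokenized(tokenized, kept))
```

Separating score parsing from the selection step mirrors the two-step split named in the file summary, but how the actual pipeline pairs score files with tokenized data is not shown on this page.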
Reviewed Changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/ml_filter/data_processing/score_based_filtering/step_score_parsing.py | Implements ScoresParser class for reading JSONL score files and mapping to tokenized data |
| src/ml_filter/data_processing/score_based_filtering/step_data_filtering.py | Implements DataFiltering class for filtering datasets based on score thresholds |
| src/ml_filter/data_processing/score_based_filtering/filter_pipeline.py | Main pipeline orchestration with configuration management and execution settings |
| pyproject.toml | Adds datatrove and pydantic-settings dependencies |
| configs/data_processing/example_filter_pipeline_config.yaml | Example configuration file for the filtering pipeline |
Comments suppressed due to low confidence (1)
src/ml_filter/data_processing/score_based_filtering/filter_pipeline.py:241
- [nitpick] The error message could be more helpful by providing an example of how to use the FilterPipelineBuilder class directly or where to find documentation.
"and use the FilterPipelineBuilder class directly."
src/ml_filter/data_processing/score_based_filtering/step_data_filtering.py
src/ml_filter/data_processing/score_based_filtering/step_data_filtering.py
src/ml_filter/data_processing/score_based_filtering/step_score_parsing.py
Outdated
src/ml_filter/data_processing/score_based_filtering/step_score_parsing.py
    tasks: int = 1
    time: str = "00:15:00"
    partition: str = "default"
    account: str | None = None  # FIXME is this supported?
Copilot
AI
Jul 25, 2025
The FIXME comment indicates uncertainty about whether the 'account' parameter is supported. This should be resolved or documented properly rather than left as a FIXME in production code.
Suggested change:
    - account: str | None = None  # FIXME is this supported?
    + account: str | None = None  # The Slurm account to charge for the job. Optional.
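One way to make the optional `account` field safe regardless of how the FIXME is resolved is to forward it only when it is set. The sketch below uses a plain dataclass as a stand-in for the PR's pydantic-settings class. The class name `SlurmSettingsSketch` and the method `sbatch_options` are hypothetical, and whether the executor used in the PR accepts such options is not confirmed here. The `--time`, `--partition`, and `--account` flags themselves are standard `sbatch` options. The `tasks` field is kept for parity with the snippet above but is not mapped to a flag, since how the pipeline distributes tasks is not shown.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SlurmSettingsSketch:
    """Hypothetical stand-in for the Slurm settings under review."""

    tasks: int = 1
    time: str = "00:15:00"
    partition: str = "default"
    account: str | None = None  # Slurm account to charge; omitted when None

    def sbatch_options(self) -> list[str]:
        """Build sbatch command-line options, skipping unset optional fields."""
        options = [f"--time={self.time}", f"--partition={self.partition}"]
        if self.account is not None:
            options.append(f"--account={self.account}")
        return options


if __name__ == "__main__":
    print(SlurmSettingsSketch().sbatch_options())
    print(SlurmSettingsSketch(account="my_project").sbatch_options())
```

With this shape, leaving `account` unset produces no `--account` option at all, so clusters that do not use accounting are unaffected.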
src/ml_filter/data_processing/score_based_filtering/filter_pipeline.py
Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…g pipeline and adapted the codebase for new changes from main
… execution settings
…dle duplicates in score parsing