Added a datatrove based pipeline for filtering tokenized data using scores. #235

BlueCrescent · 2025-07-25T08:39:59Z

Included an example configuration file.
Added datatrove and pydantic-settings to requirements.
Note that modalities is also required for the pipeline to work, but it is not included in the requirements file.


Copilot

Pull Request Overview

This PR implements a data filtering pipeline using datatrove for filtering tokenized data based on scores. The pipeline processes JSONL files containing scores for data samples and filters corresponding tokenized datasets based on configurable thresholds.

Adds a complete datatrove-based filtering pipeline with score parsing and data filtering components
Introduces configuration management using pydantic-settings for both local and Slurm execution environments
Updates dependencies to include datatrove and pydantic-settings
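
The filtering idea summarized above can be sketched in a few lines. This is a minimal illustration, not the PR's implementation: the JSONL field name `document_id`, the score key `score_quality`, and the helpers `parse_scores` and `filter_ids` are all assumptions. The actual pipeline operates on tokenized datasets through datatrove rather than on in-memory lists. Keeping the first entry for a duplicated id is likewise an assumed policy.

```python
import json
from typing import Iterable


def parse_scores(jsonl_lines: Iterable[str]) -> dict[str, dict[str, float]]:
    """Map each document id to its scores, preserving first-seen order.

    Assumes one JSON object per line with a hypothetical "document_id" key;
    every other key is treated as a named score.
    """
    scores: dict[str, dict[str, float]] = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        doc_id = record.pop("document_id")
        if doc_id in scores:
            continue  # assumed policy: keep the first entry for a duplicate id
        scores[doc_id] = {name: float(value) for name, value in record.items()}
    return scores


def filter_ids(
    scores: dict[str, dict[str, float]], thresholds: dict[str, float]
) -> list[str]:
    """Return ids whose scores meet every threshold; a missing score fails."""
    return [
        doc_id
        for doc_id, doc_scores in scores.items()
        if all(
            name in doc_scores and doc_scores[name] >= minimum
            for name, minimum in thresholds.items()
        )
    ]
```

In this sketch a document is kept only if it meets every configured threshold, which is one plausible reading of "configurable thresholds"; the PR may combine scores differently.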

Reviewed Changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 6 comments.


| File | Description |
| --- | --- |
| src/ml_filter/data_processing/score_based_filtering/step_score_parsing.py | Implements ScoresParser class for reading JSONL score files and mapping to tokenized data |
| src/ml_filter/data_processing/score_based_filtering/step_data_filtering.py | Implements DataFiltering class for filtering datasets based on score thresholds |
| src/ml_filter/data_processing/score_based_filtering/filter_pipeline.py | Main pipeline orchestration with configuration management and execution settings |
| pyproject.toml | Adds datatrove and pydantic-settings dependencies |
| configs/data_processing/example_filter_pipeline_config.yaml | Example configuration file for the filtering pipeline |

Comments suppressed due to low confidence (1)

src/ml_filter/data_processing/score_based_filtering/filter_pipeline.py:241

[nitpick] The error message could be more helpful by providing an example of how to use the FilterPipelineBuilder class directly or where to find documentation.

            "and use the FilterPipelineBuilder class directly."

src/ml_filter/data_processing/score_based_filtering/step_data_filtering.py

src/ml_filter/data_processing/score_based_filtering/step_score_parsing.py

Copilot · 2025-07-25T08:41:04Z

src/ml_filter/data_processing/score_based_filtering/filter_pipeline.py

+    tasks: int = 1
+    time: str = "00:15:00"
+    partition: str = "default"
+    account: str | None = None  # FIXME is this supported?


The FIXME comment indicates uncertainty about whether the 'account' parameter is supported. This should be resolved or documented properly rather than left as a FIXME in production code.

Suggested change

-    account: str | None = None  # FIXME is this supported?
+    account: str | None = None  # The Slurm account to charge for the job. Optional.
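
For context on the `account` field: Slurm's `sbatch` itself accepts `--account`, `--partition`, `--time`, and `--ntasks`. Whether the executor used in this PR forwards an account is exactly what the FIXME asks, and the sketch below does not answer that. It only shows, under stated assumptions, how settings like the ones above could be mapped to `sbatch` flags while omitting `--account` when it is `None`. `SlurmSettings` and `to_sbatch_args` are hypothetical names, and a plain dataclass stands in for the pydantic-settings class.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SlurmSettings:
    """Hypothetical stand-in mirroring the fields shown in the diff."""

    tasks: int = 1
    time: str = "00:15:00"
    partition: str = "default"
    account: str | None = None  # Optional Slurm account to charge for the job


def to_sbatch_args(settings: SlurmSettings) -> list[str]:
    """Build sbatch command-line flags; --account is added only when set."""
    args = [
        f"--ntasks={settings.tasks}",
        f"--time={settings.time}",
        f"--partition={settings.partition}",
    ]
    if settings.account is not None:
        args.append(f"--account={settings.account}")
    return args
```

Leaving the flag out entirely when `account` is `None` lets the cluster's default account apply, which matches the "Optional" wording in the suggested comment.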

BlueCrescent requested a review from Copilot July 25, 2025 08:39

Copilot AI reviewed Jul 25, 2025

BlueCrescent and others added 7 commits July 25, 2025 10:43

chore(filtering): More robust doc id parsing.

81aafa8

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

fix(filtering): Removed duplicate file exists check.

b1d1a46

fix(filtering): fixed docstring

af89182

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Merge branch 'master' into filtering_pipeline

12fbc95

refactor: removed reliance on file hashes in the score-based filtering pipeline and adapted the codebase for new changes from main

22dddeb

test: add comprehensive tests for score-based filtering pipeline functionality

e2d02f2

chore: remove hardcoded YAML file path from main execution block

936462a

ajude2s requested a review from AbasKhan October 29, 2025 20:47

feat: add Slurm configuration files for filtering pipeline and update execution settings

6bb08f7

ajude2s self-assigned this Nov 2, 2025

ajude2s added 2 commits November 4, 2025 12:56

refactor: clean up imports and remove unused code in test_filter_pipeline.py

3a5c21e

fix: enhance ScoresParser to preserve original document order and handle duplicates in score parsing

a0698c2
