
Conversation

@BlueCrescent (Collaborator)


Implemented a datatrove pipeline for filtering tokenized data using scores.

- Included an example configuration file.
- Added datatrove and pydantic-settings to requirements.
- Note that modalities is also required for the pipeline to work, but it is not included in the requirements file.
BlueCrescent requested a review from Copilot on July 25, 2025, 08:39

Copilot AI left a comment


Pull Request Overview

This PR implements a datatrove-based pipeline that filters tokenized data by score. It reads JSONL files containing per-sample scores and filters the corresponding tokenized datasets using configurable thresholds.

  • Adds a complete datatrove-based filtering pipeline with score parsing and data filtering components
  • Introduces configuration management using pydantic-settings for both local and Slurm execution environments
  • Updates dependencies to include datatrove and pydantic-settings
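
To make the overview concrete, here is a minimal sketch of score-threshold filtering with datatrove, assuming a local run and a per-document "score" field in the JSONL metadata. The paths, threshold, and the use of LambdaFilter are illustrative assumptions for this sketch, not the ScoresParser/DataFiltering classes this PR actually adds (those are listed below).

```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers import JsonlWriter

SCORE_THRESHOLD = 0.5  # illustrative value; the real threshold is configurable

pipeline = [
    # Read JSONL samples; the "score" metadata field is an assumption here.
    JsonlReader("data/scored_samples/"),
    # Keep only documents whose score meets the threshold.
    LambdaFilter(lambda doc: doc.metadata.get("score", 0.0) >= SCORE_THRESHOLD),
    # Write surviving documents back out as JSONL.
    JsonlWriter("data/filtered_samples/"),
]

LocalPipelineExecutor(pipeline=pipeline, tasks=4).run()
```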

Reviewed Changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 6 comments.

Summary of changes per file:

  • src/ml_filter/data_processing/score_based_filtering/step_score_parsing.py: Implements the ScoresParser class for reading JSONL score files and mapping them to tokenized data
  • src/ml_filter/data_processing/score_based_filtering/step_data_filtering.py: Implements the DataFiltering class for filtering datasets based on score thresholds
  • src/ml_filter/data_processing/score_based_filtering/filter_pipeline.py: Main pipeline orchestration with configuration management and execution settings
  • pyproject.toml: Adds the datatrove and pydantic-settings dependencies
  • configs/data_processing/example_filter_pipeline_config.yaml: Example configuration file for the filtering pipeline
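
The example YAML itself is not reproduced in this thread, so the following is only a rough illustration of how pydantic-settings can model such a configuration; every class and field name below is hypothetical and not taken from filter_pipeline.py.

```python
from pathlib import Path

from pydantic import BaseModel
from pydantic_settings import BaseSettings


class SlurmSettings(BaseModel):
    # Mirrors the kind of fields discussed in the review comment below.
    tasks: int = 1
    time: str = "00:15:00"
    partition: str = "default"
    account: str | None = None


class FilterPipelineSettings(BaseSettings):
    scores_dir: Path          # JSONL files with per-sample scores
    tokenized_data_dir: Path  # tokenized dataset to be filtered
    output_dir: Path
    score_threshold: float = 0.5
    slurm: SlurmSettings | None = None  # None means run locally


# Values can come from environment variables or a parsed YAML mapping, e.g.:
#   import yaml
#   settings = FilterPipelineSettings(**yaml.safe_load(open("config.yaml")))
```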
Comments suppressed due to low confidence (1)

src/ml_filter/data_processing/score_based_filtering/filter_pipeline.py:241

  • [nitpick] The error message could be more helpful by providing an example of how to use the FilterPipelineBuilder class directly or where to find documentation.
            "and use the FilterPipelineBuilder class directly."

tasks: int = 1
time: str = "00:15:00"
partition: str = "default"
account: str | None = None # FIXME is this supported?

Copilot AI Jul 25, 2025


The FIXME comment indicates uncertainty about whether the 'account' parameter is supported. This should be resolved or documented properly rather than left as a FIXME in production code.

Suggested change
account: str | None = None # FIXME is this supported?
account: str | None = None # The Slurm account to charge for the job. Optional.
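
For what it's worth on the FIXME itself: sbatch does accept an --account/-A option, so the field is meaningful at the scheduler level; whether datatrove's SlurmPipelineExecutor exposes it as a direct keyword argument depends on the installed version. One hedged way to forward it regardless, assuming the executor's sbatch_args dict of extra sbatch options and a settings object like the one sketched earlier:

```python
from datatrove.executor.slurm import SlurmPipelineExecutor


def build_slurm_executor(pipeline, slurm_cfg):
    # slurm_cfg is assumed to carry the tasks/time/partition/account fields
    # shown in the diff above (hypothetical settings object).
    extra_sbatch = {}
    if slurm_cfg.account is not None:
        # sbatch understands --account=<name>; passing it as an extra
        # argument avoids depending on a dedicated executor parameter.
        extra_sbatch["account"] = slurm_cfg.account

    return SlurmPipelineExecutor(
        pipeline=pipeline,
        tasks=slurm_cfg.tasks,
        time=slurm_cfg.time,
        partition=slurm_cfg.partition,
        sbatch_args=extra_sbatch,
        logging_dir="logs/filter_pipeline",
    )
```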

ajude2s requested a review from AbasKhan on October 29, 2025, 20:47
ajude2s self-assigned this on Nov 2, 2025
