restructure input_pipeline #3124

Merged
copybara-service[bot] merged 1 commit into main from aireen/input_restructure2 on Feb 14, 2026

Collaborator

@aireenmei aireenmei commented Feb 12, 2026

Description

Retry the restructure from #3050, which was rolled back due to internal breakage.
Restructure the input pipeline folder as follows, under src/maxtext/input_pipeline:

- packing
-- prefill_packing.py
-- sequence_packing.py
- tokenizer.py
- multihost_dataloading.py
- distillation_data_processing.py (prev _distillation_data_processing.py)
- grain_data_processing.py (prev _grain_data_processing.py)
- grain_tokenizer.py (prev _grain_tokenizer.py)
- hf_data_processing.py (prev _hf_data_processing.py)
- input_pipeline_utils.py (prev _input_pipeline_utils.py)
- tfds_data_processing.py (prev _tfds_data_processing.py)
- tfds_data_processing_c4_mlperf.py (prev _tfds_data_processing_c4_mlperf.py)
- input_pipeline_interface.py
- synthetic_data_processing.py
- instruction_data_processing.py
Make the corresponding changes to imports.
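
For code that still imports the old underscore-prefixed modules, the renames listed above can be applied mechanically. The snippet below is a minimal sketch of that idea, not the tooling used for this PR: the rename map is copied from the list above, while `rewrite_import_line` and the package name `pkg` are hypothetical names used only for illustration. It rewrites module basenames only and does not touch package paths.

```python
import re

# Old module name -> new module name, copied from the rename list in this description.
RENAMED_MODULES = {
    "_distillation_data_processing": "distillation_data_processing",
    "_grain_data_processing": "grain_data_processing",
    "_grain_tokenizer": "grain_tokenizer",
    "_hf_data_processing": "hf_data_processing",
    "_input_pipeline_utils": "input_pipeline_utils",
    "_tfds_data_processing": "tfds_data_processing",
    "_tfds_data_processing_c4_mlperf": "tfds_data_processing_c4_mlperf",
}

# Match whole identifiers only; longer names go first so that
# "_tfds_data_processing_c4_mlperf" is never cut short at "_tfds_data_processing".
_NAMES = sorted(RENAMED_MODULES, key=len, reverse=True)
_PATTERN = re.compile(r"\b(" + "|".join(re.escape(name) for name in _NAMES) + r")\b")


def rewrite_import_line(line: str) -> str:
  """Hypothetical helper: swap old module names for new ones in one source line."""
  return _PATTERN.sub(lambda match: RENAMED_MODULES[match.group(1)], line)


if __name__ == "__main__":
  # "pkg" is a placeholder package name, not a real import path.
  print(rewrite_import_line("from pkg.input_pipeline import _hf_data_processing"))
```

Matching on word boundaries keeps unrelated identifiers such as `my_grain_tokenizer` unchanged, so the helper only touches names that exactly match an old module.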

Tests

CI test

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov bot commented Feb 12, 2026

Codecov Report

❌ Patch coverage is 76.36364% with 13 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
...rc/maxtext/input_pipeline/grain_data_processing.py | 62.50% | 6 Missing ⚠️
src/maxtext/input_pipeline/hf_data_processing.py | 61.53% | 5 Missing ⚠️
src/MaxText/layers/engram.py | 0.00% | 1 Missing ⚠️
src/MaxText/rl/train_rl.py | 0.00% | 1 Missing ⚠️

aireenmei force-pushed the aireen/input_restructure2 branch 2 times, most recently from 81ee44b to 8690e4f, on February 12, 2026 19:44

🤖 Hi @aireenmei, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@github-actions github-actions bot left a comment

📋 Review Summary

This pull request is a large-scale refactoring of the input_pipeline module, and the changes look solid. The file moves, renames, and import updates are consistent and well-executed.

🔍 General Feedback

  • The restructuring of the input_pipeline into its own maxtext subpackage is a great improvement for code organization.
  • Renaming modules to remove the leading underscore (e.g., _hf_data_processing.py to hf_data_processing.py) improves clarity.
  • The minor code cleanups, like combining imports, are also appreciated.

Overall, this is a good refactoring that improves the structure of the codebase.

Collaborator

@bvandermoon bvandermoon left a comment

LGTM, thanks Aireen

@copybara-service copybara-service bot merged commit 1b0c210 into main Feb 14, 2026
103 checks passed
@copybara-service copybara-service bot deleted the aireen/input_restructure2 branch February 14, 2026 02:31

4 participants