Support TFrecord in Grain pipeline #3430

Merged
copybara-service[bot] merged 1 commit into main from aireen/grain_tfrecord on Mar 20, 2026

Conversation


aireenmei (Collaborator) commented Mar 16, 2026

Description

  • Support grain_file_type=tfrecord in Grain pipeline
  • Add unit test GrainTFRecordProcessingTest
  • Refactor Grain unit test for cleaner code
  • Add docs; the formatting changes in data_input_pipeline.md were made by the pre-commit mdformat hook
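For context on what the new grain_file_type=tfrecord path has to parse: TFRecord is a simple length-prefixed framing. The sketch below is illustrative only, not the implementation in this PR, and it skips CRC validation (the writer emits placeholder CRC bytes, so its output would fail a strict checksum-verifying reader).

```python
import struct
from io import BytesIO

def write_tfrecord(stream, data: bytes) -> None:
    """Write one TFRecord frame with placeholder CRCs (illustration only)."""
    stream.write(struct.pack("<Q", len(data)))  # uint64 payload length
    stream.write(b"\x00" * 4)                   # masked crc32c(length), faked here
    stream.write(data)
    stream.write(b"\x00" * 4)                   # masked crc32c(data), faked here

def read_tfrecords(stream):
    """Yield raw payloads from a TFRecord stream; CRC fields are skipped, not checked."""
    while True:
        header = stream.read(12)                # 8-byte length + 4-byte length CRC
        if len(header) < 12:
            return
        (length,) = struct.unpack("<Q", header[:8])
        payload = stream.read(length)
        stream.read(4)                          # skip data CRC
        yield payload

buf = BytesIO()
for rec in (b"hello", b"world"):
    write_tfrecord(buf, rec)
buf.seek(0)
records = list(read_tfrecords(buf))
```

A real pipeline would additionally decode each payload as a serialized tf.train.Example protobuf before tokenization.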

Tests

  • Unit test
  • Test run with the following command:
python3 -m MaxText.train maxtext/configs/base.yml \
  run_name=${RUN_NAME} \
  base_output_directory=${GCS_BUCKET} \
  dataset_type=grain \
  grain_file_type=tfrecord \
  grain_train_files=gs://maxtext-dataset/c4/en/3.0.1/c4-train.tfrecord-* \
  grain_worker_count=1 \
  tokenizer_type=huggingface \
  tokenizer_path=google-t5/t5-large \
  add_bos=false \
  num_epoch=1 \
  steps=10 \
  enable_checkpointing=false

log:

INFO:absl:completed step: 0, seconds: 32.063, TFLOP/s/device: 5.085, Tokens/s/device: 766.485, total_weights: 97543, loss: 10.871
INFO:absl:To see full metrics 'tensorboard --logdir=gs://aireenmei-multipod/test/grain/grain-2026-03-16-20-30-35/tensorboard/'
INFO:absl:completed step: 1, seconds: 0.432, TFLOP/s/device: 377.451, Tokens/s/device: 56893.498, total_weights: 96426, loss: 10.866
INFO:absl:completed step: 2, seconds: 1.056, TFLOP/s/device: 154.471, Tokens/s/device: 23283.531, total_weights: 96824, loss: 9.716
INFO:absl:completed step: 3, seconds: 0.927, TFLOP/s/device: 175.932, Tokens/s/device: 26518.364, total_weights: 94917, loss: 9.183
INFO:absl:completed step: 4, seconds: 1.202, TFLOP/s/device: 135.602, Tokens/s/device: 20439.394, total_weights: 96144, loss: 8.930
INFO:absl:completed step: 5, seconds: 1.202, TFLOP/s/device: 135.628, Tokens/s/device: 20443.389, total_weights: 96021, loss: 8.782
INFO:absl:completed step: 6, seconds: 1.203, TFLOP/s/device: 135.565, Tokens/s/device: 20433.888, total_weights: 96877, loss: 8.595
INFO:absl:completed step: 7, seconds: 1.203, TFLOP/s/device: 135.544, Tokens/s/device: 20430.609, total_weights: 96736, loss: 8.476
INFO:absl:completed step: 8, seconds: 1.202, TFLOP/s/device: 135.641, Tokens/s/device: 20445.226, total_weights: 95375, loss: 8.465
INFO:absl:completed step: 9, seconds: 1.203, TFLOP/s/device: 135.553, Tokens/s/device: 20432.002, total_weights: 96435, loss: 8.375

In comparison, using the same TFRecord files in the tfds pipeline is much slower during the first several steps:

INFO:absl:completed step: 0, seconds: 27.974, TFLOP/s/device: 5.828, Tokens/s/device: 878.532, total_weights: 81974, loss: 10.873
INFO:absl:To see full metrics 'tensorboard --logdir=gs://aireenmei-multipod/test/tfds/tfds-2026-03-17-23-33-04/tensorboard/'
INFO:absl:completed step: 1, seconds: 6.055, TFLOP/s/device: 26.929, Tokens/s/device: 4059.096, total_weights: 79833, loss: 10.882
INFO:absl:completed step: 2, seconds: 7.814, TFLOP/s/device: 20.867, Tokens/s/device: 3145.257, total_weights: 82948, loss: 9.759
INFO:absl:completed step: 3, seconds: 5.320, TFLOP/s/device: 30.645, Tokens/s/device: 4619.175, total_weights: 75398, loss: 9.151
INFO:absl:completed step: 4, seconds: 5.777, TFLOP/s/device: 28.223, Tokens/s/device: 4254.122, total_weights: 81795, loss: 9.026
INFO:absl:completed step: 5, seconds: 6.034, TFLOP/s/device: 27.020, Tokens/s/device: 4072.730, total_weights: 83248, loss: 8.848
INFO:absl:completed step: 6, seconds: 2.280, TFLOP/s/device: 71.522, Tokens/s/device: 10780.621, total_weights: 79633, loss: 8.697
INFO:absl:completed step: 7, seconds: 0.010, TFLOP/s/device: 16720.905, Tokens/s/device: 2520356.886, total_weights: 79027, loss: 8.530
INFO:absl:completed step: 8, seconds: 1.202, TFLOP/s/device: 135.698, Tokens/s/device: 20453.887, total_weights: 79079, loss: 8.477
INFO:absl:completed step: 9, seconds: 1.202, TFLOP/s/device: 135.611, Tokens/s/device: 20440.771, total_weights: 78018, loss: 8.433
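Summing the per-step times from the two logs quantifies the warm-up gap. This is a quick back-of-the-envelope check on the numbers above, not part of the PR:

```python
# Per-step "seconds" values copied from the two logs above (steps 0-9).
grain_secs = [32.063, 0.432, 1.056, 0.927, 1.202, 1.202, 1.203, 1.203, 1.202, 1.203]
tfds_secs = [27.974, 6.055, 7.814, 5.320, 5.777, 6.034, 2.280, 0.010, 1.202, 1.202]

grain_total = sum(grain_secs)  # ~41.7 s for the first 10 steps
tfds_total = sum(tfds_secs)    # ~63.7 s for the first 10 steps
```

By step 8 both pipelines settle at ~1.2 s/step, so the entire gap comes from the tfds pipeline's warm-up (the 0.010 s at step 7 is likely a prefetch artifact).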

Checklist

Before submitting this PR, please make sure (put X in square brackets):

- [x] I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
- [x] I have necessary comments in my code, particularly in hard-to-understand areas.
- [x] I have run end-to-end tests and provided workload links above if applicable.
- [x] I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.


codecov Bot commented Mar 16, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 12 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/maxtext/input_pipeline/input_pipeline_utils.py | 55.55% | 8 Missing ⚠️ |
| ...rc/maxtext/input_pipeline/grain_data_processing.py | 77.77% | 2 Missing and 2 partials ⚠️ |


Comment thread on src/maxtext/input_pipeline/grain_data_processing.py (outdated)
aireenmei force-pushed the aireen/grain_tfrecord branch from 5e0ff74 to 52f8364 on March 19, 2026 02:26
aireenmei force-pushed the aireen/grain_tfrecord branch from 52f8364 to 9e36543 on March 19, 2026 19:28
copybara-service Bot pushed a commit to google/grain that referenced this pull request Mar 20, 2026
Imported from GitHub PR AI-Hypercomputer/maxtext#3430

[Commit message body: verbatim duplicate of the PR description, tests, and checklist above.]

Copybara import of the project:

--
9e365436e30735554772a91ac1e3c2311544b757 by aireenmei <aireenmei@gmail.com>:

Support TFrecord in Grain pipeline

Merging this change closes #3430

PiperOrigin-RevId: 886960115
copybara-service Bot merged commit 41a4e9d into main on Mar 20, 2026
26 checks passed
copybara-service Bot deleted the aireen/grain_tfrecord branch on March 20, 2026 20:53