Small improvements to naming / structure in input_pipeline_interface.py #1845

Open: 1 commit to be merged into base branch main

Conversation

Collaborator

@NuojCheng NuojCheng commented Jun 18, 2025

Description

TL;DR: Reorganize input_pipeline_interface.py for better readability.

The following changes are made:

  • The synthetic data iterator and the placeholder synthetic data iterator are moved to a separate file
  • BadSyntheticDataIterator is renamed to PlaceHolderDataIterator
  • The if-else branches at the end of the file are reorganized
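
To make the first two changes concrete, here is a minimal sketch of what a separate module holding both iterators could look like. It is not the PR's code: `ToyConfig`, its fields, the batch layout, and the use of plain Python lists are all assumptions made for illustration, and the role given to `PlaceHolderDataIterator` (zero-filled batches of the right shape) is likewise assumed.

```python
"""Illustrative stand-in for a separate synthetic-data module (hypothetical)."""
from dataclasses import dataclass


@dataclass
class ToyConfig:
  # Hypothetical config fields, chosen only for this sketch.
  global_batch_size: int = 2
  max_target_length: int = 4


class SyntheticDataIterator:
  """Yields reproducible fake batches so a training loop can run without a dataset."""

  def __init__(self, config):
    self.config = config
    self.step = 0

  def __iter__(self):
    return self

  def __next__(self):
    rows = self.config.global_batch_size
    cols = self.config.max_target_length
    # Every element of a batch equals the current step, so output is deterministic.
    inputs = [[self.step] * cols for _ in range(rows)]
    batch = {"inputs": inputs, "targets": [row[:] for row in inputs]}
    self.step += 1
    return batch


class PlaceHolderDataIterator:
  """Yields zero-filled batches of the expected shape (assumed role, see lead-in)."""

  def __init__(self, config):
    self.config = config

  def __iter__(self):
    return self

  def __next__(self):
    rows = self.config.global_batch_size
    cols = self.config.max_target_length
    zeros = [[0] * cols for _ in range(rows)]
    return {"inputs": zeros, "targets": [row[:] for row in zeros]}
```

Keeping both classes behind the same iterator protocol means the interface file only has to pick one of them, which is the readability gain the description is after.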

FIXES: b/421596013

Tests

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

@NuojCheng NuojCheng changed the title from "Chengnuojin/input pipeline" to "Refactor input pipeline with synthetic data iterator" on Jun 19, 2025
@NuojCheng NuojCheng force-pushed the chengnuojin/input_pipeline branch 4 times, most recently from 99e1569 to 14c16f1, on June 23, 2025 15:51
@NuojCheng NuojCheng changed the title from "Refactor input pipeline with synthetic data iterator" to "Small improvements to naming / structure in input_pipeline_interface.py" on Jun 23, 2025
@NuojCheng NuojCheng marked this pull request as ready for review on June 23, 2025 17:40
Collaborator

Maybe name this module synthetic_data_processing.py

@NuojCheng NuojCheng force-pushed the chengnuojin/input_pipeline branch from 8728f1a to de5ef9c on June 24, 2025 16:36
@NuojCheng NuojCheng requested a review from SurbhiJainUSC on June 24, 2025 18:17
@NuojCheng NuojCheng force-pushed the chengnuojin/input_pipeline branch from a4bc416 to dd92595 on June 24, 2025 20:50
@NuojCheng NuojCheng requested a review from SurbhiJainUSC on June 24, 2025 22:04
assert config.packing, "c4_mlperf dataloader only works with packing. For padded version, use tfds dataloader"
train_iterator, eval_iterator = dataset_type_to_train_eval_iterator[config.dataset_type]
else:
max_logging.log(f"WARNING: '{config.dataset_type}' is not a supported dataset type." \
Collaborator

When a user specifies dataset_type=synthetic, this WARNING message will be confusing. We can exclude that case.

Collaborator Author

When dataset_type=synthetic, this warning message should not be triggered, because the function exits at line 64.
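
The control flow described in this reply can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: the function name `create_data_iterator`, its parameters, and the `make_synthetic` factory are hypothetical; only `dataset_type_to_train_eval_iterator` and the first part of the warning text come from the quoted diff. What the real code does after logging the warning is cut off in the diff, so the synthetic fallback below is an assumption.

```python
def create_data_iterator(dataset_type, make_synthetic, dataset_type_to_train_eval_iterator, log=print):
  """Returns a (train_iterator, eval_iterator) pair for the given dataset type."""
  # Synthetic data is handled first and returns immediately,
  # so it can never reach the warning branch further down.
  if dataset_type == "synthetic":
    return make_synthetic(), None

  # Supported dataset types are looked up in a mapping instead of an if-else chain.
  if dataset_type in dataset_type_to_train_eval_iterator:
    train_iterator, eval_iterator = dataset_type_to_train_eval_iterator[dataset_type]
    return train_iterator, eval_iterator

  # Only genuinely unsupported types get here.
  log(f"WARNING: '{dataset_type}' is not a supported dataset type.")
  # Assumption: fall back to synthetic data (the quoted diff is truncated here).
  return make_synthetic(), None
```

With the early return in place, the reviewer's concern does not arise: the warning is logged only for dataset types that are neither synthetic nor present in the mapping.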

Collaborator

I see. Sorry I missed that. Thanks!

@aireenmei aireenmei left a comment

Thanks Nuojin!

@NuojCheng NuojCheng force-pushed the chengnuojin/input_pipeline branch from dd92595 to 58af99b on June 25, 2025 05:37
3 participants