Question about TFRecord data shuffle in DALI. #4996

Open
1 task done
DuoblaK opened this issue Aug 15, 2023 · 1 comment

Comments

DuoblaK commented Aug 15, 2023

Question about TFRecord data shuffle in DALI

Hello, I have a question about random_shuffle for TFRecord data in DALI.

For example, I have a dataset containing 8k images. When I convert it to TFRecord, it is split into 8 files, dataset.tfrecord-00000-of-00008, dataset.tfrecord-00001-of-00008, ..., dataset.tfrecord-00007-of-00008, each containing 1k images.
When I use fn.readers.tfrecord(random_shuffle=True), how is the shuffle implemented?
Situation 1: Each of the 8 files is shuffled on its own, i.e. 8 separate shuffles, each over that file's 1k images.
Situation 2: The 8 files are shuffled together, i.e. one shuffle over all 8k images.

I ask because when I train with DALI processing the data, I get a lower metric than when training without DALI. But if I set random_shuffle=False and make both pipelines load the data in the same order, their metrics are nearly the same. So I wonder whether DALI's TFRecord random_shuffle could be the reason for the lower metric.

Thanks for reading; an answer would help me a lot.

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report
DuoblaK added the question (Further information is requested) label Aug 15, 2023
Contributor

JanuszL commented Aug 16, 2023

Hi @DuoblaK,

Thank you for reaching out.
The general operation principle of the shuffling in DALI is described here.
DALI has a prefetch buffer, with a default length of 1024 samples (the initial_fill parameter), into which it reads samples from the dataset one by one; it then samples randomly from this buffer to form a batch.
If the dataset consists of standalone files (one file per sample), the list of all files is shuffled up front, so that all classes are evenly distributed over the dataset. For containers such as TFRecord, the data is expected to be mixed already, so that the first part contains all classes and not only those from the beginning of the dataset. I suspect that this expectation is not met in the described experiment.
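To make the effect concrete, here is a minimal pure-Python sketch of a shuffle buffer. It is a simplified model and not DALI's actual implementation: the helper names (buffered_shuffle, first_batch_classes), the batch size of 256, and the assumption that each of the 8 files holds exactly one class are all made up for illustration.

```python
import itertools
import random
from collections import Counter


def buffered_shuffle(samples, buffer_size, rng):
    """Simplified shuffle-buffer model (an assumption, not DALI's code).

    Samples are read sequentially into a buffer; once the buffer is full,
    every newly read sample causes one random buffered sample to be emitted.
    """
    buffer = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) >= buffer_size:
            pick = rng.randrange(len(buffer))
            # Swap the picked element to the end so pop() is O(1).
            buffer[pick], buffer[-1] = buffer[-1], buffer[pick]
            yield buffer.pop()
    # Input exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


# Hypothetical worst case: 8 files of 1000 samples, one class per file,
# read in file order. Only the class label of each sample is stored.
dataset = [file_idx for file_idx in range(8) for _ in range(1000)]


def first_batch_classes(buffer_size, batch_size=256, seed=0):
    """Count the classes that appear in the first batch."""
    rng = random.Random(seed)
    stream = buffered_shuffle(dataset, buffer_size, rng)
    return Counter(itertools.islice(stream, batch_size))


narrow = first_batch_classes(buffer_size=1024)
wide = first_batch_classes(buffer_size=len(dataset))

# With a 1024-sample buffer, the first 256 outputs can only come from the
# first 1024 + 255 samples read, i.e. from classes 0 and 1.
print("buffer=1024:", sorted(narrow))
print("buffer=8000:", sorted(wide))
```

In this model the shuffle is neither per-file nor global: it is a window that slides over the data in read order, so a class-sorted dataset still yields class-skewed batches unless the buffer covers the whole dataset.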
Please try to increase the initial_fill parameter value, or recreate the dataset making sure it is pre-shuffled.
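For the second option, a rough sketch of pre-shuffling before the shards are written could look like the following. The function preshuffle_into_shards and the (name, label) record layout are hypothetical, and the actual TFRecord serialization is left out; the point is only that the global shuffle happens once, before the records are distributed over the files.

```python
import random


def preshuffle_into_shards(records, num_shards, seed=0):
    """Shuffle all records globally, then deal them round-robin into shards."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Shard i receives every num_shards-th record, starting at offset i.
    return [shuffled[i::num_shards] for i in range(num_shards)]


# Hypothetical input: 8000 (name, label) pairs, sorted by label.
records = [(f"img_{i:04d}", i // 1000) for i in range(8000)]
shards = preshuffle_into_shards(records, num_shards=8)

for i, shard in enumerate(shards):
    labels = sorted({label for _, label in shard})
    print(f"shard {i}: {len(shard)} records, classes {labels}")
    # Each shard would then be serialized to its own TFRecord file.
```

With the data laid out this way, even a buffer much smaller than the dataset sees a mix of classes from the first samples onward.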
