Question about TFRecord data shuffle in DALI. #4996

DuoblaK · 2023-08-15T16:48:54Z


Hello, I have a question about random_shuffle for TFRecord data in DALI.

For example, I have a dataset containing 8k images. When I convert it to TFRecord, it is split into 8 files, dataset.tfrecord-00000-of-00008, dataset.tfrecord-00001-of-00008, ... dataset.tfrecord-00007-of-00008, each containing 1k images.
When I use fn.readers.tfrecord(random_shuffle=True), how is the shuffle implemented?
Situation 1: Each of the 8 files is shuffled on its own, i.e. 8 separate shuffles, each over that file's 1k images.
Situation 2: The 8 files are shuffled together, i.e. one shuffle over all 8k images.

The reason I ask is that training with DALI processing the data gives a lower metric than training without DALI. But if I set random_shuffle=False and make both pipelines load the data in the same order, their metrics are nearly the same. So I wonder whether random_shuffle in DALI's TFRecord reader may be the cause of the lower metric?

Thanks for reading; an answer would help me a lot.

Check for duplicates

I have searched the open bugs/issues and have found no duplicates for this bug report


JanuszL · 2023-08-16T07:58:46Z

Hi @DuoblaK,

Thank you for reaching out.
The general operation principle of the shuffling in DALI is described here.
DALI has a prefetch buffer, with a default length of 1024 samples (the initial_fill parameter), into which it reads samples from the dataset one by one; it then randomly samples from this buffer to form a batch.
If the dataset consists of standalone files (one file per sample), the list of all files is shuffled up front to make sure that all classes are evenly distributed over the dataset. For containers such as TFRecord, the data is expected to be already mixed when it is written, so that the first part contains all classes, not only those from the beginning of the dataset. I suspect that the data is not mixed this way in the described experiment.
Please try increasing the initial_fill parameter value, or recreate the dataset making sure it is preshuffled.
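
To illustrate why the buffer size matters, here is a minimal pure-Python sketch of a fixed-size shuffle buffer. It is a simplified model, not DALI's actual implementation: the names `buffered_shuffle` and `files_in_first_batch` are hypothetical, and it assumes the 8 files are read sequentially with each file holding a distinct part of the data (for example, sorted by class).

```python
import random


def buffered_shuffle(samples, initial_fill, seed=0):
    """Yield samples in the order a fixed-size shuffle buffer would emit them."""
    rng = random.Random(seed)
    it = iter(samples)
    buffer = []
    # Fill the buffer with the first `initial_fill` samples, read sequentially.
    for sample in it:
        buffer.append(sample)
        if len(buffer) == initial_fill:
            break
    # Emit a random buffered sample, then refill that slot with the next
    # sequentially read sample.
    for sample in it:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = sample
    # Source exhausted: emit whatever remains in random order.
    rng.shuffle(buffer)
    yield from buffer


def files_in_first_batch(initial_fill, batch_size=256):
    """Return which of the 8 hypothetical files contribute to the first batch."""
    # Each sample is (file_id, index_in_file): 8 files x 1000 samples, in file order.
    dataset = [(f, i) for f in range(8) for i in range(1000)]
    order = list(buffered_shuffle(dataset, initial_fill))
    return sorted({f for f, _ in order[:batch_size]})


print(files_in_first_batch(1000))  # small buffer
print(files_in_first_batch(8000))  # buffer as large as the dataset
```

In this model, a buffer of 1000 samples means that only 1256 samples have been read by the time the first batch of 256 is complete, so that batch can only draw from the first two files. A buffer covering the whole dataset behaves like a global shuffle, which is also what preshuffling the data before writing the TFRecord files achieves with a small buffer.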

DuoblaK added the question Further information is requested label Aug 15, 2023

jantonguirao assigned szalpal Aug 16, 2023
