
training: shuffle data between epochs #74

Closed
wosiu opened this issue Apr 2, 2019 · 9 comments

@wosiu
Contributor

wosiu commented Apr 2, 2019

First of all, thank you for this fantastic framework! I've been using Tesseract for more than a year, but this one is way better for single-line processing :)

Proposal:
From the logs during training, it seems that the input images are not shuffled at all.
It would be nice if they were shuffled at least once at the very beginning.
It would be even better if the data were also reshuffled after each epoch, so that different batches are created.

@wosiu
Contributor Author

wosiu commented Apr 2, 2019

I could try to investigate that and make a PR. Let me know if that makes sense to you. Any hints are appreciated.

@ChWick
Member

ChWick commented Apr 3, 2019

See here:

dataset = dataset.repeat().shuffle(buffer_size, seed=self.network_proto.backend.random_seed)

The dataset is shuffled via the TensorFlow API (not during loading), but only within a window of buffer_size elements (1000 instances by default).
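For readers unfamiliar with buffered shuffling, here is a minimal, self-contained sketch (not Calamari's code; the sizes are illustrative) of what this buffer_size limit means: only buffer_size elements are held in memory at a time, so a sample can only move roughly buffer_size positions away from its original place in the stream.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10_000)   # stand-in for the training samples
buffer_size = 1000                        # Calamari's default at the time
shuffled = dataset.repeat().shuffle(buffer_size, seed=42)

# The first elements drawn can only come from the first ~1000 samples of the stream.
for x in shuffled.take(5):
    print(int(x))
```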

@wosiu
Contributor Author

wosiu commented Apr 10, 2019

What is the reason for limiting the shuffle to a buffer_size window? Why 1000 by default?
I have far more data instances and would still love to have them fully shuffled between epochs.

@ChWick
Member

ChWick commented Apr 11, 2019

The reason is that this size is directly related to the required RAM; there is no deeper reason for the particular value. We chose it because it ensures that Calamari also runs on the weaker/older laptops that are in use.
It should probably be made a parameter that can be changed via the command line (or the network parameters).
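To put the memory argument into rough numbers (the line-image dimensions below are an assumption for illustration, not taken from this thread):

```python
# Rough estimate of the RAM held by the shuffle buffer:
# 1000 grayscale line images of 48 x 2000 float32 pixels each (assumed size).
buffer_size = 1000
bytes_per_image = 48 * 2000 * 4                      # height * width * sizeof(float32)
print(buffer_size * bytes_per_image / 2**20, "MiB")  # ~366 MiB
```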

@wosiu
Contributor Author

wosiu commented May 7, 2019

+1 I'll try to implement it and open a PR in my free time. Please keep this issue open.

@ChWick
Member

ChWick commented May 8, 2019

@wosiu in 7b3be26 I added a training parameter to define the buffer size (--shuffle_buffer_size). Use 0 if you want the shuffle buffer to be the same size as the dataset, i.e. one full epoch.
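As a hedged illustration of the flag's semantics (a sketch, not the code from 7b3be26): a value of 0 can simply be mapped to the dataset size before it is passed to the shuffle call.

```python
# Illustrative only: interpreting --shuffle_buffer_size 0 as
# "use a buffer as large as the dataset", i.e. shuffle over a full epoch.
def effective_buffer_size(shuffle_buffer_size: int, num_samples: int) -> int:
    return num_samples if shuffle_buffer_size == 0 else shuffle_buffer_size
```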

@wosiu
Contributor Author

wosiu commented May 8, 2019

Thanks!
I'd like to ask about one last thing: I'm using --train_data_on_the_fly because my dataset is huge. In that case the batches are loaded on the fly, when needed, right? So my understanding is that with --train_data_on_the_fly the dataset could always be reshuffled after each epoch, no matter how big it is, because the only thing that needs to be shuffled is the list of paths to the images/gt.txt files. The shuffled list is then used by the parallel queue tasks to read the data, and that is repeated after each epoch.
Am I thinking correctly? Does it work like this?
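A minimal sketch (not Calamari's code) of the idea described above: only the list of image/gt.txt paths is reshuffled once per epoch, so the image data itself never has to sit in a shuffle buffer.

```python
import random

def epoch_order(sample_paths, epoch, seed=42):
    """Return a fresh ordering of the sample paths for the given epoch."""
    rng = random.Random(seed + epoch)   # deterministic, but different each epoch
    order = list(sample_paths)
    rng.shuffle(order)
    return order

# Example: the parallel loader tasks would then read the files in this order.
paths = [f"lines/{i:05d}.png" for i in range(5)]
print(epoch_order(paths, epoch=0))
print(epoch_order(paths, epoch=1))
```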

@ChWick
Member

ChWick commented May 8, 2019

@wosiu This is correct, and it was only a small change to the code. With that change you can set the shuffle buffer to a very small size, because its value becomes irrelevant when training on default (line_image, text) pairs. A larger buffer is still required if you load PageXML files or raw hdf5 files for training.
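A hedged sketch of why this works (assumed helper names, not the Calamari implementation): the full-size shuffle acts only on the cheap path strings, and the images are decoded afterwards, on the fly.

```python
import tensorflow as tf

def make_on_the_fly_dataset(image_paths, decode_fn):
    ds = tf.data.Dataset.from_tensor_slices(image_paths)
    # Shuffling every path costs almost no memory and yields a new order each epoch.
    ds = ds.shuffle(len(image_paths), reshuffle_each_iteration=True)
    # Images are only read after the order has been decided.
    ds = ds.map(decode_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    return ds
```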

@wosiu
Contributor Author

wosiu commented May 8, 2019

Looks perfect, many thanks :) Closing.

@wosiu wosiu closed this as completed May 8, 2019