
training: shuffle data between epochs #74

Closed
wosiu opened this issue Apr 2, 2019 · 9 comments

@wosiu
Contributor

wosiu commented Apr 2, 2019

First of all, thank you for this fantastic framework! I've been using Tesseract for more than a year, but this one is way better for single-line processing :)

Proposal:
From the logs during training, it seems that the input images are not shuffled at all.
It would be nice if they were shuffled at least once at the very beginning.
It would be even better if the data were also reshuffled after each epoch, so that different batches are created.

@wosiu
Contributor Author

wosiu commented Apr 2, 2019

I could try to investigate that and make a PR. Let me know if that makes sense to you. Any hints are appreciated.

@ChWick
Member

ChWick commented Apr 3, 2019

See here:

dataset = dataset.repeat().shuffle(buffer_size, seed=self.network_proto.backend.random_seed)

The dataset is shuffled via the TensorFlow API (not during loading), but only within a window of buffer_size elements (1000 instances by default).
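For readers unfamiliar with buffered shuffling, here is a minimal, self-contained sketch (not Calamari's code; the sizes are illustrative) of what this buffer_size limit means: only buffer_size elements are held in memory at a time, so a sample can only move roughly buffer_size positions away from its original place in the stream.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10_000)   # stand-in for the training samples
buffer_size = 1000                        # Calamari's default at the time
shuffled = dataset.repeat().shuffle(buffer_size, seed=42)

# The first elements drawn can only come from the first ~1000 samples of the stream.
for x in shuffled.take(5):
    print(int(x))
```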

@wosiu
Contributor Author

wosiu commented Apr 10, 2019

What is the reason for limiting the shuffle to a buffer_size window? Why 1000 by default?
I have far more data instances and would still love to have them fully shuffled between epochs.

@ChWick
Member

ChWick commented Apr 11, 2019

The reason is that this size is directly related to the required RAM; there is no deeper reason for the particular value. We chose it because it ensures that Calamari also runs on the weaker/older laptops that are in use.
It should probably be made a parameter that can be changed via the command line (or the network parameters).
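To put the memory argument into rough numbers (the line-image dimensions below are an assumption for illustration, not taken from this thread):

```python
# Rough estimate of the RAM held by the shuffle buffer:
# 1000 grayscale line images of 48 x 2000 float32 pixels each (assumed size).
buffer_size = 1000
bytes_per_image = 48 * 2000 * 4                      # height * width * sizeof(float32)
print(buffer_size * bytes_per_image / 2**20, "MiB")  # ~366 MiB
```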

@wosiu
Contributor Author

wosiu commented May 7, 2019

+1 I'll try to implement it and open a PR in my free time. Please keep this issue open.

@ChWick
Member

ChWick commented May 8, 2019

@wosiu in 7b3be26 I added a training parameter to define the buffer size (--shuffle_buffer_size). Use 0 if you want the shuffle buffer to be the same size as the dataset, i.e. one full epoch.
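As a hedged illustration of the flag's semantics (a sketch, not the code from 7b3be26): a value of 0 can simply be mapped to the dataset size before it is passed to the shuffle call.

```python
# Illustrative only: interpreting --shuffle_buffer_size 0 as
# "use a buffer as large as the dataset", i.e. shuffle over a full epoch.
def effective_buffer_size(shuffle_buffer_size: int, num_samples: int) -> int:
    return num_samples if shuffle_buffer_size == 0 else shuffle_buffer_size
```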

@wosiu
Contributor Author

wosiu commented May 8, 2019

Thanks!
I'd like to ask about one last thing: I'm using --train_data_on_the_fly because my dataset is huge. In that case the batches are loaded on the fly, when needed, right? So my understanding is that with --train_data_on_the_fly the dataset could always be reshuffled after each epoch, no matter how big it is, because the only thing that needs to be shuffled is the list of paths to the images/gt.txt files. The shuffled list is then used by the parallel queue tasks to read the data, and that is repeated after each epoch.
Am I thinking correctly? Does it work like this?
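A minimal sketch (not Calamari's code) of the idea described above: only the list of image/gt.txt paths is reshuffled once per epoch, so the image data itself never has to sit in a shuffle buffer.

```python
import random

def epoch_order(sample_paths, epoch, seed=42):
    """Return a fresh ordering of the sample paths for the given epoch."""
    rng = random.Random(seed + epoch)   # deterministic, but different each epoch
    order = list(sample_paths)
    rng.shuffle(order)
    return order

# Example: the parallel loader tasks would then read the files in this order.
paths = [f"lines/{i:05d}.png" for i in range(5)]
print(epoch_order(paths, epoch=0))
print(epoch_order(paths, epoch=1))
```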

@ChWick
Member

ChWick commented May 8, 2019

@wosiu This is correct, and it was only a small change to the code. With that change you can set the shuffle buffer to a very small size, because its value becomes irrelevant when training on default (line_image, text) pairs. A larger buffer is still required if you load PageXML files or raw hdf5 files for training.
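A hedged sketch of why this works (assumed helper names, not the Calamari implementation): the full-size shuffle acts only on the cheap path strings, and the images are decoded afterwards, on the fly.

```python
import tensorflow as tf

def make_on_the_fly_dataset(image_paths, decode_fn):
    ds = tf.data.Dataset.from_tensor_slices(image_paths)
    # Shuffling every path costs almost no memory and yields a new order each epoch.
    ds = ds.shuffle(len(image_paths), reshuffle_each_iteration=True)
    # Images are only read after the order has been decided.
    ds = ds.map(decode_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    return ds
```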

@wosiu
Contributor Author

wosiu commented May 8, 2019

Looks perfect, many thanks :) Closing.

@wosiu wosiu closed this as completed May 8, 2019