training: shuffle data between epochs #74
Comments
I could try to investigate that and make a PR. Let me know if that makes sense to you. Any hints are appreciated.
See here:
The dataset is shuffled using the TensorFlow API (not during loading), but only up to a maximum of
What is the reason to have this
The reason is that this size is directly related to the required RAM; there is no deeper reason for it. We chose this value because it ensures that Calamari also runs on the weaker/older laptops that are in use.
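The RAM/shuffle trade-off described above comes from how a streaming shuffle buffer works: only `buffer_size` samples are held in memory at once, and each output is drawn at random from that window. A minimal stdlib-only sketch of the idea (the same mechanism behind TensorFlow's `Dataset.shuffle(buffer_size)`; the function name and signature here are illustrative, not Calamari's API):

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in approximately random order while holding at most
    buffer_size of them in memory at once. A larger buffer gives a more
    uniform shuffle but costs proportionally more RAM."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Emit a random element from the window; memory stays bounded.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Drain whatever is left at the end of the stream.
    rng.shuffle(buffer)
    yield from buffer
```

Note that with a small buffer the output is only locally shuffled: an item can never appear more than `buffer_size` positions earlier than its input position, which is why a too-small buffer on a sorted dataset gives poor mixing.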
+1 I'll then try to develop it and open a PR in my free time. Please keep this issue open.
Thanks!
@wosiu This is correct, and it was only a small change to the code. With the change, you can set the shuffle buffer to a very small size, because it is irrelevant if you are training with the default (line_image, text) pairs. It is still required if you load PageXML files or raw HDF5 files for training.
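The reason a tiny buffer suffices in the default case is that the shuffle can happen on the lightweight sample references (file path, transcription) before any image is loaded, rather than on decoded image tensors inside the pipeline. A hedged stdlib sketch of that idea; the sample list and `epoch_order` helper are hypothetical names, not Calamari's actual code:

```python
import random

# Hypothetical sample list standing in for (line image path, text) pairs;
# only these cheap references are shuffled, never the image data itself.
samples = [(f"line_{i:04d}.png", f"transcription {i}") for i in range(100)]

rng = random.Random(42)

def epoch_order(samples, rng):
    """Return a freshly shuffled copy of the sample references for one
    epoch. Images are loaded lazily afterwards, so only a tiny (or no)
    in-memory shuffle buffer is needed downstream."""
    order = list(samples)
    rng.shuffle(order)
    return order

epoch1 = epoch_order(samples, rng)
epoch2 = epoch_order(samples, rng)
```

For formats where samples cannot be addressed individually up front (e.g. records packed inside PageXML or HDF5 containers), this pre-shuffle is not possible, which matches the comment above that the buffer is still required there.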
Looks perfect, many thanks :) Closing. |
First of all, thank you for this fantastic framework! I've been using Tesseract for more than a year, but this one is far better for single-line processing :)
Proposal:
From the logs during training, it seems that the input images are not shuffled at all.
It would be nice if they were shuffled at least once at the very beginning.
And it would be perfect if the data were also reshuffled after each epoch, so that different batches are created.
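The behavior proposed above (shuffle up front, then reshuffle before every epoch so batch compositions differ) can be sketched in a few lines of plain Python; in `tf.data` it roughly corresponds to `dataset.shuffle(buffer_size, reshuffle_each_iteration=True)`. The function below is an illustrative sketch, not the framework's implementation:

```python
import random

def shuffled_batches(data, batch_size, epochs, seed=0):
    """Yield the list of batches for each epoch, reshuffling the data
    before every epoch so that each epoch sees differently composed
    batches while still covering every sample exactly once."""
    rng = random.Random(seed)
    order = list(data)
    for _ in range(epochs):
        rng.shuffle(order)  # reshuffle -> new batch composition each epoch
        yield [order[i:i + batch_size]
               for i in range(0, len(order), batch_size)]

epoch1, epoch2 = shuffled_batches(range(100), batch_size=10, epochs=2)
```

Each epoch still iterates over the full dataset; only the order (and therefore the grouping into batches) changes between epochs.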