v1.0

@nshaud nshaud released this 07 May 14:08

This is the first release of the QS-OCR-Small and QS-OCR-Large datasets!

The archives available for download here contain the text extracted from the Tobacco3482 and RVL-CDIP datasets using Tesseract OCR v4.0. The large dataset (QS-OCR-Large) contains 400,000 text files labeled in 16 classes, and the small one (QS-OCR-Small) contains 3,482 files in 10 classes.

They can be used to train text classifiers that are robust to OCR noise, such as missing words, substituted characters, and hallucinated diacritics.
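
As a starting point for training a classifier, the text files need to be loaded together with their labels. The sketch below is only an illustration: it assumes that an extracted archive has one sub-directory per class containing `.txt` files, which is not confirmed by this release note, and `load_ocr_dataset` is a hypothetical helper name. Adjust the paths to the layout you actually find in the archive.

```python
from pathlib import Path


def load_ocr_dataset(root):
    """Load OCR text files and labels.

    Assumes (hypothetically) a layout of root/<class_name>/**/*.txt,
    where the first-level directory name is the class label.
    """
    texts, labels = [], []
    root = Path(root)
    # Iterate over class directories in a stable, sorted order.
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for txt_file in sorted(class_dir.rglob("*.txt")):
            # OCR output can contain invalid byte sequences;
            # replace them instead of raising a decoding error.
            texts.append(txt_file.read_text(encoding="utf-8", errors="replace"))
            labels.append(class_dir.name)
    return texts, labels
```

The returned `texts` and `labels` lists can then be passed to any text classification pipeline. Empty files are kept on purpose, since documents with no recognized text are part of the OCR noise a classifier may have to handle.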