This is the first release of the QS-OCR-Small and QS-OCR-Large datasets!
The archives available for download here contain the text extracted from the Tobacco3482 and RVL-CDIP document image datasets with Tesseract OCR v4.0. The large dataset contains 400,000 text files labeled in 16 classes, and the small one contains 3,482 files in 10 classes.
They can be used to train text classifiers that are robust to OCR noise such as missing words, substituted characters, and hallucinated diacritics.
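As a starting point for training a classifier, the text files can be read into parallel lists of documents and labels. The sketch below is only an illustration: it assumes the extracted archive has one subdirectory per class containing `.txt` files, which may not match the actual layout, and `load_ocr_corpus` is a hypothetical helper name, not part of the release.

```python
from pathlib import Path


def load_ocr_corpus(root):
    """Read OCR text files into (texts, labels).

    Assumption: files are laid out as root/<class_name>/.../*.txt.
    Adjust the traversal if the extracted archive is organized differently.
    """
    texts, labels = [], []
    class_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for class_dir in class_dirs:
        for txt_file in sorted(class_dir.rglob("*.txt")):
            # OCR output can contain undecodable bytes; replace them
            # instead of raising, so one bad file does not stop the load.
            texts.append(txt_file.read_text(encoding="utf-8", errors="replace"))
            # The label is taken from the top-level class directory name.
            labels.append(class_dir.name)
    return texts, labels
```

Calling `load_ocr_corpus` on the extracted directory would return two lists of equal length that can be passed to any text vectorizer and classifier.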