v1.0

Latest

nshaud released this 07 May 14:08

· 1 commit to master since this release

d1d73c5

This is the first release of the QS-OCR-Small and QS-OCR-Large datasets!

The archives attached to this release contain the text extracted from the Tobacco3482 and RVL-CDIP datasets with Tesseract OCR v4.0. The large dataset contains 400,000 text files labeled with 16 classes, and the small one contains 3,482 files labeled with 10 classes.

They can be used to train text classifiers that are robust to OCR noise, such as missing words, substituted characters, and hallucinated diacritics.
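
As a starting point for training a classifier, the extracted text files could be loaded into parallel lists of documents and labels. The sketch below is only an illustration: it assumes each archive unpacks into one subdirectory per class containing `.txt` files, which these notes do not confirm, and the function name `load_ocr_dataset` is hypothetical.

```python
from pathlib import Path


def load_ocr_dataset(root):
    """Return (texts, labels) from a directory with one subfolder per class.

    Assumed layout (not confirmed by the release notes):
        root/<class_name>/<document>.txt
    """
    texts, labels = [], []
    class_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for class_dir in class_dirs:
        for txt_file in sorted(class_dir.glob("*.txt")):
            # OCR output may contain invalid byte sequences, so undecodable
            # bytes are replaced instead of raising an error.
            texts.append(txt_file.read_text(encoding="utf-8", errors="replace"))
            labels.append(class_dir.name)
    return texts, labels
```

The resulting `texts` and `labels` lists can then be passed to whichever text classification pipeline you prefer. If the actual archive layout differs, only the directory traversal needs to change.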
