An OCR Benchmarking Experiment

This repository holds replication materials for the manuscript "OCR with Tesseract, Amazon Textract, and Google Document AI: A Benchmarking Experiment". It contains:

  • The .RMD file of the manuscript. It should knit if you clone the repository and work within noisy-ocr-benchmark.Rproj.
  • 51,304 .TXT files with the text output from all the OCR processing requests.
  • .CSV files with the data underlying all the figures (see the loading sketch below).
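
For orientation, here is a minimal R sketch of how the materials might be used. The file names `manuscript.Rmd` and `figure1.csv` are placeholders, not the actual names in the repository, so check the repository tree before running.

```r
## Minimal sketch of working with the replication materials.
## NOTE: "manuscript.Rmd" and "figure1.csv" are hypothetical file names;
## substitute the actual file names from the repository.
library(rmarkdown)
library(readr)

# Knit the manuscript. Run this from within noisy-ocr-benchmark.Rproj
# so that relative paths resolve correctly.
rmarkdown::render("manuscript.Rmd")

# Load the data behind one of the figures for inspection or reanalysis.
fig_data <- readr::read_csv("figure1.csv")
head(fig_data)
```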

The image test materials reside in a separate Zenodo repository as the "Noisy OCR Dataset" (NOD).

Paper abstract:

Optical Character Recognition (OCR) can open up understudied historical documents to computational analysis, but the accuracy of OCR software varies. This article reports a benchmarking experiment comparing the performance of Tesseract, Amazon Textract, and Google Document AI on images of English and Arabic text. English-language book scans (n=322) and Arabic-language article scans (n=100) were replicated 43 times with different types of artificial noise for a corpus of 18,568 documents, generating 51,304 process requests. Document AI delivered the best results, and the server-based processors (Textract and Document AI) performed substantially better than Tesseract, especially on noisy documents. Accuracy for English was considerably higher than for Arabic. Specifying the relative performance of three leading OCR products and the differential effects of commonly found noise types can help scholars identify better OCR solutions for their research needs. The test materials have been preserved in the openly available “Noisy OCR Dataset” (NOD) for reuse in future benchmarking studies.

Core results: (figures not rendered here; see the manuscript, with the underlying data in the .CSV files listed above.)
