Early Modern OCR Project

Open source tools and training for OCR'ing 15th-18th Century printed documents with Tesseract.

  • Ocular is a state-of-the-art historical OCR system.

    Java 2 40 GPL-3.0 Updated Aug 31, 2017
  • eMOP Controller

    Python 1 Updated Oct 27, 2016
  • Code to remove "noise" from the hOCR output of Tesseract OCR.

    Python 11 5 Apache-2.0 Updated Oct 24, 2016
  • Ruby 2 Apache-2.0 Updated Aug 18, 2016
  • GitHub organization page for the Early Modern OCR Project (eMOP)

    JavaScript 1 1 Apache-2.0 Updated Feb 3, 2016
  • A database of early modern printers and sellers culled from the eMOP source documents

    3 2 Apache-2.0 Updated Jan 20, 2016
  • Document level full-text of TCP transcribed ECCO docs (2188)

    3 3 CC0-1.0 Updated Jan 19, 2016
  • Baskerville typeface training for Gamera OCR engine

    Python Updated Dec 22, 2015
  • Ruby Apache-2.0 Updated Nov 19, 2015
  • Needed for emop-dashboard.

    Java Updated Nov 19, 2015
  • Scala code to correct Tesseract OCR output and generate ALTO XML and text files. Uses dictionary files, rules, and a Google 3-gram DB to make corrections.

    Scala 7 Updated Nov 2, 2015
  • Converts Tesseract's hOCR output to ALTO, for eMOP

    3 Apache-2.0 Updated Oct 16, 2015
  • Java code to examine the output of Tesseract OCR and generate scores for general page quality and correctability (see page-corrector repo).

    Java 6 1 Updated Sep 24, 2015
  • Training files produced for and by the Tesseract OCR engine for work on the Early Modern OCR Project (eMOP)

    27 6 Updated Sep 24, 2015
  • Part of eMOP: the Recursive Text Alignment Tool compares OCR text results to groundtruth by character and computes a score.

    Java 8 4 Updated Sep 24, 2015
  • Forked version of the Juxta Command Line tool created for eMOP by Performant Software Solutions. Will be official after eMOP is complete (10/1/14).

    Java 1 Apache-2.0 Updated Sep 24, 2015
  • Part of eMOP: Franken+ tool for creating font training for Tesseract OCR engine from page images.

    C# 20 7 Apache-2.0 Updated Sep 24, 2015
  • Cobre is a robust image comparison environment that presents versions of texts alongside each other in a filmstrip view and collates these images of different texts, while allowing users to adjust the collation.

    JavaScript 1 Apache-2.0 Updated Sep 24, 2015
  • Web-based page layout editor created for eMOP (Early Modern OCR Project).

    HTML 4 Apache-2.0 Updated Sep 24, 2015
