Evaluation

OCR-D - evaluation of workflows and intermediate results

Problem statement
Tasks
Tools
Data
- structural GT (for preprocessing and segmentation)
- textual GT (for recognition)
Articles
Outlook

Problem statement

Which data and tools can we use to objectively measure quality and compare results of both complete workflows and individual steps (beyond the Character Error Rate of the final text) on a representative sample?

Tasks

Preprocessing

Pixel-by-pixel comparison (e.g. for binarization: what percentage of black pixels in the output are also black in the GT)
Connected-component statistics, specialised measures (e.g. of vertical/horizontal projection profiles)
RGB/Grayscale entropy or other general purpose image measures
Downstream segmentation quality/accuracy
Downstream recognition quality (gotcha)
Comparison of image histogram data with an ideal histogram, or simple histogram classification (especially for grayscale images), to determine what kind of preprocessing is needed, and how much, in terms of saturation, hue, contrast, brightness, ...

@cneud: e.g. this is the state of the art for binarization evaluation (sadly, the authors have not made their tools open or available)
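The pixel-by-pixel comparison listed above can be sketched in a few lines. This is a minimal illustration only, assuming that both images are already binarized, have identical dimensions, and are given as nested lists with 1 for black (foreground) and 0 for white; the function name `binarization_scores` is hypothetical and not part of any OCR-D tool. Precision corresponds to the question in the list ("what percentage of black pixels in the output are also black in the GT"); recall and F-measure are added because precision alone rewards outputs that are too sparse.

```python
def binarization_scores(output, gt):
    """Pixel-wise precision, recall and F-measure of the black (foreground) class.

    Both arguments are 2D lists of equal shape with 1 = black and 0 = white.
    """
    if len(output) != len(gt) or any(len(a) != len(b) for a, b in zip(output, gt)):
        raise ValueError("images must have identical dimensions")
    tp = fp = fn = 0
    for row_out, row_gt in zip(output, gt):
        for o, g in zip(row_out, row_gt):
            if o and g:
                tp += 1  # black in both output and GT
            elif o:
                fp += 1  # black in output only
            elif g:
                fn += 1  # black in GT only
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure


# Toy example: 2x3 images
output = [[1, 1, 0],
          [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
precision, recall, f_measure = binarization_scores(output, gt)
```

For real page images one would load the pixel data with an imaging library and vectorize the counting; the nested-list form is used here only to keep the sketch free of dependencies.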

Segmentation

https://github.com/OCR-D/ocrd_segment/wiki/SegmentationEvaluation

Recognition / Post-correction

Edit distance of characters/words after text alignment with GT
Edit distance or precision/recall after indexing GT (ignoring reading order and/or textline order and/or reading direction – "bag of words")
(only for post-correction:) precision/recall of correction
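As a rough sketch of the first two measures above: a plain Levenshtein distance gives character or word error rates once the texts are aligned, and a multiset ("bag of words") comparison gives precision/recall that ignores ordering. The function names are illustrative, and normalising by the GT length is only one common convention; the tools listed below differ in normalization and alignment details.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalised by the length of the GT sequence."""
    if not ref:
        raise ValueError("GT sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def bag_of_words_scores(gt_text, ocr_text):
    """Precision and recall of OCR tokens against GT tokens, ignoring order."""
    gt_bag = Counter(gt_text.split())
    ocr_bag = Counter(ocr_text.split())
    matched = sum((gt_bag & ocr_bag).values())  # multiset intersection
    precision = matched / sum(ocr_bag.values()) if ocr_bag else 1.0
    recall = matched / sum(gt_bag.values()) if gt_bag else 1.0
    return precision, recall


gt_line = "the quick brown fox"
ocr_line = "the quikc brown fox"
cer = error_rate(gt_line, ocr_line)                  # character level
wer = error_rate(gt_line.split(), ocr_line.split())  # word level
# Same tokens in a different order: the bag-of-words scores are unaffected
bow = bag_of_words_scores(gt_line, "fox brown quikc the")
```

Note that the transposed letters count as two character edits under plain Levenshtein distance but as a single word error, which is one reason CER and WER should be reported together.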

complementing historical dictionaries:

Canonicalization of historic orthography (via hybrid stochastic and linguistic modelling) at BBAW: http://www.deutschestextarchiv.de/doku/software#cab (note: has no rejection or confidence scoring, so every historical input will get an analysis)
Decanonicalization of modern orthography (called historical patterns) at CIS: https://github.com/cisocrgroup/Resources/tree/master/lexica (includes references; note: is always fuzzy, so inputs will always produce many different historical candidates)

Tools

PRImA LayoutEval

https://www.primaresearch.org/alternative_download_links.html

https://www.dropbox.com/s/ky53r9k79tb0ywz/LayoutEvaluation_1.9.132.zip?dl=0

(partial source code:) https://github.com/PRImA-Research-Lab/prima-layout-eval

(partial documentation:) https://github.com/PRImA-Research-Lab/prima-layout-eval/blob/master/doc/liblayouteval.pdf

ocrd-segment-evaluate

A Free Software reimplementation of PRImA LayoutEval in Python has been started by @bertsky and @wrznr.

Dinglehopper

https://github.com/qurator-spk/dinglehopper

… text alignment CER/WER (mean) per page, visual comparison

cor-asv-ann-evaluate

https://github.com/ASVLeipzig/cor-asv-ann

… text alignment CER (mean+variance) with (multi-OCR) aggregation across pages/documents, confusion statistics, various metrics and normalization options (GT levels)
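To illustrate what aggregation across pages involves (this is not cor-asv-ann's actual code, only a sketch under simple assumptions), the example below takes per-page edit distances and GT lengths and reports the mean and variance of the per-page CER next to the length-weighted overall CER. The name `aggregate_cer` and the input format are hypothetical.

```python
from statistics import mean, pvariance


def aggregate_cer(pages):
    """Aggregate page-level results.

    pages: list of (edit_distance, gt_length) tuples, one per page.
    Returns (mean of page CERs, variance of page CERs, length-weighted CER).
    """
    pages = [(dist, length) for dist, length in pages if length > 0]
    if not pages:
        raise ValueError("no pages with non-empty GT")
    rates = [dist / length for dist, length in pages]
    macro = mean(rates)          # every page counts equally
    variance = pvariance(rates)  # spread of quality across pages
    micro = sum(d for d, _ in pages) / sum(n for _, n in pages)  # every character counts equally
    return macro, variance, micro


# Three pages: (edit distance, number of GT characters)
macro, variance, micro = aggregate_cer([(5, 100), (30, 200), (0, 50)])
```

The two averages can diverge noticeably when page lengths vary, so a report should state which one it uses.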

eddieantonio / ocreval

https://github.com/eddieantonio/ocreval

… ISRI evaluation tools with Unicode fixes

language-tool

Service for dictionary look-ups, with 25 languages already integrated; accessible via a web API

https://github.com/languagetool-org/languagetool

digital-eval

Tool for OCR evaluation of large, structured data sets with more than 1,000 items. It includes string-based evaluations based on normalized edit distance, as well as IR metrics and a connector for a dictionary metric based on language-tool (if present in containerized mode).

https://github.com/ulb-sachsen-anhalt/digital-eval

Data

Where can we find challenging, yet well-annotated data to test such evaluations?

structural GT (for preprocessing and segmentation)

How about OCR-D structure GT (1000 pages DTA), @tboenig?

textual GT (for recognition)

How about OCR-D structure+text GT?

Articles

Automatische Qualitätsverbesserung von Fraktur-Volltexten aus der Retrodigitalisierung am Beispiel der Zeitschrift Die Grenzboten

Outlook

Which eval tools could be wrapped with an OCR-D CLI with manageable effort?
What methodology do we use to evaluate processors, parameters and workflows? (GT curation, error/quality aggregation/slicing across metadata, overall CER vs. step-specific measures, ...)
How should evaluation fit into OCR-D workflow runtime management? (I.e. should missing a threshold value cause the workflow to fail? Or only the page? Should we reuse the ValidationReport mechanics used elsewhere?)
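To make the last question more concrete, here is a purely hypothetical sketch of a threshold check with two failure scopes. None of the names (`apply_quality_gate`, `scope`, the page IDs) come from OCR-D core, and the sketch does not use the ValidationReport mechanics; it only shows the difference between flagging individual pages and failing the whole workflow.

```python
def apply_quality_gate(page_cer, max_cer, scope="page"):
    """Check per-page CER values against a threshold.

    page_cer: dict mapping page ID to CER.
    scope: "page" flags only the offending pages and lets the workflow continue;
           "workflow" makes the whole workflow fail if any page misses the threshold.
    Returns (workflow_ok, failed_pages).
    """
    if scope not in ("page", "workflow"):
        raise ValueError("scope must be 'page' or 'workflow'")
    failed_pages = sorted(page for page, cer in page_cer.items() if cer > max_cer)
    workflow_ok = not failed_pages if scope == "workflow" else True
    return workflow_ok, failed_pages


scores = {"PHYS_0001": 0.02, "PHYS_0002": 0.12, "PHYS_0003": 0.04}
page_result = apply_quality_gate(scores, max_cer=0.05, scope="page")
workflow_result = apply_quality_gate(scores, max_cer=0.05, scope="workflow")
```

Which scope is appropriate, and where the threshold should be configured, remains an open design question.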
