Workflow Guide clipping

Robert Sachunsky edited this page Oct 7, 2020 · 2 revisions

In this processing step, intrusions of neighbouring non-text segments (e.g. separators) or text segments (e.g. ascenders/descenders) into the text regions of a page (or into the text lines of a text region) can be removed. A connected component analysis is run on every segment together with its overlapping neighbours. For each conflicting binary object, a rule based on majority and proper containment then determines whether it belongs to the neighbour and can therefore be clipped to the background.
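The idea can be illustrated with a minimal, self-contained sketch. It is not the code of the actual processor: the function names, the rectangular segments and the exact form of the majority/containment rule are assumptions made for illustration only.

```python
from collections import deque

def label_components(image):
    """Collect 8-connected foreground components as sets of (row, col)."""
    height, width = len(image), len(image[0])
    seen, components = set(), []
    for r in range(height):
        for c in range(width):
            if not image[r][c] or (r, c) in seen:
                continue
            component, queue = set(), deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                component.add((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            components.append(component)
    return components

def box(top, left, bottom, right):
    """Pixels of a rectangular segment (bottom/right exclusive); a stand-in for a polygon mask."""
    return {(y, x) for y in range(top, bottom) for x in range(left, right)}

def clip_segment(image, segment, neighbour):
    """Return a copy of image where objects attributed to neighbour are erased inside segment."""
    clipped = [list(row) for row in image]
    for component in label_components(image):
        in_segment = component & segment
        in_neighbour = component & neighbour
        if not in_segment or not in_neighbour:
            continue  # the object does not touch both segments: no conflict
        only_segment = in_segment - neighbour
        only_neighbour = in_neighbour - segment
        # Assumed rule: the object belongs to the neighbour if it lies completely
        # inside the neighbour (but not inside the segment), or if more of it lies
        # exclusively in the neighbour than exclusively in the segment.
        contained = component <= neighbour and not component <= segment
        majority = len(only_neighbour) > len(only_segment)
        if contained or majority:
            for y, x in in_segment:
                clipped[y][x] = 0  # clip to background
    return clipped

# Toy page: the upper region (rows 0-3) and the lower region (rows 3-5) overlap in row 3.
image = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0],  # column 1: own descender, column 5: intruding ascender
    [0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0],
]
upper, lower = box(0, 0, 4, 8), box(3, 0, 6, 8)
clipped = clip_segment(image, upper, lower)
for row in clipped[:4]:  # the cleaned image of the upper region
    print("".join(map(str, row)))
```

In this toy example, the stroke in column 5 lies mostly in the lower region, so its part inside the upper region is cleared. The stroke in column 1 lies mostly in the upper region and is kept.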

This basic text/non-text segmentation ensures that each text region gets a clean image, free of interference from separators and neighbouring text. (On the region level, cleaning via coordinates alone would be impossible in many common cases.) On the line level, clipping can be seen as an alternative to resegmentation.

Note: Clipping must be applied before any processor that produces derived images for the same hierarchy level (region/line). Annotations on the next higher level (page/region) are fine, of course.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-cis-ocropy-clip | -P level-of-operation region | | ocrd-cis-ocropy-clip -I OCR-D-DESKEW-REG -O OCR-D-CLIP-REG -P level-of-operation region |
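When the call is generated from a script, it can be assembled as follows. This is only a sketch: the helper `clip_command` is hypothetical, the value `line` for `level-of-operation` is assumed from the description above (text lines of a text region), and the file group names `OCR-D-SEG-LINE` and `OCR-D-CLIP-LINE` are made up for illustration.

```python
import shlex

def clip_command(input_group, output_group, level="region"):
    """Build the ocrd-cis-ocropy-clip call as an argument list (hypothetical helper)."""
    if level not in ("region", "line"):  # assumed set of allowed values
        raise ValueError(f"unsupported level-of-operation: {level}")
    return [
        "ocrd-cis-ocropy-clip",
        "-I", input_group,
        "-O", output_group,
        "-P", "level-of-operation", level,
    ]

# Region level, with the file groups from the table above
region_call = shlex.join(clip_command("OCR-D-DESKEW-REG", "OCR-D-CLIP-REG"))
# Line level, with made-up file group names
line_call = shlex.join(clip_command("OCR-D-SEG-LINE", "OCR-D-CLIP-LINE", level="line"))

print(region_call)
print(line_call)
```

The argument list returned by `clip_command` can also be passed directly to `subprocess.run`, which avoids shell quoting issues.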

Notes on parameter usage

E.g.

  • which parameters do you use with what values?
  • which parameters are insufficiently documented?
  • which aspects of a processor should be parameterizable but are not?

Notes on document-specific usage

E.g. which processors worked best with what material? -- feel free to post sample images here, too.
