Add fontshape processor and all-in-one segmentation #158

bertsky · 2020-10-19T11:19:22Z

We can probably remove both the old segment-region/line/word and new (all-in-one) segment altogether now that we can configure them via overwrite_* and textequiv_level in recognize. Or we keep the CLI names, but delegate to recognize @kba?

- new processor `segment`, using the `AnalyseLayout` iterator for all hierarchy levels at once (avoiding textline overlap between regions); this sidelines the existing isolated processors `segment-{region,line,word}` and will ultimately replace them; postprocessing by polygonalisation of text lines and shrinking/clipping of text regions is still necessary, though
- `recognize`, `segment-line`, `segment-word`: use fully recursive region iterator
- new processor `fontshape`, using pre-LSTM recognition models to query `WordFontAttributes` for all word bboxes and annotate them via `TextStyle`


- use `overwrite_regions/lines/words` to enable segmentation
- use `textequiv_level=none` to disable recognition
- use Tesseract's `AnalyseLayout` for segmentation-only, but `Recognize` for segmentation and recognition
- Tesseract gets image at the highest necessary hierarchy level, one shared iterator from there on
- `padding` means raw pixels for segmentation and artificial pixels for existing images
- integrate all page segmentation parameters (`sparse_text`, `find_tables`, `block_polygons` ...)

codecov · 2020-10-19T11:21:39Z

Codecov Report

Merging #158 (c44da6b) into master (24b7ced) will increase coverage by 2.85%.
The diff coverage is 36.89%.

@@            Coverage Diff             @@
##           master     #158      +/-   ##
==========================================
+ Coverage   37.73%   40.58%   +2.85%     
==========================================
  Files           9       11       +2     
  Lines        1023     1126     +103     
  Branches      216      236      +20     
==========================================
+ Hits          386      457      +71     
- Misses        565      585      +20     
- Partials       72       84      +12

| Impacted Files | Coverage Δ | |
|---|---|---|
| ocrd_tesserocr/deskew.py | `15.00% <0.00%> (-0.47%)` | ⬇️ |
| ocrd_tesserocr/crop.py | `13.67% <4.34%> (+0.94%)` | ⬆️ |
| ocrd_tesserocr/fontshape.py | `17.64% <17.64%> (ø)` | |
| ocrd_tesserocr/segment_table.py | `36.00% <17.64%> (+36.00%)` | ⬆️ |
| ocrd_tesserocr/segment.py | `39.13% <39.13%> (ø)` | |
| ocrd_tesserocr/recognize.py | `48.29% <42.36%> (-0.46%)` | ⬇️ |
| ocrd_tesserocr/segment_line.py | `96.00% <94.11%> (+23.69%)` | ⬆️ |
| ocrd_tesserocr/segment_word.py | `96.00% <94.11%> (+23.27%)` | ⬆️ |
| ocrd_tesserocr/segment_region.py | `96.29% <94.73%> (+46.86%)` | ⬆️ |
| ... and 3 more | | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data


kba

Great stuff

kba · 2020-10-19T14:42:27Z

> We can probably remove both the old segment-region/line/word and new (all-in-one) segment altogether now that we can configure them via overwrite_* and textequiv_level in recognize. Or we keep the CLI names, but delegate to recognize @kba?

We should keep the segment{,-{region,word,line} CLIs so we don't break existing workflows. If they can delegate to recognize with fixed textequiv_level and overwrite_LEVEL parameters, that would be best. And reduce code duplication, making recognize more complex but much more powerful.

bertsky · 2020-10-19T14:46:21Z

> If they can delegate to recognize with fixed textequiv_level and overwrite_LEVEL parameters, that would be best.

One problem I still see with such delegation is that we would still need to repeat the tool json. And we would still have to write some thin module instantiating TesserocrRecognize and translating the call.

kba · 2020-10-19T14:54:26Z

> One problem I still see with such delegation is that we would still need to repeat the tool json. And we would still have to write some thin module instantiating TesserocrRecognize and translating the call.

Duplicate info in the ocrd-tool.json is okay I think, it's one place and the user doesn't normally see it. If that is too error-prone, a generator script for ocrd-tool.json might help.

> thin module instantiating TesserocrRecognize and translating the call.

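Yes, but you could inherit/encapsulate TesserocrRecognize and do something like

```python
from .recognize import TesserocrRecognize

class TesserocrLineSegmenter(TesserocrRecognize):
    def process(self):
        # force segmentation-only mode, then run the inherited implementation
        self.parameter['overwrite_lines'] = True
        self.parameter['textequiv_level'] = "none"
        return super().process()  # bound call: do not pass self again
```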

bertsky · 2020-10-19T20:19:04Z

> Yes, but you could inherit/encapsulate TesserocrRecognize and do something like
>
> ```python
> class TesserocrLineSegmenter(TesserocrRecognize):
>     def process(self):
>         self.parameter['overwrite_lines'] = True
>         self.parameter['textequiv_level'] = "none"
>         return super().process()
> ```

It's not quite as easy, I'm afraid. In the constructor, we must pick the derived tool from the tool.json tools (or -h and -J won't work). So we need to parse the derived tool json and apply it against the input params, but then in process we have to fill in the defaults of the original tool json. So the superclass' json must be dumped, a ParameterValidator constructed for it, and applied against the input params a second time.

Or am I getting it all wrong?

kba · 2020-10-20T09:12:40Z

How about this:

```python
from ocrd import Processor

from .config import OCRD_TOOL
from .recognize import TesserocrRecognize

TOOL = 'ocrd-tesserocr-segment-line'

class TesserocrSegmentLine(Processor):

    def __init__(self, *args, **kwargs):
        # register under the derived tool's own json, so -h and -J show the right tool
        kwargs['ocrd_tool'] = OCRD_TOOL['tools'][TOOL]
        kwargs['version'] = OCRD_TOOL['version']
        super().__init__(*args, **kwargs)

        # the delegate must neither print help/version nor see the caller's parameters
        recognize_kwargs = {**kwargs}
        recognize_kwargs.pop('show_help', None)
        recognize_kwargs.pop('show_version', None)
        recognize_kwargs['parameter'] = {}
        recognize_kwargs['parameter']['overwrite_lines'] = True
        recognize_kwargs['parameter']['textequiv_level'] = "none"
        self.recognizer = TesserocrRecognize(**recognize_kwargs)

    def process(self):
        return self.recognizer.process()
```

bertsky · 2020-10-20T09:19:53Z

> How about this:

We still need to do sth about the workspace argument.

~~Maybe pass None so it also won't chdir a second time?~~ EDIT: We do need to pass the workspace, but not in -h/-J/-V context, and we still need to prevent a second chdir.

kba · 2020-10-20T09:26:54Z

> > How about this:
>
> We still need to do sth about the workspace argument. Maybe pass None so it also won't chdir a second time?

You mean "... and assign workspace to self.recognizer manually after instantiation"?

i.e.

        self.recognizer = TesserocrRecognize(workspace=None, **recognize_kwargs)
        self.recognizer.workspace = self.workspace

That would work if the second chdir bothers you.

bertsky · 2020-10-20T09:29:24Z

> That would work if the second chdir bothers you.

You still need to pass None for instantiation (workspace is not a kwarg, ~~and chdir would fail~~ [correction: 2nd chdir does no harm, as workspace.directory always comes from resolver.workspace_from_url which does a Path.resolve]).

>     self.recognizer.workspace = self.workspace

Yes, in combination with that it should work.
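For readers following along, here is a minimal, self-contained sketch of the wrapper pattern agreed on above: instantiate the delegate with `None` as workspace, then assign the wrapper's workspace afterwards. The classes `Processor`, `Recognize` and `SegmentLine` are simplified stand-ins written for this illustration (assumptions, not the real `ocrd.Processor` or `TesserocrRecognize`), and the constructor signature is reduced to the two arguments that matter here.

```python
class Processor:
    """Stand-in for ocrd.Processor (hypothetical, heavily simplified)."""

    def __init__(self, workspace, parameter=None):
        self.workspace = workspace
        self.parameter = parameter or {}


class Recognize(Processor):
    """Stand-in for TesserocrRecognize."""

    def process(self):
        # Report what would be done instead of calling Tesseract.
        return {
            "workspace": self.workspace,
            "segment_lines": self.parameter.get("overwrite_lines", False),
            "recognize_text": self.parameter.get("textequiv_level") != "none",
        }


class SegmentLine(Processor):
    """Thin wrapper that delegates all work to Recognize."""

    def __init__(self, workspace, parameter=None):
        super().__init__(workspace, parameter)
        fixed = {"overwrite_lines": True, "textequiv_level": "none"}
        # Pass None as workspace so the delegate does no workspace setup ...
        self.recognizer = Recognize(None, fixed)
        # ... then hand over the wrapper's workspace explicitly.
        self.recognizer.workspace = self.workspace

    def process(self):
        return self.recognizer.process()


result = SegmentLine("/tmp/ws").process()
print(result)
```

The wrapper stays a few lines long, while the fixed parameters live in exactly one place.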

bertsky · 2020-10-20T11:16:40Z

Trying to make the segment* CLIs delegate to recognize, I realize that we do still have a conceptual discrepancy: The latter is not capable of incremental annotation, so it does not treat the overwrite_* parameters as the former do. (Even the former were not truly incremental, they just added new segments redundantly. A true incremental behaviour would mask the existing segments.) How should we approach this?

bertsky · 2020-10-20T12:09:21Z

> Trying to make the segment* CLIs delegate to recognize, I realize that we do still have a conceptual discrepancy: The latter is not capable of incremental annotation, so it does not treat the overwrite_* parameters as the former do. (Even the former were not truly incremental, they just added new segments redundantly. A true incremental behaviour would mask the existing segments.) How should we approach this?

Here's my proposal: Since the existing overwrite_* parameters did not really do anything useful (as they merely allowed redundant segments when False), we just ignore them from now on. The new overwrite_* parameters in recognize are independent. If I add some form of incremental annotation some day, I will slightly change their meaning (still triggering segmentation, but not removing existing segments/areas anymore).

kba · 2020-10-20T12:12:13Z

> we just ignore them from now on

As far as I can grok the consequences, that seems reasonable.

bertsky · 2020-10-20T12:19:57Z

> The new overwrite_* parameters in recognize are independent. If I add some form of incremental annotation some day, I will slightly change their meaning (still triggering segmentation, but not removing existing segments/areas anymore).

Or, we might already give them different names in recognize, say detect_regions, detect_lines, detect_words (and have no overwrite_* for now). That way, once we do add incremental annotation, we can still turn it off via overwrite_regions=True etc.

Ignoring the old overwrite_* would be as simple as suppressing them in the parameter dict during delegation.
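The suppression mentioned here can be sketched in a few lines. This is only an illustration under assumptions: `translate_parameters` and `LEGACY_PARAMS` are hypothetical names, and `detect_lines` etc. are the parameter names floated in this thread, not a confirmed interface of `recognize`.

```python
# Old switches of the segment-* processors that should no longer have any effect.
LEGACY_PARAMS = ("overwrite_regions", "overwrite_lines", "overwrite_words")


def translate_parameters(legacy, level):
    """Build recognize-style parameters from a legacy segment-* parameter dict."""
    # Drop the old overwrite_* switches, keep everything else.
    params = {k: v for k, v in legacy.items() if k not in LEGACY_PARAMS}
    # Hypothetical new name from the discussion: detect_regions/lines/words.
    params["detect_" + level] = True
    # Segmentation only, no text recognition.
    params["textequiv_level"] = "none"
    return params


params = translate_parameters({"overwrite_lines": False, "padding": 4}, "lines")
print(params)
```

With this, a workflow that still passes `overwrite_lines` keeps running, but the value is silently dropped before `recognize` sees it.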

kba · 2020-10-20T12:21:50Z

> > The new overwrite_* parameters in recognize are independent. If I add some form of incremental annotation some day, I will slightly change their meaning (still triggering segmentation, but not removing existing segments/areas anymore).
>
> Or, we might already give them different names in recognize, say detect_regions, detect_lines, detect_words (and have no overwrite_* for now). That way, once we do add incremental annotation, we can still turn it off via overwrite_regions=True etc.

detect_LEVEL is certainly more intuitive than overwrite_LEVEL with a different meaning in different contexts.

bertsky · 2020-10-20T12:23:49Z

> detect_LEVEL is certainly more intuitive than overwrite_LEVEL with a different meaning in different contexts.

Agreed. Last one: detect_regions or segment_regions? (Text recognition might also be called "detection" in a sense.)

EDIT Ah, but that opens up the pit of "segment regions into lines" vs "segment regions in pages" again...

kba · 2020-10-20T12:27:42Z

I think detect_LEVEL is better than segment_LEVEL because in the former it is clear that LEVEL is what is being detected, whereas in the latter, it could mean "find LEVEL+1 in LEVEL".

bertsky · 2020-10-20T12:37:06Z

> Ignoring the old overwrite_* would be as simple as suppressing them in the parameter dict during delegation.

Since this part of the discussion was all about backwards-compatibility, I think it would be fair if we made these parameters "true-only" in the tool json (via "enum": [true]). That way, if anyone used False in their workflow, they will know they have to look at this and change their configuration.

kba · 2020-10-20T12:42:53Z

> > That would work if the second chdir bothers you.
>
> You still need to pass None for instantiation (workspace is not a kwarg, and chdir would fail).

Sure, copy/paste

> >     self.recognizer.workspace = self.workspace
>
> Yes, in combination with that it should work.

> > Ignoring the old overwrite_* would be as simple as suppressing them in the parameter dict during delegation.
>
> Since this part of the discussion was all about backwards-compatibility, I think it would be fair if we made these parameters "true-only" in the tool json (via "enum": [true]). That way, if anyone used False in their workflow, they will know they have to look at this and change their configuration.

👍 I've never used enums for anything other than strings. According to https://json-schema.org/understanding-json-schema/reference/generic.html#enumerated-values, `{"type": "boolean", "enum": [true]}` is indeed a valid construct. Nice, learned something new.

bertsky · 2020-10-20T12:51:01Z

> I've never used enums for anything other than strings. According to https://json-schema.org/understanding-json-schema/reference/generic.html#enumerated-values, `{"type": "boolean", "enum": [true]}` is indeed a valid construct. Nice, learned something new.

Yes, and it will give these hilarious tautological responses:

Invalid parameters ['[overwrite_words] False is not one of [True]']
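To show where such a message comes from, here is a toy check that mimics only the `enum` part of JSON Schema validation. It is a hand-rolled stand-in (the `validate` function and `SCHEMA` dict are assumptions made for this sketch), not the validator OCR-D actually uses, and it ignores `type` and every other keyword.

```python
def validate(parameter, schema):
    """Return a list of error strings for values outside a property's enum."""
    errors = []
    for name, value in parameter.items():
        allowed = schema.get(name, {}).get("enum")
        if allowed is not None and value not in allowed:
            errors.append(f"[{name}] {value!r} is not one of {allowed!r}")
    return errors


# "True-only" parameter: the schema accepts a boolean, but only the value true.
SCHEMA = {"overwrite_words": {"type": "boolean", "enum": [True]}}

rejected = validate({"overwrite_words": False}, SCHEMA)
accepted = validate({"overwrite_words": True}, SCHEMA)
print(rejected)
print(accepted)
```

Passing `False` yields one error entry in the same wording as the message above, while `True` yields an empty list.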

bertsky · 2020-11-06T14:27:05Z

> I just have run a new test with latest ocrd_all + latest version of this pull request and see a major regression with many lines of text missing in the final result (compare old with new).

Thanks @stweil! I forgot I need to jump across RIL_PARA when I want text lines in text blocks.


kba · 2020-11-17T10:07:53Z

> > I just have run a new test with latest ocrd_all + latest version of this pull request and see a major regression with many lines of text missing in the final result (compare old with new).
>
> Thanks @stweil! I forgot I need to jump across RIL_PARA when I want text lines in text blocks.

That was fixed in 1ac011e right?

> 1. keep image padding, but set defaults for both kinds of padding to 0px (as before)
> 2. remove image padding, start a new PR for it (and get involved in the GT measurements there)
> 3. expose a different parameter name for image padding, say image_padding, which defaults to 0px

What was the resolution here?

bertsky · 2020-11-17T10:21:55Z

> > Thanks @stweil! I forgot I need to jump across RIL_PARA when I want text lines in text blocks.
>
> That was fixed in 1ac011e right?

Yes!

> > 1. keep image padding, but set defaults for both kinds of padding to 0px (as before)
> > 2. remove image padding, start a new PR for it (and get involved in the GT measurements there)
> > 3. expose a different parameter name for image padding, say image_padding, which defaults to 0px
>
> What was the resolution here?

87e6444 implements option 1, which is okay with me, too.

We'll need some announcements and updates to the workflow guides when/after merging this, probably a dedicated call. Same for cisocrgroup/ocrd_cis#77 (which is not quite ready yet). That's why I waited for you. We have to stir everything up a bit, but let's at least minimise the rumble by doing it all at once.

kba · 2020-11-17T12:28:53Z

> We'll need some announcements and updates to the workflow guides when/after merging this, probably a dedicated call. Same for cisocrgroup/ocrd_cis#77 (which is not quite ready yet).

OK, let's discuss what needs to change documentation-wise in the call on Thursday and plan to have this, the ocrd_cis resegment PR and the dependent documentation changes ready for the open tech call next week? Is that realistic?

bertsky · 2020-11-17T12:39:56Z

> OK, let's discuss what needs to change documentation-wise in the call on Thursday and plan to have this, the ocrd_cis resegment PR and the dependent documentation changes ready for the open tech call next week? Is that realistic?

Yes, agreed!

bertsky · 2020-11-19T07:37:49Z

Hmm, sorry for adding yet another verse to the epic, but I must revisit the decision to give up the overwrite_* parameters (allowing only True from now on):

> Since the existing overwrite_* parameters did not really do anything useful (as they merely allowed redundant segments when False), we just ignore them from now on.

I forgot that there is in fact one valid use-case for these parameters being false, namely input that sometimes has sub-segments and sometimes does not (because the previous segmentation failed on that segment). For example, segment-line with overwrite_lines=False does make sense if I expect that regions sometimes contain lines already (where I don't want to add more), but sometimes do not (where I want to find some). That's not the same as the old behaviour (which would add segments regardless of existing ones), and not the same as the more desirable incremental behaviour (which would add segments only where no existing ones are located), but it is at least one reason for still allowing False even now.

Should I make the respective changes?
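The "fill in the blanks" semantics described here can be illustrated on plain dictionaries. This is a sketch under assumptions, not the processor's code: regions are modelled as dicts with a `lines` list, and `segment_lines` and `fake_detect` are hypothetical helpers standing in for the real segmentation.

```python
def segment_lines(regions, detect, overwrite_lines):
    """Add lines to regions; `detect` returns the new lines for one region."""
    for region in regions:
        if region["lines"] and not overwrite_lines:
            # Existing lines are kept untouched; nothing is added.
            continue
        # Either the region is empty or we were told to overwrite.
        region["lines"] = detect(region)
    return regions


def fake_detect(region):
    """Pretend segmentation: one new line per region."""
    return [region["id"] + "_new"]


regions = [
    {"id": "r1", "lines": ["r1_l1"]},  # already segmented upstream
    {"id": "r2", "lines": []},         # previous segmentation failed here
]
filled = segment_lines(regions, fake_detect, overwrite_lines=False)
print(filled)
```

With `overwrite_lines=False`, only `r2` receives new lines; with `True`, both regions would be re-segmented. Neither case reproduces the old redundant behaviour or a truly incremental one.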

kba · 2020-11-19T10:09:35Z

ok, if we do need overwrite_* to allow users to only "fill in the blanks" in inconsistently segmented data, then indeed please reintroduce them.

- recognize: introduce `overwrite_text` param (default still `true`). Allow merely adding TextEquiv as non-first result.
- recognize: introduce `overwrite_segments` param (default still `false`). Whenever `segmentation_level` warrants overwriting existing segments, allow keeping and descending them.
- recognize/segment: Use the same logic for existing tables as other segments: if cells exist already, unless `overwrite` is true, keep and descend them.
- segment-{region,table,line,word}: delegate old `overwrite_*` param to new `overwrite_segments` in `recognize`
- segment-table: rename param `overwrite_regions` to `overwrite_cells`

kba · 2020-12-02T09:59:26Z

🎉

bertsky added 4 commits October 19, 2020 01:40

Merge remote-tracking branch 'upstream/master' into add-word-font-attributes

52999b0

recognize: remove WordFontAttributes

672020d

bertsky requested review from kba and wrznr October 19, 2020 11:19

kba reviewed Oct 19, 2020

ocrd_tesserocr/fontshape.py

ocrd_tesserocr/recognize.py (outdated)

setup.py (outdated)

kba approved these changes Oct 19, 2020

segment: delegate all implementation to recognize

4387e94

bertsky added 2 commits November 6, 2020 15:35

recognize: skip intermediate paragraph iterator level when necessary

1ac011e

deskew: annotate image feature 'deskewed' if skew or not

39fa0c4

bertsky force-pushed the add-word-font-attributes branch from 967af65 to 39fa0c4 Compare November 6, 2020 14:35

bertsky added 5 commits November 9, 2020 22:45

deskew: delegate to core for reflection and rotation

41bfe21

crop: also use existing text regions, if any

e106a27

deskew: always extract new image, always annotate feature deskewed (even for 0°)

5a5ca46

segment/recognize: revert to old default 0px padding

87e6444

update changelog

5cbc8cb

bertsky mentioned this pull request Nov 10, 2020

Image API: refactor, re-crop and re-interprete OCR-D/core#640

Merged

bertsky mentioned this pull request Nov 20, 2020

shapely.errors.TopologicalError: The operation 'GEOSWithin_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.polygon.Polygon object at 0x7fb58a658b80> #160

Closed

polygon_for_parent: ensure path validity before checking consistency

3114123

kba linked an issue Nov 20, 2020 that may be closed by this pull request

shapely.errors.TopologicalError: The operation 'GEOSWithin_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.polygon.Polygon object at 0x7fb58a658b80> #160

Closed


bertsky added 6 commits November 23, 2020 07:43

ensure valid polygons for new coords

2d7093a

recognize: skip segments with zero height or width

3cf967f

deskew: skip segments with zero height or width

40893b5

recognize: fix RIL in terminal GetUTF8Text

c629ff0

recognize: fix Confidence() vs MeanTextConf()

c44da6b

bertsky merged commit 056d30d into OCR-D:master Dec 1, 2020

bertsky mentioned this pull request Dec 6, 2020

Polygonalize segments by shrinking to children #162

Merged
