Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fontshape processor and all-in-one segmentation #158

Merged
merged 31 commits into from Dec 1, 2020

Conversation

bertsky
Copy link
Collaborator

@bertsky bertsky commented Oct 19, 2020

We can probably remove both the old segment-region/line/word and new (all-in-one) segment altogether now that we can configure them via overwrite_* and textequiv_level in recognize. Or we keep the CLI names, but delegate to recognize @kba?

- new processor `segment`, using the `AnalyseLayout` iterator
  for all hierarchy levels at once (avoiding textline overlap
  between regions); this sidelines the existing isolated
  processors `segment-{region,line,word}` and ultimately replace
  them; postprocessing by polygonalisation of text lines and
  shrinking/clipping of text regions is still necessary though
- `recognize`, `segment-line`, `segment-word`: use fully recursive
  region iterator
- new processor `fontshape`, using pre-LSTM recognition models
  to query `WordFontAttributes` for all word bboxes and annotate
  them via `TextStyle`
- use `overwrite_regions/lines/words` to enable segmentation
- use `textequiv_level=none` to disable recognition
- use Tesseract's `AnalyseLayout` for segmentation-only, but
  `Recognize` for segmentation and recognition
- Tesseract gets image at the highest necessary hierarchy level,
  one shared iterator from there on
- `padding` means raw pixels for segmentation and artificial pixels
  for existing images
- integrate all page segmentation parameters
  (`sparse_text`, `find_tables`, `block_polygons` ...)
@bertsky bertsky requested review from kba and wrznr October 19, 2020 11:19
@codecov
Copy link

codecov bot commented Oct 19, 2020

Codecov Report

Merging #158 (c44da6b) into master (24b7ced) will increase coverage by 2.85%.
The diff coverage is 36.89%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #158      +/-   ##
==========================================
+ Coverage   37.73%   40.58%   +2.85%     
==========================================
  Files           9       11       +2     
  Lines        1023     1126     +103     
  Branches      216      236      +20     
==========================================
+ Hits          386      457      +71     
- Misses        565      585      +20     
- Partials       72       84      +12     
Impacted Files Coverage Δ
ocrd_tesserocr/deskew.py 15.00% <0.00%> (-0.47%) ⬇️
ocrd_tesserocr/crop.py 13.67% <4.34%> (+0.94%) ⬆️
ocrd_tesserocr/fontshape.py 17.64% <17.64%> (ø)
ocrd_tesserocr/segment_table.py 36.00% <17.64%> (+36.00%) ⬆️
ocrd_tesserocr/segment.py 39.13% <39.13%> (ø)
ocrd_tesserocr/recognize.py 48.29% <42.36%> (-0.46%) ⬇️
ocrd_tesserocr/segment_line.py 96.00% <94.11%> (+23.69%) ⬆️
ocrd_tesserocr/segment_word.py 96.00% <94.11%> (+23.27%) ⬆️
ocrd_tesserocr/segment_region.py 96.29% <94.73%> (+46.86%) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 24b7ced...c44da6b. Read the comment docs.

ocrd_tesserocr/recognize.py Outdated Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
Copy link
Member

@kba kba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great stuff

@kba
Copy link
Member

kba commented Oct 19, 2020

We can probably remove both the old segment-region/line/word and new (all-in-one) segment altogether now that we can configure them via overwrite_* and textequiv_level in recognize. Or we keep the CLI names, but delegate to recognize @kba?

We should keep the segment{,-{region,word,line} CLIs so we don't break existing workflows. If they can delegate to recognize with fixed textequiv_level and overwrite_LEVEL parameters, that would be best. And reduce code duplication, making recognize more complex but much more powerful.

@bertsky
Copy link
Collaborator Author

bertsky commented Oct 19, 2020

If they can delegate to recognize with fixed textequiv_level and overwrite_LEVEL parameters, that would be best.

One problem I still see with such delegation is that we would still need to repeat the tool json. And we would still have to write some thin module instantiating TesserocrRecognize and translating the call.

@kba
Copy link
Member

kba commented Oct 19, 2020

One problem I still see with such delegation is that we would still need to repeat the tool json. And we would still have to write some thin module instantiating TesserocrRecognize and translating the call.

Duplicate info in the ocrd-tool.json is okay I think, it's one place and the user doesn't normally see it. If that is too error-prone, a generator script for ocrd-tool.json might help.

thin module instantiating TesserocrRecognize and translating the call.

Yes, but you could inherit/encapsulate TesserOcrRecognize and do smth like

class TesserocrLineSegmenter(TesserocrRecognize):
  def process(self):
    self.parameter['overwrite_lines'] = True
    self.parameter['textequiv_level'] = "none"
    return super().process(self)

@bertsky
Copy link
Collaborator Author

bertsky commented Oct 19, 2020

Yes, but you could inherit/encapsulate TesserOcrRecognize and do smth like

class TesserocrLineSegmenter(TesserocrRecognize):
  def process(self):
    self.parameter['overwrite_lines'] = True
    self.parameter['textequiv_level'] = "none"
    return super().process(self)

It's not quite as easy I'm afraid. In the constructor, we must pick the derived tool from the tool.json tools (or -h and -J won't work.) So we need to parse the derived tool json and apply it against the input params, but then in process have to fill in the defaults of the original tool json. So the superclass' json must be dumped, a ParameterValidator be constructed for that, and applied against the input params a second time.

Or am I getting it all wrong?

@kba
Copy link
Member

kba commented Oct 20, 2020

How about this:

from ocrd import Processor
from .config import OCRD_TOOL

from .recognize import TesserocrRecognize
TOOL = 'ocrd-tesserocr-segment-line'

class TesserocrSegmentLine(Processor):

    def __init__(self, *args, **kwargs):
        kwargs['ocrd_tool'] = OCRD_TOOL['tools'][TOOL]
        kwargs['version'] = OCRD_TOOL['version']
        super().__init__(*args, **kwargs)

        recognize_kwargs = {**kwargs}
        recognize_kwargs.pop('show_help', None)
        recognize_kwargs.pop('show_version', None)
        recognize_kwargs['parameter'] = {}
        recognize_kwargs['parameter']['overwrite_lines'] = True
        recognize_kwargs['parameter']['textequiv_level'] = "none"
        self.recognizer = TesserocrRecognize(**recognize_kwargs)

    def process(self):
        return self.recognizer.process()

@bertsky
Copy link
Collaborator Author

bertsky commented Oct 20, 2020

How about this:

We still need to do sth about the workspace argument.

Maybe pass None so it also won't chdir a second time? EDIT We do need to pass the workspace, but not in -h/-J/-V context, and still prevent chdir twice

@kba
Copy link
Member

kba commented Oct 20, 2020

How about this:

We still need to do sth about the workspace argument. Maybe pass None so it also won't chdir a second time?

You mean "... and assign workspace to self.recognizer manually after instantiation"?

i.e.

        self.recognizer = TesserocrRecognize(workspace=None, **recognize_kwargs)
        self.recognizer.workspace = self.workspace

That would work if the second chdir bothers you.

@bertsky
Copy link
Collaborator Author

bertsky commented Oct 20, 2020

That would work if the second chdir bothers you.

You still need to pass None for instantiation (workspace is not a kwarg, and chdir would fail [correction: 2nd chdir does no harm, as workspace.directory always comes from resolver.workspace_from_url which does a Path.resolve]).

        self.recognizer.workspace = self.workspace

Yes, in combination with that it should work.

@bertsky
Copy link
Collaborator Author

bertsky commented Oct 20, 2020

Trying to make the segment* CLIs delegate to recognize, I realize that we do still have a conceptual discrepancy: The latter is not capable of incremental annotation, so it does not treat the overwrite_* parameters as the former do. (Even the former were not truly incremental, they just added new segments redundantly. A true incremental behaviour would mask the existing segments.) How should we approach this?

@bertsky
Copy link
Collaborator Author

bertsky commented Oct 20, 2020

Trying to make the segment* CLIs delegate to recognize, I realize that we do still have a conceptual discrepancy: The latter is not capable of incremental annotation, so it does not treat the overwrite_* parameters as the former do. (Even the former were not truly incremental, they just added new segments redundantly. A true incremental behaviour would mask the existing segments.) How should we approach this?

Here's my proposal: Since the existing overwrite_* parameters did not really do anything useful (as they merely allowed redundant segments when False), we just ignore them from now on. The new overwrite_* parameters in recognize are independent. If I add some form of incremental annotation some day, I will slightly change their meaning (still triggering segmentation, but not removing existing segments/areas anymore).

@kba
Copy link
Member

kba commented Oct 20, 2020

we just ignore them from now on

As far as I can grok the consequences, that seems reasonable.

@bertsky
Copy link
Collaborator Author

bertsky commented Oct 20, 2020

The new overwrite_* parameters in recognize are independent. If I add some form of incremental annotation some day, I will slightly change their meaning (still triggering segmentation, but not removing existing segments/areas anymore).

Or, we might already give them different names in recognize, say detect_regions, detect_lines, detect_words (and have no overwrite_* for now). That way, once we do add incremental annotation, we can still turn it off via overwrite_regions=True etc.

Ignoring the old overwrite_* would be as simple as suppressing them in the parameter dict during delegation.

@kba
Copy link
Member

kba commented Oct 20, 2020

The new overwrite_* parameters in recognize are independent. If I add some form of incremental annotation some day, I will slightly change their meaning (still triggering segmentation, but not removing existing segments/areas anymore).

Or, we might already give them different names in recognize, say detect_regions, detect_lines, detect_words (and have no overwrite_* for now). That way, once we do add incremental annotation, we can still turn it off via overwrite_regions=True etc.

detect_LEVEL is certainly more intuitive than overwrite_LEVEL with a different meaning in different contexts.

@bertsky
Copy link
Collaborator Author

bertsky commented Oct 20, 2020

detect_LEVEL is certainly more intuitive than overwrite_LEVEL with a different meaning in different contexts.

Agreed. Last one: detect_regions or segment_regions? (Text recognition might also be called "detection" in a sense.)

EDIT Ah, but that opens up the pit of "segment regions into lines" vs "segment regions in pages" again...

@kba
Copy link
Member

kba commented Oct 20, 2020

I think detect_LEVEL is better than segment_LEVEL because in the former it is clear that LEVEL is what is being detected, whereas in the latter, it could mean "find LEVEL+1 in LEVEL".

@bertsky
Copy link
Collaborator Author

bertsky commented Oct 20, 2020

Ignoring the old overwrite_* would be as simple as suppressing them in the parameter dict during delegation.

Since this part of the discussion was all about backwards-compatibility, I think it would be fair if we made these parameters "true-only" in the tool json (via "enum": [true]). That way, if anyone used False in their workflow, they will know they have to look at this and change their configuration.

@kba
Copy link
Member

kba commented Oct 20, 2020

That would work if the second chdir bothers you.

You still need to pass None for instantiation (workspace is not a kwarg, and chdir would fail).

Sure, copy/paste

        self.recognizer.workspace = self.workspace

Yes, in combination with that it should work.

Ignoring the old overwrite_* would be as simple as suppressing them in the parameter dict during delegation.

Since this part of the discussion was all about backwards-compatibility, I think it would be fair if we made these parameters "true-only" in the tool json (via "enum": [true]). That way, if anyone used False in their workflow, they will know they have to look at this and change their configuration.

👍 I've never used enums for anything other than strings acoording to https://json-schema.org/understanding-json-schema/reference/generic.html#enumerated-values {"type": boolean, "enum": [true]} is indeed a valid construct. Nice, learned something new.

@bertsky
Copy link
Collaborator Author

bertsky commented Oct 20, 2020

I've never used enums for anything other than strings acoording to https://json-schema.org/understanding-json-schema/reference/generic.html#enumerated-values {"type": boolean, "enum": [true]} is indeed a valid construct. Nice, learned something new.

Yes, and it will give these hilarious tautological responses:

Invalid parameters ['[overwrite_words] False is not one of [True]']

@bertsky
Copy link
Collaborator Author

bertsky commented Nov 6, 2020

I just have run a new test with latest ocrd_all + latest version of this pull request and see a major regression with many lines of text missing in the final result (compare old with new).

Thanks @stweil! I forgot I need to jump across RIL_PARA when I want text lines in text blocks.

@kba
Copy link
Member

kba commented Nov 17, 2020

I just have run a new test with latest ocrd_all + latest version of this pull request and see a major regression with many lines of text missing in the final result (compare old with new).

Thanks @stweil! I forgot I need to jump across RIL_PARA when I want text lines in text blocks.

That was fixed in 1ac011e right?

  1. keep image padding, but set defaults for both kinds of padding to 0px (as before)
  2. remove image padding, start a new PR for it (and get involved in the GT measurements there)
  3. expose a different parameter name for image padding, say image_padding, which defaults to 0px

What was the resolution here?

@bertsky
Copy link
Collaborator Author

bertsky commented Nov 17, 2020

Thanks @stweil! I forgot I need to jump across RIL_PARA when I want text lines in text blocks.

That was fixed in 1ac011e right?

Yes!

  1. keep image padding, but set defaults for both kinds of padding to 0px (as before)
  2. remove image padding, start a new PR for it (and get involved in the GT measurements there)
  3. expose a different parameter name for image padding, say image_padding, which defaults to 0px

What was the resolution here?

87e6444 implements option 1, which is okay with me, too.

We'll need some announcements and updates to the workflow guides when/after merging this, probably a dedicated call. Same for cisocrgroup/ocrd_cis#77 (which is not quite ready yet). That's why I waited for you. We have to stir everything up a bit, but let's at least minimise the rumble by doing it all at once.

@kba
Copy link
Member

kba commented Nov 17, 2020

We'll need some announcements and updates to the workflow guides when/after merging this, probably a dedicated call. Same for cisocrgroup/ocrd_cis#77 (which is not quite ready yet).

OK, let's discuss what needs to change documentation-wise in the call thursday and plan to have this, the ocrd_cis resegment PR and depending documentation changs ready for the open tech call next week? Is that realistic?

@bertsky
Copy link
Collaborator Author

bertsky commented Nov 17, 2020

OK, let's discuss what needs to change documentation-wise in the call thursday and plan to have this, the ocrd_cis resegment PR and depending documentation changs ready for the open tech call next week? Is that realistic?

Yes, agreed!

@bertsky
Copy link
Collaborator Author

bertsky commented Nov 19, 2020

Hmm, sorry for adding yet another verse to the epic, but I must revisit the decision to give up the overwrite_* parameters (allowing only True from now on):

Since the existing overwrite_* parameters did not really do anything useful (as they merely allowed redundant segments when False), we just ignore them from now on.

I forgot that there is in fact one valid use-case for these parameters being false, which is input which sometimes has sub-segments, but sometimes does not (because the previous segmentation failed on that segment). For example, segment-line with overwrite_lines=False does make sense if I expect that regions sometimes contain lines already (where I don't want to add more), but sometimes do not (where I want to find some). That's not the same as the old behaviour (which would add segments regardless of existing ones), and not the same as the more desirable incremental behaviour (which would add segments only where no existing ones are located), but at least one reason for still allowing False even now.

Should I make the respective changes?

@kba
Copy link
Member

kba commented Nov 19, 2020

ok, if we do need overwrite_* to allow users to only "fill in the blanks" in inconsistently segmented data, then indeed please reintroduce them.

- recognize: introduce `overwrite_text` param (default still `true`)
  Allow merely adding TextEquiv as non-first result.
- recognize: introduce `overwrite_segments` param (default still `false`)
  Whenever `segmentation_level` warrants overwriting existing segments,
  allow keeping and descending them.
- recognize/segment: Use the same logic for existing tables as other segments:
  if cells exist already, unless `overwrite` is true, keep and descend them.
- segment-{region,table,line,word}: delegate old `overwrite_*` param
  to new `overwrite_segment` in `recognize`
- segment-table: rename param `overwrite_regions` to `overwrite_cells`
@bertsky bertsky merged commit 056d30d into OCR-D:master Dec 1, 2020
@kba
Copy link
Member

kba commented Dec 2, 2020

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants