-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add fontshape processor and all-in-one segmentation #158
Conversation
- new processor `segment`, using the `AnalyseLayout` iterator for all hierarchy levels at once (avoiding textline overlap between regions); this sidelines the existing isolated processors `segment-{region,line,word}` and ultimately replace them; postprocessing by polygonalisation of text lines and shrinking/clipping of text regions is still necessary though - `recognize`, `segment-line`, `segment-word`: use fully recursive region iterator - new processor `fontshape`, using pre-LSTM recognition models to query `WordFontAttributes` for all word bboxes and annotate them via `TextStyle`
- use `overwrite_regions/lines/words` to enable segmentation - use `textequiv_level=none` to disable recognition - use Tesseract's `AnalyseLayout` for segmentation-only, but `Recognize` for segmentation and recognition - Tesseract gets image at the highest necessary hierarchy level, one shared iterator from there on - `padding` means raw pixels for segmentation and artificial pixels for existing images - integrate all page segmentation parameters (`sparse_text`, `find_tables`, `block_polygons` ...)
Codecov Report
@@ Coverage Diff @@
## master #158 +/- ##
==========================================
+ Coverage 37.73% 40.58% +2.85%
==========================================
Files 9 11 +2
Lines 1023 1126 +103
Branches 216 236 +20
==========================================
+ Hits 386 457 +71
- Misses 565 585 +20
- Partials 72 84 +12
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff
We should keep the |
One problem I still see with such delegation is that we would still need to repeat the tool json. And we would still have to write some thin module instantiating |
Duplicate info in the ocrd-tool.json is okay I think, it's one place and the user doesn't normally see it. If that is too error-prone, a generator script for ocrd-tool.json might help.
Yes, but you could inherit/encapsulate class TesserocrLineSegmenter(TesserocrRecognize):
def process(self):
self.parameter['overwrite_lines'] = True
self.parameter['textequiv_level'] = "none"
return super().process(self) |
It's not quite as easy I'm afraid. In the constructor, we must pick the derived tool from the tool.json Or am I getting it all wrong? |
How about this: from ocrd import Processor
from .config import OCRD_TOOL
from .recognize import TesserocrRecognize
TOOL = 'ocrd-tesserocr-segment-line'
class TesserocrSegmentLine(Processor):
def __init__(self, *args, **kwargs):
kwargs['ocrd_tool'] = OCRD_TOOL['tools'][TOOL]
kwargs['version'] = OCRD_TOOL['version']
super().__init__(*args, **kwargs)
recognize_kwargs = {**kwargs}
recognize_kwargs.pop('show_help', None)
recognize_kwargs.pop('show_version', None)
recognize_kwargs['parameter'] = {}
recognize_kwargs['parameter']['overwrite_lines'] = True
recognize_kwargs['parameter']['textequiv_level'] = "none"
self.recognizer = TesserocrRecognize(**recognize_kwargs)
def process(self):
return self.recognizer.process() |
We still need to do sth about the workspace argument.
|
You mean "... and assign i.e. self.recognizer = TesserocrRecognize(workspace=None, **recognize_kwargs)
self.recognizer.workspace = self.workspace That would work if the second chdir bothers you. |
You still need to pass
Yes, in combination with that it should work. |
Trying to make the |
Here's my proposal: Since the existing |
As far as I can grok the consequences, that seems reasonable. |
Or, we might already give them different names in Ignoring the old |
|
Agreed. Last one: EDIT Ah, but that opens up the pit of "segment regions into lines" vs "segment regions in pages" again... |
I think |
Since this part of the discussion was all about backwards-compatibility, I think it would be fair if we made these parameters "true-only" in the tool json (via |
Sure, copy/paste
👍 I've never used enums for anything other than strings acoording to https://json-schema.org/understanding-json-schema/reference/generic.html#enumerated-values |
Yes, and it will give these hilarious tautological responses:
|
967af65
to
39fa0c4
Compare
That was fixed in 1ac011e right?
What was the resolution here? |
Yes!
87e6444 implements option 1, which is okay with me, too. We'll need some announcements and updates to the workflow guides when/after merging this, probably a dedicated call. Same for cisocrgroup/ocrd_cis#77 (which is not quite ready yet). That's why I waited for you. We have to stir everything up a bit, but let's at least minimise the rumble by doing it all at once. |
OK, let's discuss what needs to change documentation-wise in the call thursday and plan to have this, the ocrd_cis resegment PR and depending documentation changs ready for the open tech call next week? Is that realistic? |
Yes, agreed! |
Hmm, sorry for adding yet another verse to the epic, but I must revisit the decision to give up the
I forgot that there is in fact one valid use-case for these parameters being false, which is input which sometimes has sub-segments, but sometimes does not (because the previous segmentation failed on that segment). For example, segment-line with Should I make the respective changes? |
ok, if we do need |
- recognize: introduce `overwrite_text` param (default still `true`) Allow merely adding TextEquiv as non-first result. - recognize: introduce `overwrite_segments` param (default still `false`) Whenever `segmentation_level` warrants overwriting existing segments, allow keeping and descending them. - recognize/segment: Use the same logic for existing tables as other segments: if cells exist already, unless `overwrite` is true, keep and descend them. - segment-{region,table,line,word}: delegate old `overwrite_*` param to new `overwrite_segment` in `recognize` - segment-table: rename param `overwrite_regions` to `overwrite_cells`
🎉 |
We can probably remove both the old
segment-region/line/word
and new (all-in-one)segment
altogether now that we can configure them viaoverwrite_*
andtextequiv_level
inrecognize
. Or we keep the CLI names, but delegate torecognize
@kba?