move to AlternativeImage feature selectors in OCR-D/core#294: #75

Merged

Conversation

@bertsky (Collaborator) commented Aug 28, 2019

  • all: use second output position as fileGrp USE to produce
    AlternativeImage
  • all: rid of MetadataItem/Labels-related FIXME:
    with the updated PAGE model, we can now use
    @externalModel and @externalId
  • all: use OcrdExif.resolution instead of xResolution
  • all: create images with monotonically growing @comments
    (features)
  • crop: use ocrd_utils.crop_image instead of PIL.Image.crop
  • crop: fix bug when trying to access page_image if there
    are already region coordinates that we are ignoring
  • crop: filter images already deskewed and cropped!
    (we must crop ourselves, and deskewing can not happen
    until afterwards)
  • deskew: fix bugs in configuration-dependent corner cases
    related to whether deskewing has already been applied
    (on the page or region level):
    • for the page image, never use images already rotated
      (both for page level and region level processing,
      but for the region level, do rotate images ad hoc
      if @orientation is present on the page level)
    • for the region image, never use images already rotated
      (except for our own page-level rotation)
  • segment-region: forgot to add feature "cropped" when
    producing cropped images
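The "monotonically growing @comments (features)" convention above can be illustrated with a small sketch: an AlternativeImage is picked by matching its comma-separated feature list against a selector (features that must be present) and a filter (features that must be absent). This is a hypothetical re-implementation for illustration, not OCR-D core's actual API; the function name and data layout are invented here.

```python
def pick_alternative_image(alternatives, feature_selector="", feature_filter=""):
    """Return the last AlternativeImage whose @comments contain all
    features in `feature_selector` and none in `feature_filter`.

    `alternatives` is a list of (filename, comments) tuples, where
    `comments` is the comma-separated feature list, e.g.
    "binarized,cropped,deskewed". Later entries supersede earlier
    ones, mirroring the monotonically growing @comments convention.
    """
    required = {f for f in feature_selector.split(",") if f}
    forbidden = {f for f in feature_filter.split(",") if f}
    result = None
    for filename, comments in alternatives:
        features = {f for f in comments.split(",") if f}
        if required <= features and not (forbidden & features):
            result = filename  # keep the most derived matching image
    return result
```

For example, selecting "binarized" while filtering "cropped" from `[("raw.png", ""), ("bin.png", "binarized"), ("crop.png", "binarized,cropped")]` yields `"bin.png"`.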

@bertsky bertsky requested review from kba and wrznr August 28, 2019 17:29
@bertsky bertsky added bug Something isn't working enhancement New feature or request labels Aug 28, 2019
@bertsky (Collaborator, Author) commented Aug 28, 2019

Also fixes #61.

@codecov bot commented Aug 28, 2019

Codecov Report

Merging #75 into master will decrease coverage by 0.82%.
The diff coverage is 22.72%.


@@            Coverage Diff             @@
##           master      #75      +/-   ##
==========================================
- Coverage   47.81%   46.99%   -0.83%     
==========================================
  Files           8        8              
  Lines         688      715      +27     
  Branches      130      134       +4     
==========================================
+ Hits          329      336       +7     
- Misses        326      346      +20     
  Partials       33       33
| Impacted Files | Coverage | Δ |
| --- | --- | --- |
| ocrd_tesserocr/binarize.py | 22.05% <0%> | (-0.67%) ⬇️ |
| ocrd_tesserocr/deskew.py | 15.84% <3.44%> | (-1.94%) ⬇️ |
| ocrd_tesserocr/crop.py | 13.76% <4%> | (-1.78%) ⬇️ |
| ocrd_tesserocr/segment_word.py | 81.13% <66.66%> | (ø) ⬆️ |
| ocrd_tesserocr/recognize.py | 53.5% <66.66%> | (+1%) ⬆️ |
| ocrd_tesserocr/segment_line.py | 80.39% <66.66%> | (ø) ⬆️ |
| ocrd_tesserocr/segment_region.py | 73.22% <75%> | (+0.11%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9e5407d...4176747.

@wrznr (Contributor) left a comment

Only the typo fixes are must-haves. Not sure about the multiple file group logistics.

Technically, the changes proposed here carry over to OCRopus and anybaseocr. Does it make sense to add abstract wrappers to core for the single processing steps (i.e. ProcessorCrop) from which the module project implementations could derive?

Review comments were left on CHANGELOG.md, ocrd_tesserocr/binarize.py, and ocrd_tesserocr/crop.py (since resolved).
@bertsky (Collaborator, Author) commented Sep 6, 2019

Technically, the changes proposed here carry over to OCRopus and anybaseocr. Does it make sense to add abstract wrappers to core for the single processing steps (i.e. ProcessorCrop) from which the module project implementations could derive?

Good idea. This would prevent making the same errors elsewhere, and avoid copying code. But it would probably be difficult to encapsulate the various fixed parts of the process method, with custom code scattered across different conditionals and loop bodies.

@@ -16,7 +16,7 @@
 setup(
     name='ocrd_tesserocr',
-    version='0.4.0',
+    version='0.4.1',
@kba (Member) commented:
Better to do that in explicit version bumping commits.

@bertsky (Collaborator, Author) replied Sep 16, 2019:
Understood. @kba, should I revert this in the PR? Do you want to do the merge and the versioning commit along with the new release yourself?

@kba (Member) commented Sep 6, 2019

Good idea. This would prevent making the same errors elsewhere, and avoid copying code. But it would probably be difficult to encapsulate the various fixed parts of the process method, with custom code scattered across different conditionals and loop bodies.

Yeah, that would be tricky without changing the API and breaking existing code. It's a neat idea, though, and I wish we had a wrapper around the process method that could accommodate specialization and do startup/cleanup work, but we haven't.

(e.g. requiring processors to implement a protected _process method that the base class' process method would call. But it's too late for that now)
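The pattern kba describes, a fixed public process method in the base class calling a protected _process hook, is the classic template method. A hypothetical sketch (class and method names such as ProcessorCrop are illustrative, not OCR-D core's actual classes):

```python
class Processor:
    """Hypothetical base class: `process` is the fixed entry point
    doing shared startup/cleanup work; subclasses implement only the
    protected `_process` hook."""

    def process(self):
        self._setup()            # shared startup (e.g. resolve input fileGrp)
        try:
            return self._process()
        finally:
            self._cleanup()      # shared cleanup (e.g. write METS back)

    def _setup(self):
        self.log = []

    def _cleanup(self):
        self.log.append("done")

    def _process(self):
        raise NotImplementedError


class ProcessorCrop(Processor):
    """An abstract processing step; module projects would derive from
    this and override only the step-specific part."""

    def _process(self):
        self.log.append("cropped")
        return "cropped"
```

The fixed parts live once in the base class; the custom code scattered across conditionals and loop bodies is what makes carving out such hooks difficult in practice, as noted above.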

This reverts commit 55d1d87.

(Current Tesseract LSTM models all expect to see binarized images,
 because they were trained on such data. This may, however, change
 in the future.)
@bertsky (Collaborator, Author) commented Sep 21, 2019

The last 2 commits explained:

I noticed that Tesseract internally uses 8-bit grayscale as feed for the LSTM models instead of its own Otsu binarization.

So I gathered its binarization is only needed for the non-LSTM models and layout analysis, and therefore added feature_filter='binarized' (i.e. requested raw images from the API). But then it turned out that existing models (in fact, existing training procedures) also feed binarized data. So the network would be quite perplexed to see grayscale input at runtime. This is also what I could validate by measurements on GT (tentatively): CER shoots up when using raw images for SetImage. And it shoots up even more when cropped/rotated/masked areas do not get filled with background colour but with white or transparent (since transparency is reduced to white internally). So the network is more confused about white background re-appearing in a grayscale image than white disappearing completely at runtime.

Anyway, the revert is necessary to meet the expectations of current models, but the original commit could be re-activated once we have a different training procedure!

@bertsky (Collaborator, Author) commented Sep 23, 2019

Here are my CER measurements (in percent) on 2 GT bags with textual annotation, in a workflow configuration similar to this (i.e. including Olena Wolf or Ocropy nlbin-nrm binarization, Ocropy page-level deskewing, clipping and resegmentation, and dewarping).

| Tesseract model | source | trained on |
| --- | --- | --- |
| Fraktur | tessdata_best (Google) | binarized artificial images |
| GT4HistOCR | tesstrain (UB Mannheim) | greyscale-normalized scanned images |

On euler_rechenkunst01_1738 (a bag with JPEG artifacts and lower resolution):

| input | Fraktur | GT4HistOCR |
| --- | --- | --- |
| binarized (Wolf) | 9.7 | 11.9 |
| raw with transparent/white background | 19.7 | 9.9 |
| raw with estimated background color | 12.9 | 61.3 |
| greyscale-normalized (ocropy-nlbin) | 9.4 | 9.8 |

Results are similar in tendency for weigel_gnothi02_1618 (which is cleaner):

| input | Fraktur | GT4HistOCR |
| --- | --- | --- |
| binarized (Wolf) | 14.0 | 7.3 |
| raw with transparent/white background | 36.5 | 8.3 |
| raw with estimated background color | 13.8 | 58.6 |
| greyscale-normalized (ocropy-nlbin) | 13.5 | 7.0 |

So Tesseract gets perplexed …

  • on a standard model (trained binarized):
    • … not at all by greyscale normalized line images,
    • … somewhat by median background in unnormalized greyscale line images,
    • … extremely by white background in unnormalized greyscale line images,
  • on GT4HistOCR model (trained grayscale_normalized):
    • … somewhat by binarized line images,
    • … a little by white background in unnormalized greyscale line images,
    • … extremely by median background in unnormalized greyscale line images

@wrznr (Contributor) commented Sep 24, 2019

It is striking that both the stock and the GT4HistOCR models perform so poorly! A CER between 7 and 9% is a) simply not good enough and b) far short of the numbers @stweil reported at the OCR-D developer workshop.

@stweil (Contributor) commented Sep 24, 2019

@wrznr, CER often needs a closer look, especially as to what kinds of errors occurred (maybe missing normalization?).

@bertsky, how can I reproduce your setting?

@bertsky (Collaborator, Author) commented Sep 24, 2019

It is striking that both, the stock and the GT4HistOCR model, perform so poorly! CER between 7 and 9 % is a) simply not good enough b) way below the numbers @stweil reported at OCR-D developer workshop.

True! That was also one of my messages at the workshop. Generally, I am quite certain this is due to the relatively bad quality of our GT:

  • very imprecise segmentation (often too coarse with much overlap, sometimes too tight with cutoff) and
  • suboptimal image quality (often noisy/skewed/warped, sometimes compression artifacts).

Despite all the preprocessing and resegmentation efforts, we are not able to squeeze less than 11% CER out of the whole dataset with the stock models. And I don't believe you would get much better results if you trained/finetuned a dedicated OCR model on our GT. But maybe @stweil wants to disprove that?

GT4HistOCR also looks much cleaner. If really all they did for preprocessing was running ocropus-nlbin and ocropus-gpageseg, then their raw data already were a lot easier to begin with. (And as already argued, this might be simply the result of filtering out the lesser-quality lines.)

So, I think the situation demands:

  • better despeckling: At the moment, we only have ocrd-cis-ocropy-denoise (which wraps ocrolib.remove_noise for black noise, so maybe we at least should make it work for white noise as well: @Doreenruirui, I think I might just do that). But really it's time we finally got hold of some data-driven despeckling: @n00blet @mjenckel, can you offer something?

  • better deskewing: We currently rely on ocrd-cis-ocropy-deskew and ocrd-tesserocr-deskew, both of which improve overall results a little, but often enough fail miserably for no good (visually convincing) reason. Again, it's time we finally got some data-driven deskewing: I was struggling to get ocrorot running, but I am sure there must be more tools out there. @n00blet @mjenckel, can you offer something?

  • even better binarization

  • semi-interactively improving GT segmentation (aided by our existing clipping/resegmentation/repair tools)
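On the despeckling item: the kind of black-noise removal that ocrolib.remove_noise performs can be approximated by dropping connected components below a size threshold. A minimal pure-Python sketch under that assumption (not the actual ocrolib or ocrd-cis-ocropy-denoise implementation):

```python
def despeckle(image, max_size=2):
    """Remove black (1) connected components of at most `max_size`
    pixels from a binary image given as a list of lists of 0/1.
    Uses 4-connectivity; returns a new image."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if image[y][x] and not seen[y][x]:
                # flood-fill the component starting at (y, x)
                stack, component = [(y, x)], []
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    component.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                if len(component) <= max_size:
                    for cy, cx in component:
                        out[cy][cx] = 0  # erase the speckle
    return out
```

White-noise removal, as suggested above, would be the dual operation: filter small white components inside black regions instead.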

@bertsky (Collaborator, Author) commented Sep 24, 2019

Example from our preprocessing/resegmentation pipeline: cleanly separated from neighbours, but still quite noisy

[image: OCR-D-IMG-DEWARP_0001_r_3_1_tl_10]

Example from GT4HistOCR: very clean, but skewed/warped

[image: 00008 nrm]

@wrznr (Contributor) commented Sep 24, 2019

@bertsky @stweil So our impression

And as already argued, this might be simply the result of filtering out the lesser-quality lines.

that GT4HistOCR is suboptimal for OCR training is real.

And I don't believe you would get much better results if you trained/finetuned a dedicated OCR model on our GT.

I am with you. But we could try adding noise to GT4HistOCR. Where could we further discuss this issue? (This PR is clearly not the right place.) A dedicated issue in tesstrain?

@bertsky (Collaborator, Author) commented Sep 24, 2019

@stweil The above link should work with the current versions. But it's probably best to use the PRs I have used: OCR-D/core#311 and #75 and cisocrgroup/ocrd_cis#16 and master cor-asv-ann (for evaluation).

CER measurement in cor-asv-ann-evaluate works as documented: vanilla Levenshtein, no normalization. This is best for comparability; results might look better (and fairer) with different metrics and normalizations. The package offers some, but results look quite similar with other settings, even NFC.
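The vanilla Levenshtein CER mentioned here can be reproduced with a standard dynamic-programming edit distance, a sketch for illustration (cor-asv-ann-evaluate's actual implementation may differ):

```python
def cer(gt, ocr):
    """Character error rate: Levenshtein distance between ground
    truth and OCR output, normalized by ground-truth length."""
    m, n = len(gt), len(ocr)
    # prev[j] holds the edit distance between gt[:i-1] and ocr[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if gt[i - 1] == ocr[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m if m else 0.0
```

For example, `cer("kitten", "sitting")` is 3/6 = 0.5; no Unicode normalization is applied, matching the "no normalization" setting above.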

@bertsky (Collaborator, Author) commented Sep 24, 2019

But we could try to noise GT4HistOCR. Where could we further discuss this issue? (This PR is clearly not the right place.) A dedicated issue in tesstrain?

I agree, both degradation and binarization should be employed to make GT4HistOCR models robust.

@wrznr (Contributor) commented Sep 26, 2019

@bertsky Is there a PR in core which blocks merging here?

@bertsky (Collaborator, Author) commented Sep 26, 2019

Where could we further discuss this issue? (This PR is clearly not the right place.) A dedicated issue in tesstrain?

@stweil Is tesseract-ocr/tesstrain#73 the right place for this? Or better open a new issue strictly about binarization/augmentation (not specific to GT4HistOCR)?

BTW, according to my measurements, Fraktur performs similarly on GT4HistOCR data, although not quite as badly as on our GT. But of course, it really depends on the subcorpus/century:

  • dta19 and RefCorpus-ENHG-Incunabula and Kallimachos: 7.0% CER
  • dta19 only: 4.9% CER

Since our GT is more diverse / has a smaller 19th-century share, worse results are to be expected (independent of bad segmentation / preprocessing).

@bertsky (Collaborator, Author) commented Sep 26, 2019

Is there PR in core which blocks merging here?

No, not really. OCR-D/core#311 is related, but not a dependency. Thanks, I will merge this for now.

@bertsky bertsky merged commit 02d5c2b into OCR-D:master Sep 26, 2019
@bertsky bertsky deleted the use-core-alternativeimage-comments-selectors branch October 2, 2020 19:44