move to AlternativeImage feature selectors in OCR-D/core#294: #75

Merged

Conversation

@bertsky (Collaborator) commented Aug 28, 2019

  • all: use second output position as fileGrp USE to produce
    AlternativeImage
  • all: rid of MetadataItem/Labels-related FIXME:
    with the updated PAGE model, we can now use
    @externalModel and @externalId
  • all: use OcrdExif.resolution instead of xResolution
  • all: create images with monotonically growing @comments
    (features)
  • crop: use ocrd_utils.crop_image instead of PIL.Image.crop
  • crop: fix bug when trying to access page_image if there
    are already region coordinates that we are ignoring
  • crop: filter images already deskewed and cropped!
    (we must crop ourselves, and deskewing can not happen
    until afterwards)
  • deskew: fix bugs in configuration-dependent corner cases
    related to whether deskewing has already been applied
    (on the page or region level):
    • for the page image, never use images already rotated
      (both for page level and region level processing,
      but for the region level, do rotate images ad hoc
      if @orientation is present on the page level)
    • for the region image, never use images already rotated
      (except for our own page-level rotation)
  • segment-region: forgot to add feature "cropped" when
    producing cropped images
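The "monotonically growing @comments (features)" convention above can be illustrated with a small sketch: an AlternativeImage is picked by matching its comma-separated feature list against a selector (features that must be present) and a filter (features that must be absent). This is a hypothetical re-implementation for illustration, not OCR-D core's actual API; the function name and data layout are invented here.

```python
def pick_alternative_image(alternatives, feature_selector="", feature_filter=""):
    """Return the last AlternativeImage whose @comments contain all
    features in `feature_selector` and none in `feature_filter`.

    `alternatives` is a list of (filename, comments) tuples, where
    `comments` is the comma-separated feature list, e.g.
    "binarized,cropped,deskewed". Later entries supersede earlier
    ones, mirroring the monotonically growing @comments convention.
    """
    required = {f for f in feature_selector.split(",") if f}
    forbidden = {f for f in feature_filter.split(",") if f}
    result = None
    for filename, comments in alternatives:
        features = {f for f in comments.split(",") if f}
        if required <= features and not (forbidden & features):
            result = filename  # keep the most derived matching image
    return result
```

For example, selecting "binarized" while filtering "cropped" from `[("raw.png", ""), ("bin.png", "binarized"), ("crop.png", "binarized,cropped")]` yields `"bin.png"`.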

@bertsky bertsky requested review from kba and wrznr August 28, 2019 17:29
@bertsky bertsky added bug Something isn't working enhancement New feature or request labels Aug 28, 2019
@bertsky (Collaborator, Author) commented Aug 28, 2019

Also fixes #61.

@codecov bot commented Aug 28, 2019

Codecov Report

Merging #75 into master will decrease coverage by 0.82%.
The diff coverage is 22.72%.


@@            Coverage Diff             @@
##           master      #75      +/-   ##
==========================================
- Coverage   47.81%   46.99%   -0.83%     
==========================================
  Files           8        8              
  Lines         688      715      +27     
  Branches      130      134       +4     
==========================================
+ Hits          329      336       +7     
- Misses        326      346      +20     
  Partials       33       33
| Impacted Files | Coverage | Δ |
| --- | --- | --- |
| ocrd_tesserocr/binarize.py | 22.05% <0%> | (-0.67%) ⬇️ |
| ocrd_tesserocr/deskew.py | 15.84% <3.44%> | (-1.94%) ⬇️ |
| ocrd_tesserocr/crop.py | 13.76% <4%> | (-1.78%) ⬇️ |
| ocrd_tesserocr/segment_word.py | 81.13% <66.66%> | (ø) ⬆️ |
| ocrd_tesserocr/recognize.py | 53.5% <66.66%> | (+1%) ⬆️ |
| ocrd_tesserocr/segment_line.py | 80.39% <66.66%> | (ø) ⬆️ |
| ocrd_tesserocr/segment_region.py | 73.22% <75%> | (+0.11%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9e5407d...4176747.

@wrznr (Contributor) left a comment

Only the typo fixes are must-haves. Not sure about the multiple file group logistics.

Technically, the changes proposed here carry over to OCRopus and anybaseocr. Does it make sense to add abstract wrappers to core for the single processing steps (i.e. ProcessorCrop) from which the module project implementations could derive?

Review comments were left on CHANGELOG.md, ocrd_tesserocr/binarize.py, and ocrd_tesserocr/crop.py (since resolved).
@bertsky (Collaborator, Author) commented Sep 6, 2019

Technically, the changes proposed here carry over to OCRopus and anybaseocr. Does it make sense to add abstract wrappers to core for the single processing steps (i.e. ProcessorCrop) from which the module project implementations could derive?

Good idea. This would prevent making the same errors elsewhere, and avoid copying code. But it would probably be difficult to encapsulate the various fixed parts of the process method, with custom code scattered across different conditionals and loop bodies.

@@ -16,7 +16,7 @@
 setup(
     name='ocrd_tesserocr',
-    version='0.4.0',
+    version='0.4.1',
@kba (Member) commented:
Better to do that in explicit version bumping commits.

@bertsky (Collaborator, Author) replied Sep 16, 2019:
Understood. @kba, should I revert this in the PR? Do you want to do the merge and the versioning commit along with the new release yourself?

@kba (Member) commented Sep 6, 2019

Good idea. This would prevent making the same errors elsewhere, and avoid copying code. But it would probably be difficult to encapsulate the various fixed parts of the process method, with custom code scattered across different conditionals and loop bodies.

Yeah, that would be tricky without changing the API and breaking existing code. It's a neat idea, though, and I wish we had a wrapper around the process method that could accommodate specialization and do startup/cleanup work, but we haven't.

(e.g. requiring processors to implement a protected _process method that the base class' process method would call. But it's too late for that now)
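The pattern kba describes, a fixed public process method in the base class calling a protected _process hook, is the classic template method. A hypothetical sketch (class and method names such as ProcessorCrop are illustrative, not OCR-D core's actual classes):

```python
class Processor:
    """Hypothetical base class: `process` is the fixed entry point
    doing shared startup/cleanup work; subclasses implement only the
    protected `_process` hook."""

    def process(self):
        self._setup()            # shared startup (e.g. resolve input fileGrp)
        try:
            return self._process()
        finally:
            self._cleanup()      # shared cleanup (e.g. write METS back)

    def _setup(self):
        self.log = []

    def _cleanup(self):
        self.log.append("done")

    def _process(self):
        raise NotImplementedError


class ProcessorCrop(Processor):
    """An abstract processing step; module projects would derive from
    this and override only the step-specific part."""

    def _process(self):
        self.log.append("cropped")
        return "cropped"
```

The fixed parts live once in the base class; the custom code scattered across conditionals and loop bodies is what makes carving out such hooks difficult in practice, as noted above.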

This reverts commit 55d1d87.

(Current Tesseract LSTM models all expect to see binarized images,
 because they were trained on such data. This may, however, change
 in the future.)
@bertsky (Collaborator, Author) commented Sep 21, 2019

The last 2 commits explained:

I noticed that Tesseract internally uses 8-bit grayscale as feed for the LSTM models instead of its own Otsu binarization.

So I gathered its binarization is only needed for the non-LSTM models and layout analysis, and therefore added feature_filter='binarized' (i.e. requested raw images from the API). But then it turned out that existing models (in fact, existing training procedures) also feed binarized data. So the network would be quite perplexed to see grayscale input at runtime. This is also what I could validate by measurements on GT (tentatively): CER shoots up when using raw images for SetImage. And it shoots up even more when cropped/rotated/masked areas do not get filled with background colour but with white or transparent (since transparency is reduced to white internally). So the network is more confused about white background re-appearing in a grayscale image than white disappearing completely at runtime.

Anyway, the revert is necessary to meet the expectations of current models, but the original commit could be re-activated once we have a different training procedure!

@bertsky (Collaborator, Author) commented Sep 23, 2019

Here are my CER measurements (in percent) on 2 GT bags with textual annotation, in a workflow configuration similar to this (i.e. including Olena Wolf or Ocropy nlbin-nrm binarization, Ocropy page-level deskewing, clipping and resegmentation, and dewarping).

| Tesseract model | source | trained on |
| --- | --- | --- |
| Fraktur | tessdata_best (Google) | binarized artificial images |
| GT4HistOCR | tesstrain (UB Mannheim) | greyscale-normalized scanned images |

On euler_rechenkunst01_1738 (a bag with JPEG artifacts and lower resolution):

| input | Fraktur | GT4HistOCR |
| --- | --- | --- |
| binarized (Wolf) | 9.7 | 11.9 |
| raw with transparent/white background | 19.7 | 9.9 |
| raw with estimated background color | 12.9 | 61.3 |
| greyscale-normalized (ocropy-nlbin) | 9.4 | 9.8 |

Results are similar in tendency for weigel_gnothi02_1618 (which is cleaner):

| input | Fraktur | GT4HistOCR |
| --- | --- | --- |
| binarized (Wolf) | 14.0 | 7.3 |
| raw with transparent/white background | 36.5 | 8.3 |
| raw with estimated background color | 13.8 | 58.6 |
| greyscale-normalized (ocropy-nlbin) | 13.5 | 7.0 |

So Tesseract gets perplexed …

  • on a standard model (trained binarized):
    • … not at all by greyscale normalized line images,
    • … somewhat by median background in unnormalized greyscale line images,
    • … extremely by white background in unnormalized greyscale line images,
  • on GT4HistOCR model (trained grayscale_normalized):
    • … somewhat by binarized line images,
    • … a little by white background in unnormalized greyscale line images,
    • … extremely by median background in unnormalized greyscale line images

@wrznr (Contributor) commented Sep 24, 2019

It is striking that both the stock and the GT4HistOCR models perform so poorly! A CER between 7 and 9% is a) simply not good enough and b) far short of the numbers @stweil reported at the OCR-D developer workshop.

@stweil (Contributor) commented Sep 24, 2019

@wrznr, CER often needs a closer look, especially as to what kinds of errors occurred (maybe missing normalization?).

@bertsky, how can I reproduce your setting?

@bertsky (Collaborator, Author) commented Sep 24, 2019

It is striking that both, the stock and the GT4HistOCR model, perform so poorly! CER between 7 and 9 % is a) simply not good enough b) way below the numbers @stweil reported at OCR-D developer workshop.

True! That was also one of my messages at the workshop. Generally, I am quite certain this is due to the relatively bad quality of our GT:

  • very imprecise segmentation (often too coarse with much overlap, sometimes too tight with cutoff) and
  • suboptimal image quality (often noisy/skewed/warped, sometimes compression artifacts).

Despite all the preprocessing and resegmentation efforts, we are not able to squeeze less than 11% CER out of the whole dataset with the stock models. And I don't believe you would get much better results if you trained/finetuned a dedicated OCR model on our GT. But maybe @stweil wants to disprove that?

GT4HistOCR also looks much cleaner. If really all they did for preprocessing was running ocropus-nlbin and ocropus-gpageseg, then their raw data already were a lot easier to begin with. (And as already argued, this might be simply the result of filtering out the lesser-quality lines.)

So, I think the situation demands:

  • better despeckling: At the moment, we only have ocrd-cis-ocropy-denoise (which wraps ocrolib.remove_noise for black noise, so maybe we at least should make it work for white noise as well: @Doreenruirui, I think I might just do that). But really it's time we finally got hold of some data-driven despeckling: @n00blet @mjenckel, can you offer something?

  • better deskewing: We currently rely on ocrd-cis-ocropy-deskew and ocrd-tesserocr-deskew, both of which improve overall results a little, but often enough fail miserably for no good (visually convincing) reason. Again, it's time we finally got some data-driven deskewing: I was struggling to get ocrorot running, but I am sure there must be more tools out there. @n00blet @mjenckel, can you offer something?

  • even better binarization

  • semi-interactively improving GT segmentation (aided by our existing clipping/resegmentation/repair tools)
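On the despeckling item: the kind of black-noise removal that ocrolib.remove_noise performs can be approximated by dropping connected components below a size threshold. A minimal pure-Python sketch under that assumption (not the actual ocrolib or ocrd-cis-ocropy-denoise implementation):

```python
def despeckle(image, max_size=2):
    """Remove black (1) connected components of at most `max_size`
    pixels from a binary image given as a list of lists of 0/1.
    Uses 4-connectivity; returns a new image."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if image[y][x] and not seen[y][x]:
                # flood-fill the component starting at (y, x)
                stack, component = [(y, x)], []
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    component.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                if len(component) <= max_size:
                    for cy, cx in component:
                        out[cy][cx] = 0  # erase the speckle
    return out
```

White-noise removal, as suggested above, would be the dual operation: filter small white components inside black regions instead.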

@bertsky (Collaborator, Author) commented Sep 24, 2019

Example from our preprocessing/resegmentation pipeline: cleanly separated from neighbours, but still quite noisy

[image: OCR-D-IMG-DEWARP_0001_r_3_1_tl_10]

Example from GT4HistOCR: very clean, but skewed/warped

[image: 00008 nrm]

@wrznr (Contributor) commented Sep 24, 2019

@bertsky @stweil So our impression

And as already argued, this might be simply the result of filtering out the lesser-quality lines.

that GT4HistOCR is suboptimal for OCR training is real.

And I don't believe you would get much better results if you trained/finetuned a dedicated OCR model on our GT.

I am with you. But we could try adding noise to GT4HistOCR. Where could we further discuss this issue? (This PR is clearly not the right place.) A dedicated issue in tesstrain?

@bertsky (Collaborator, Author) commented Sep 24, 2019

@stweil The above link should work with the current versions. But it's probably best to use the PRs I have used: OCR-D/core#311 and #75 and cisocrgroup/ocrd_cis#16 and master cor-asv-ann (for evaluation).

CER measurement in cor-asv-ann-evaluate works as documented: vanilla Levenshtein, no normalization. This is best for comparability; results might look better (and fairer) with different metrics and normalizations. The package offers some, but results look quite similar with other settings, even NFC.
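The vanilla Levenshtein CER mentioned here can be reproduced with a standard dynamic-programming edit distance, a sketch for illustration (cor-asv-ann-evaluate's actual implementation may differ):

```python
def cer(gt, ocr):
    """Character error rate: Levenshtein distance between ground
    truth and OCR output, normalized by ground-truth length."""
    m, n = len(gt), len(ocr)
    # prev[j] holds the edit distance between gt[:i-1] and ocr[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if gt[i - 1] == ocr[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m if m else 0.0
```

For example, `cer("kitten", "sitting")` is 3/6 = 0.5; no Unicode normalization is applied, matching the "no normalization" setting above.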

@bertsky (Collaborator, Author) commented Sep 24, 2019

But we could try to noise GT4HistOCR. Where could we further discuss this issue? (This PR is clearly not the right place.) A dedicated issue in tesstrain?

I agree, both degradation and binarization should be employed to make GT4HistOCR models robust.

@wrznr (Contributor) commented Sep 26, 2019

@bertsky Is there a PR in core which blocks merging here?

@bertsky (Collaborator, Author) commented Sep 26, 2019

Where could we further discuss this issue? (This PR is clearly not the right place.) A dedicated issue in tesstrain?

@stweil Is tesseract-ocr/tesstrain#73 the right place for this? Or better open a new issue strictly about binarization/augmentation (not specific to GT4HistOCR)?

BTW, according to my measurements, Fraktur performs similarly on GT4HistOCR data, although not quite as badly as on our GT. But of course, it really depends on the subcorpus/century:

  • dta19 and RefCorpus-ENHG-Incunabula and Kallimachos: 7.0% CER
  • dta19 only: 4.9% CER

Since our GT is more diverse / has a smaller 19th-century share, worse results are to be expected (independent of bad segmentation / preprocessing).

@bertsky (Collaborator, Author) commented Sep 26, 2019

Is there PR in core which blocks merging here?

No, not really. OCR-D/core#311 is related, but not a dependency. Thanks, I will merge this for now.

@bertsky bertsky merged commit 02d5c2b into OCR-D:master Sep 26, 2019
@bertsky bertsky deleted the use-core-alternativeimage-comments-selectors branch October 2, 2020 19:44