
Fix Pillow high depth images #627

Merged 5 commits into OCR-D:master on Oct 23, 2020

Conversation

@bertsky (Collaborator) commented Oct 18, 2020

Digitisation sometimes produces 16-bit or even 32-bit grayscale scans, which Pillow still (as of 8.0) supports poorly. This affects us severely, because we rely on PIL's median, convert and paste to generate page and segment images for all processors. Due to the pertinent bugs (at least python-pillow/Pillow#3159, python-pillow/Pillow#3838, python-pillow/Pillow#3011), we then get fully or partially blacked-out crops.

This workaround attempts to convert such images to 8-bit depth before doing anything else (which is hard enough to get right). The idea is that as soon as images enter the API, be they original or derived, we prevent PIL from doing any harm (at the price of sacrificing some precision).
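
Below is a minimal sketch of the idea (illustrative only, not the actual diff; the reduce_depth helper and the shift-by-8 scaling are my assumptions here):

    import numpy as np
    from PIL import Image

    def reduce_depth(img):
        """Reduce high-depth grayscale to 8-bit before any PIL operation (sketch)."""
        if img.mode.startswith('I;16'):
            # 16-bit unsigned grayscale: keep the most significant byte
            arr = np.asarray(img, dtype=np.uint16)
            return Image.fromarray((arr >> 8).astype(np.uint8), mode='L')
        if img.mode == 'F':
            # 32-bit float grayscale: scale the observed range into 0..255
            arr = np.asarray(img, dtype=np.float32)
            lo, hi = float(arr.min()), float(arr.max())
            if hi > lo:
                arr = (arr - lo) / (hi - lo) * 255
            return Image.fromarray(arr.astype(np.uint8), mode='L')
        return img  # 1-bit and 8-bit modes pass through unchanged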

Here is an example image (which, being TIFF, GitHub unfortunately won't embed directly):

OCR-D-IMG_APBB_Mitteilungen_62.0002.zip

I'm sure you'll (rightly) ask for more "correct" (or at least well-defined) test images, but curating these may take us some time. I can already point you to Pillow's own test set, but I have not properly screened it for good case studies yet.

As an additional difficulty, many image viewer tools don't handle these formats correctly either. IM's display does, but its identify output is not easy to interpret here (though it does still help). It seems to distinguish between actual depth (as in pixel statistics?) and formal depth (as in metadata). You'll get

  • actual depth via -format "%[bit-depth]" or the part after the slash in -verbose's Depth: description (if any)
  • formal depth via -format "%z" or the part before the slash in -verbose's Depth: description (always)

For example, I've seen 16-bit, 16/8-bit, 16/15-bit, 32/16-bit images.
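
Pillow offers a rough equivalent of that distinction: the image mode gives the formal depth, while the pixel extrema give the actual depth. A quick probe (the filename refers to the TIFF from the zip above, so treat it as an assumption):

    import numpy as np
    from PIL import Image

    img = Image.open('OCR-D-IMG_APBB_Mitteilungen_62.0002.tif')
    print(img.mode)                     # formal depth, e.g. 'I;16' for 16-bit grayscale
    arr = np.asarray(img)
    print(arr.dtype)                    # uint16 for 'I;16', but int32 for mode 'I'
    print(int(arr.max()).bit_length())  # actual depth: bits used by the largest pixel value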

Anyway, I suggest separating full regression-test coverage from the actual workaround for now and fast-tracking this. I'll re-run this on a couple of other workspaces and watch for discrepancies.

@codecov-io commented Oct 18, 2020

Codecov Report

Merging #627 into master will decrease coverage by 0.39%.
The diff coverage is 22.22%.

@@            Coverage Diff             @@
##           master     #627      +/-   ##
==========================================
- Coverage   84.72%   84.32%   -0.40%     
==========================================
  Files          52       52              
  Lines        2952     2966      +14     
  Branches      575      579       +4     
==========================================
  Hits         2501     2501              
- Misses        336      349      +13     
- Partials      115      116       +1     
Impacted Files           Coverage  Δ
ocrd/ocrd/workspace.py   66.06%    <22.22%> (-2.93%) ⬇️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 56c4aa6...e202bb8.

@kba (Member) commented Oct 19, 2020

OK, so while this incurs a bit of a computing penalty, it will also reduce memory consumption, right?

If it improves handling with Pillow: fine with me, though I would really like to see a unit test for this.
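
Such a test could look roughly like this (a sketch assuming a depth-reducing helper like the reduce_depth outlined above, not the test that eventually landed):

    import numpy as np
    from PIL import Image

    def test_16bit_grayscale_not_blacked_out():
        # synthetic 16-bit gradient spanning the full uint16 range
        arr = np.linspace(0, 65535, 256, dtype=np.uint16).reshape(16, 16)
        img = Image.fromarray(arr)  # uint16 maps to mode 'I;16'
        out = reduce_depth(img)     # hypothetical helper from the sketch above
        assert out.mode == 'L'
        assert np.asarray(out).max() > 0  # a blacked-out crop would be all zeros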

@kba (Member) commented Oct 19, 2020

AFAICT, we don't have any 16-or-more-bit images in assets currently:

identify -format "%[filename]: %[bit-depth]\n" **/*tif **/*.png

@bertsky (Collaborator, Author) commented Oct 19, 2020

> AFAICT, we don't have any 16-or-more-bit images in assets currently:
>
> identify -format "%[filename]: %[bit-depth]\n" **/*tif **/*.png

Yes, only 8-bit and 1-bit so far.

> OK, so while this incurs a bit of a computing penalty, it will also reduce memory consumption, right?

Actually, I think it reduces both CPU time and memory: in typical workflows, lots of operations follow up on that first open, and they all run faster at smaller depths.

The only true cost here is precision – which I would have preferred to avoid, but Pillow leaves us no choice. (I thought about using RGB for this, because 24 bits should not incur the large quantization error of single-channel 8-bit. But simply having three identical 8-bit channels would not help at all, and we would have to pay attention to color profiles everywhere, especially linear vs. logarithmic, so that we could afterwards convert back to a luma-equivalent grayscale value.)

@kba merged commit affe113 into OCR-D:master on Oct 23, 2020
@bertsky mentioned this pull request on Jan 13, 2021
@bertsky (Collaborator, Author) commented Nov 5, 2021

Sorry, I just found another case where the current implementation does not work yet: 16-bit grayscale which Pillow imports as mode I (not I;16), so its NumPy interface yields int32, which causes our entire value range to get scaled down to 0. (I cannot share the image for privacy reasons, but I will dig up an equivalent and then open an issue...)
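
One way to handle this case is to scale by the observed extrema rather than the formal 32-bit range (a sketch of a possible fix, not the code that was eventually committed):

    import numpy as np
    from PIL import Image

    def i_mode_to_8bit(img):
        # Pillow imports some 16-bit TIFFs as mode 'I' (32-bit signed),
        # so np.asarray yields int32; dividing by the formal 32-bit range
        # would map everything to 0. Scale by the actual extrema instead.
        arr = np.asarray(img, dtype=np.float64)
        lo, hi = float(arr.min()), float(arr.max())
        if hi > lo:
            arr = (arr - lo) / (hi - lo) * 255
        return Image.fromarray(arr.astype(np.uint8), mode='L')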
