compression artifacts in GT #33
Hi @bertsky, thank you very much. The listed objects come from different libraries. Even today, TIFF images cannot simply be downloaded from them. TIFF header: …

Why is such data in the GroundTruth? We know that training requires the best possible data, available in large quantity and variety. We are still trying to increase the amount of training data.
Thanks @tboenig for this thorough investigation and explanation! If those files are there to stay, and for good reasons too, then I recommend at least marking them as degenerate in the GT repos (or even splitting GT into a "good" and a "robust" set). Also, under these circumstances, I think we should give binarization a closer look (effective DPI, artifacts).
@bertsky
@tboenig will provide those lists and we will evaluate how to integrate automated checks (image characterization) into workspace validation in core. |
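For illustration, such a characterization check could boil down to reading each image's resolution and compression metadata. Here is a minimal sketch in Python using Pillow; the function name, thresholds and warning texts are made up for illustration and are not the actual validator API in core:

```python
# Hypothetical image characterization check -- a sketch only, not the
# actual workspace validation code in OCR-D/core. Assumes Pillow.
from PIL import Image

def characterize(path, min_dpi=300):
    """Collect human-readable warnings about a GT image file."""
    warnings = []
    with Image.open(path) as img:
        dpi = img.info.get('dpi')  # None if the file carries no resolution tag
        if not dpi:
            warnings.append('no pixel density metadata')
        elif min(dpi) < min_dpi:
            warnings.append('pixel density %s below %d DPI' % (dpi, min_dpi))
        # For TIFF input, Pillow exposes the compression scheme by name:
        compression = img.info.get('compression')
        if compression and 'jpeg' in str(compression).lower():
            warnings.append('lossy JPEG compression inside TIFF container')
    return warnings
```

A validator could then aggregate these warnings per workspace and fail (or merely warn) on degenerate images.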
@bertsky was referring to the GT we offer for training, not the assets repo itself.
What's the status of the work on a …? And, related but independent: what about those datasets which have wrong resolution metadata (e.g. …)? (The images in the 2 mentioned bags also contain a digital footer added to the scan – this is clearly wrong, isn't it?)
To illustrate, this is what happens during …

[illustration not preserved]

Thus, because …
@tboenig, being the GT guru, should answer this. Pragmatically, I would relax the requirements on pixel density, since we just cannot rely on image metadata for this, unfortunately. Cf. OCR-D/spec#129 and OCR-D/core#339
Thanks @kba for addressing this quickly. This is a real problem for our workflows – for preprocessing (as can be seen above) just as much as for segmentation and OCR (e.g. Tesseract's DPI variable). I am a bit surprised by your stance, though. When @wrznr and I brought this up at the last developer workshop, we encouraged module projects to make their components DPI-aware/relative. Why was there no objection at the time? However, if you want to do it this way, please do it better. I took the liberty of adding reviews on both your spec PR (for a better definition of exceptions) and core PR (for a more manageable reaction). I know it's much more work, but I believe we risk losing big in overall achievable quality if we just let this slip through.
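To make "DPI-aware" concrete: a processor could treat implausible metadata as absent and fall back to a default, instead of either trusting it blindly or dropping the requirement altogether. A sketch under stated assumptions; the plausibility range, fallback value and helper name are all invented, and nothing here is core API:

```python
# Hypothetical defensive DPI lookup for a module processor -- not core API.
from PIL import Image

PLAUSIBLE_DPI = (72, 1200)  # assumption: densities outside this range are bogus
FALLBACK_DPI = 300          # assumption: a typical density for GT scans

def effective_dpi(path):
    """Return a usable pixel density, trusting metadata only if plausible."""
    with Image.open(path) as img:
        dpi = img.info.get('dpi')
    if dpi:
        xres = round(float(dpi[0]))
        if PLAUSIBLE_DPI[0] <= xres <= PLAUSIBLE_DPI[1]:
            return xres
    # Fall back to a fixed default; a smarter variant might estimate the
    # density from the physical page size or typical text line heights.
    return FALLBACK_DPI
```

Relative parameters (window sizes, kernel radii) could then be scaled by this value rather than hard-coded in pixels.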
Another report on GT issues (not assets):
In …, images show clear signs of JPEG compression, with notable artifacts around sharp contrasts like graphemes. ImageMagick identifies them as TIFF with 200 PPI (or 72 PPI, or no resolution tag at all), without compression, without any `crs` or `exif` tags, and with very few `tiff` tags (e.g. no `software` or `artist`).

(In contrast, "good" images in other workspaces are identified as TIFF with 300 PPI, without compression, with full `aux`, `crs`, `xmp`, `exif` and `tiff` tags, which list the camera model, the exposure settings, the true date stamp – somewhere in 2011 – and that the file was created with `Adobe Photoshop Lightroom`. Sometimes they are also TIFF with 300 PPI, without compression, without those tags, but listing `IrfanView`, `PROView`, `OmniScan` or `multidotscan` as creator software.)

I found this because I had trouble binarizing such images: I would always get too many (un)connected components, regardless of threshold settings.
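For what it's worth, the symptom is easy to quantify. A sketch assuming scikit-image; the file path and the reading of the result are illustrative only:

```python
# Reproducing the "too many connected components" symptom -- a sketch only.
from skimage.io import imread
from skimage.color import rgb2gray
from skimage.filters import threshold_otsu
from skimage.measure import label

page = imread('OCR-D-IMG/page_0001.tif')  # hypothetical sample file
gray = rgb2gray(page) if page.ndim == 3 else page
binary = gray < threshold_otsu(gray)      # foreground = dark pixels
num_components = int(label(binary).max())
# JPEG block artifacts around glyphs shatter into many tiny speckles, so a
# count far above the expected number of glyphs on the page hints at lossy
# compression rather than a badly chosen threshold.
print('connected components:', num_components)
```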
@tboenig I'd say this is the most urgent issue so far.