compression artifacts in GT #33
Hi @bertsky, thank you very much. The listed objects come from different libraries. Even today, TIFF images cannot simply be downloaded from them. TIFF header: …

Why is such data in the GroundTruth? We know that training requires the best possible data, available in large quantity and variety. We are still trying to increase the amount of training data.
Thanks @tboenig for this thorough investigation and explanation! If those files are there to stay, and for good reasons too, then I recommend at least marking them as degenerate in the GT repos (or even splitting GT into a "good" and a "robust" set). Also, under these circumstances, I think we should give binarization a closer look (effective DPI, artifacts).
@bertsky
@tboenig will provide those lists and we will evaluate how to integrate automated checks (image characterization) into workspace validation in core. |
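For illustration, such a characterization check could boil down to reading each image's resolution and compression metadata. Here is a minimal sketch in Python using Pillow; the function name, thresholds and warning texts are made up for illustration and are not the actual validator API in core:

```python
# Hypothetical image characterization check -- a sketch only, not the
# actual workspace validation code in OCR-D/core. Assumes Pillow.
from PIL import Image

def characterize(path, min_dpi=300):
    """Collect human-readable warnings about a GT image file."""
    warnings = []
    with Image.open(path) as img:
        dpi = img.info.get('dpi')  # None if the file carries no resolution tag
        if not dpi:
            warnings.append('no pixel density metadata')
        elif min(dpi) < min_dpi:
            warnings.append('pixel density %s below %d DPI' % (dpi, min_dpi))
        # For TIFF input, Pillow exposes the compression scheme by name:
        compression = img.info.get('compression')
        if compression and 'jpeg' in str(compression).lower():
            warnings.append('lossy JPEG compression inside TIFF container')
    return warnings
```

A validator could then aggregate these warnings per workspace and fail (or merely warn) on degenerate images.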
@bertsky was referring to the GT we offer for training, not the assets repo itself.
What's the status of the work on a …? And, related but independent: what about those datasets which have wrong resolution metadata (e.g. …)? (The images in the 2 mentioned bags also contain a digital footer added to the scan – this is clearly wrong, isn't it?)
To illustrate, this is what happens during …

[illustration not preserved]

Thus, because …
@tboenig, being the GT guru, should answer this. Pragmatically, I would relax the requirements on pixel density, since we just cannot rely on image metadata for this, unfortunately. Cf. OCR-D/spec#129 and OCR-D/core#339
Thanks @kba for addressing this quickly. This is a real problem for our workflows – for preprocessing (as can be seen above) just as much as for segmentation and OCR (e.g. Tesseract's DPI variable). I am a bit surprised by your stance, though. When @wrznr and I brought this up at the last developer workshop, we encouraged module projects to make their components DPI-aware/relative. Why was there no objection at the time? However, if you want to do it this way, please do it better. I took the liberty of adding reviews on both your spec PR (for a better definition of exceptions) and core PR (for a more manageable reaction). I know it's much more work, but I believe we risk losing big in overall achievable quality if we just let this slip through.
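To make "DPI-aware" concrete: a processor could treat implausible metadata as absent and fall back to a default, instead of either trusting it blindly or dropping the requirement altogether. A sketch under stated assumptions; the plausibility range, fallback value and helper name are all invented, and nothing here is core API:

```python
# Hypothetical defensive DPI lookup for a module processor -- not core API.
from PIL import Image

PLAUSIBLE_DPI = (72, 1200)  # assumption: densities outside this range are bogus
FALLBACK_DPI = 300          # assumption: a typical density for GT scans

def effective_dpi(path):
    """Return a usable pixel density, trusting metadata only if plausible."""
    with Image.open(path) as img:
        dpi = img.info.get('dpi')
    if dpi:
        xres = round(float(dpi[0]))
        if PLAUSIBLE_DPI[0] <= xres <= PLAUSIBLE_DPI[1]:
            return xres
    # Fall back to a fixed default; a smarter variant might estimate the
    # density from the physical page size or typical text line heights.
    return FALLBACK_DPI
```

Relative parameters (window sizes, kernel radii) could then be scaled by this value rather than hard-coded in pixels.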
Another report on GT issues (not assets):
In …, images show clear signs of JPEG compression, with notable artifacts around sharp contrasts like graphemes. ImageMagick identifies them as TIFF with 200 PPI (or 72 PPI, or no resolution tag at all), without compression, without any `crs` or `exif` tags, and with very few `tiff` tags (e.g. no `software` or `artist`).

(In contrast, "good" images in other workspaces are identified as TIFF with 300 PPI, without compression, with full `aux`, `crs`, `xmp`, `exif` and `tiff` tags, which list the camera model, the exposure settings, the true date stamp – somewhere in 2011 – and that the file was created with `Adobe Photoshop Lightroom`. Sometimes they are also TIFF with 300 PPI, without compression, without those tags, but listing `IrfanView`, `PROView`, `OmniScan` or `multidotscan` as creator software.)

I found this because I had trouble binarizing such images: I would always get too many (un)connected components, regardless of threshold settings.
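For what it's worth, the symptom is easy to quantify. A sketch assuming scikit-image; the file path and the reading of the result are illustrative only:

```python
# Reproducing the "too many connected components" symptom -- a sketch only.
from skimage.io import imread
from skimage.color import rgb2gray
from skimage.filters import threshold_otsu
from skimage.measure import label

page = imread('OCR-D-IMG/page_0001.tif')  # hypothetical sample file
gray = rgb2gray(page) if page.ndim == 3 else page
binary = gray < threshold_otsu(gray)      # foreground = dark pixels
num_components = int(label(binary).max())
# JPEG block artifacts around glyphs shatter into many tiny speckles, so a
# count far above the expected number of glyphs on the page hints at lossy
# compression rather than a badly chosen threshold.
print('connected components:', num_components)
```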
@tboenig I'd say this is the most urgent issue so far.