Re-work the ImageMagick PDF thumbnail filter #8849

Open
alanorth opened this issue May 16, 2023 · 7 comments
Labels
help wanted (needs a volunteer to claim to move forward), new feature, tools: media-filters (related to filter-media, full text extraction or thumbnail creation)

Comments

@alanorth
Contributor

alanorth commented May 16, 2023

Is your feature request related to a problem? Please describe.
The ImageMagick PDF thumbnail filter in DSpace versions up to 7.6 has several problems:

  1. Generation loss due to converting PDF → lossy JPEG → lossy JPEG. See ImageMagickPdfThumbnailFilter.java
  2. Uses the outdated JPEG format.

Describe the solution you'd like

First, a double lossy conversion wastes resources and is obviously bad practice when working with lossy codecs. I recently estimated an average drop of 1.2 points in the ssimulacra2 score due to generation loss. This is like making a photocopy of a photocopy.
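A minimal sketch of what a single-step conversion could look like, written in Python purely for illustration (the actual filter is Java and is not reproduced here). It only builds an ImageMagick command line that renders the first PDF page straight to the final thumbnail, with no intermediate JPEG. The function name and the default size, density, and quality values are assumptions, not DSpace settings.

```python
def build_thumbnail_command(pdf_path, out_format="jpg", size=600,
                            density=144, quality=80):
    """Return an ImageMagick argv that goes from PDF to thumbnail in one step.

    All default values are illustrative assumptions, not DSpace defaults.
    """
    output_path = f"{pdf_path}.{out_format}"   # e.g. x.pdf -> x.pdf.jpg
    return [
        "magick",                        # ImageMagick 7 entry point ("convert" on 6)
        "-density", str(density),        # rasterisation resolution, set before reading
        f"{pdf_path}[0]",                # first page of the PDF only
        "-flatten",                      # composite the page onto the background
        "-thumbnail", f"{size}x{size}",  # resize to fit within size x size
        "-quality", str(quality),        # single lossy encode
        output_path,
    ]
```

The returned list can be passed to `subprocess.run` on a machine with ImageMagick and Ghostscript installed; the point is only that one lossy encode happens instead of two.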

Second, we should be using a more modern image format that allows similar visual quality with drastically reduced file sizes. I propose WebP, which requires an average of 33% fewer bits than JPEG to achieve the same visual quality and has broad support in web browsers and beyond. (Yes, I know that WebP is not perfect and is already over ten years old, but the results speak for themselves.)

On this second point, I have done an extensive evaluation of JPEG versus WebP and AVIF using a large sample of PDFs. The results can be summarized in this one plot of perceptual quality versus bits per pixel (BPP):

[Plot: AVIF and WebP need fewer bits than JPEG to achieve the same visual quality]

My full comparison, with methodology and source code, is here: Evaluating JPEG, WebP, and AVIF

Additional context
There has been some past discussion and work on the ImageMagick PDF thumbnail filter:

alanorth added the "new feature" and "tools: media-filters" labels May 16, 2023
tdonohue added the "help wanted" label May 16, 2023
github-project-automation bot moved this to 🆕 Triage in DSpace Backlog May 16, 2023
tdonohue moved this from 🆕 Triage to 🙋 Needs Help / Unscheduled in DSpace Backlog May 16, 2023
@alanorth
Contributor Author

alanorth commented May 17, 2023

The various ImageMagick thumbnail filter classes are tightly intertwined, which makes it hard to change the image format for PDFs alone. I suppose we could adopt the WebP format for image thumbnails too—after all, in the case that an ORIGINAL bundle has a JPEG, we already create a lossy JPEG from that to put in the THUMBNAIL bundle.

It could maybe even be configurable. Perhaps @terrywbrady or @tdonohue have some comments or suggestions.

Note: I might even suggest starting by commenting the existing code, because it's not well documented.

@tdonohue
Member

tdonohue commented May 17, 2023

I'm not against switching to WebP for all thumbnails. It sounds like others (YouTube and Facebook) have made the switch and found better performance: https://web.dev/serve-images-webp/

That said, I think this sort of change would obviously need to be made in a major release (8.0 at the earliest), as it'd require sites to recreate all their thumbnails (luckily we already have a script for that).

@alanorth
Contributor Author

Thanks @tdonohue. There is no need for sites to regenerate their thumbnails. Old JPEG thumbnails generated by ImageMagick will still work.

Also, I appreciate the thinking behind the ImageMagick filter setting the bitstream description to IM Thumbnail more than ever. It means that these thumbnails were derived from some original source and can be re-generated if need be. I've regenerated tens of thousands of thumbnails several times in the past decade (to increase resolution, to take advantage of improvements in ImageMagick/Ghostscript themselves, as well as our own improved handling of various PDF features).

I don't want to bite off more than I can chew, but it might be good to re-think this so that the format is configurable. Maybe some site wants to stick with JPEG. Maybe AVIF or JPEG-XL become viable. Etc...

@alanorth
Contributor Author

alanorth commented May 17, 2023

> Also, I appreciate the thinking behind the ImageMagick filter setting the bitstream description to IM Thumbnail more than ever.

Ah, I just tested and it seems I misunderstood. The IM Thumbnail description does not behave like the Generated Thumbnail description—in the default configuration it is not considered at all when re-generating thumbnails. The filter-media script looks for bitstreams in the ORIGINAL bundle, for example x.pdf, which do not have a corresponding x.pdf.jpg in the THUMBNAIL bundle and filters them.

If we change the default to WebP then sites will automatically be missing the appropriate x.pdf.webp and there will be a filter storm automatically creating hundreds, thousands, or tens of thousands of WebP files. To make things worse there would be all the old x.pdf.jpg files left over.
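The matching rule and its consequence can be sketched roughly as follows. This is a Python illustration of the behaviour described in this comment, not the filter-media code itself; `plan_filter_run` and its arguments are hypothetical names, and bundles are reduced to plain lists of bitstream names.

```python
def plan_filter_run(original_names, thumbnail_names, extension):
    """Model which thumbnails a run would create for a given output extension.

    Returns (to_create, left_over):
      to_create: ORIGINAL names with no "<name><extension>" in THUMBNAIL
      left_over: THUMBNAIL names that no longer correspond to the expected
                 "<name><extension>" of any ORIGINAL bitstream
    """
    existing = set(thumbnail_names)
    expected = {name + extension for name in original_names}
    to_create = [name for name in original_names
                 if name + extension not in existing]
    left_over = sorted(existing - expected)
    return to_create, left_over


originals = ["a.pdf", "b.pdf"]
thumbnails = ["a.pdf.jpg", "b.pdf.jpg"]

# With the current extension nothing needs to be done.
print(plan_filter_run(originals, thumbnails, ".jpg"))
# After switching the default, every original is "missing" its thumbnail
# and every old JPEG is left behind.
print(plan_filter_run(originals, thumbnails, ".webp"))
```

Note that in this simplified model a manually uploaded thumbnail with an unrelated name would also show up in `left_over`; telling those apart is exactly the open question in this thread.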

Even if we add IM Thumbnail to the default replacement pattern, for example ^(Generated|IM) Thumbnail$, the process won't replace them unless it's for the same source bitstream, i.e. x.pdf.
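For reference, a small Python check of which descriptions the widened pattern quoted above would accept. The pattern string is taken from this comment; the function name is hypothetical, and how DSpace actually reads and applies the pattern is not shown here.

```python
import re

# Example pattern from the comment above; not necessarily the shipped default.
REPLACEABLE_DESCRIPTION = re.compile(r"^(Generated|IM) Thumbnail$")


def description_is_replaceable(description):
    """True if the bitstream description matches the widened pattern."""
    return REPLACEABLE_DESCRIPTION.match(description or "") is not None


for text in ["Generated Thumbnail", "IM Thumbnail", "Cover scan", None]:
    print(repr(text), description_is_replaceable(text))
```

A matching description alone is not sufficient: the thumbnail still has to belong to the same source bitstream, which the pattern cannot express.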

So yes this needs more thought. You're right! I will experiment on our own site for implementation ideas and more corner cases.

@mwoodiupui
Member

Rather than depending on magical description strings, shouldn't we key off of the MIME type? If there's a Bitstream in a THUMBNAIL bundle where the ORIGINAL bundle's file name is a prefix of its name, and its dc.type matches image/*, then don't derive a new thumbnail. We are setting the type of thumbnail image Bitstreams, yes?
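A rough Python sketch of the check proposed here, assuming each THUMBNAIL bitstream can be reduced to a (name, MIME type) pair. The function name and that data shape are assumptions made for illustration; nothing here is existing DSpace API.

```python
def has_image_thumbnail(original_name, thumbnails):
    """Return True if some thumbnail looks derived from original_name.

    thumbnails is an iterable of (name, mime_type) pairs. A thumbnail counts
    when the ORIGINAL file name is a prefix of its name and its type is image/*.
    """
    return any(
        name.startswith(original_name) and mime_type.startswith("image/")
        for name, mime_type in thumbnails
    )


thumbs = [("x.pdf.webp", "image/webp"), ("y.pdf.txt", "text/plain")]
print(has_image_thumbnail("x.pdf", thumbs))  # a WebP thumbnail counts
print(has_image_thumbnail("y.pdf", thumbs))  # extracted text does not
```

Because the check ignores the extension, an existing x.pdf.jpg would still count as a thumbnail after a format switch. Plain prefix matching is loose, though: x.pdf is also a prefix of x.pdf2.jpg, so a real implementation would probably want a stricter comparison.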

Or, if we want to support "manually" deposited thumbnails, then perhaps DSpace-derived Bitstreams need a "derived from" metadata field. There's probably a standardized name for this relationship in some well-known namespace. Then use this field as the "replace it" criterion. Use the above scheme in the case of missing metadata.

Perhaps there should be an option to bypass the check, obliterate all matching Bitstreams, and derive a new one? (Or do we already have that?)

@mwoodiupui
Member

Ugh, this is complex. For full generality, we need the "derived" relationship to be "derived(from, by)" so that "manual" deposits can be marked "derived(from, 'depositor')" and DSpace-derived bitstreams "derived(from, 'repository')".
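One way to picture the derived(from, by) relationship is as a small record attached to each thumbnail. The following Python sketch is only an illustration under that assumption: the `Derivation` class, its field names, and `may_replace` are hypothetical and do not correspond to an existing metadata field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Derivation:
    source: str  # name of the ORIGINAL bitstream this was derived from
    by: str      # "repository" for DSpace-derived, "depositor" for manual uploads


def may_replace(derivation: Optional[Derivation], source_name: str) -> bool:
    """Decide whether a thumbnail may be regenerated for source_name.

    Only repository-derived thumbnails of the same source are replaceable.
    Missing metadata is treated conservatively here; a real implementation
    would fall back to another scheme, such as the name and type check.
    """
    if derivation is None:
        return False
    return derivation.source == source_name and derivation.by == "repository"


print(may_replace(Derivation("x.pdf", "repository"), "x.pdf"))
print(may_replace(Derivation("x.pdf", "depositor"), "x.pdf"))
```

Keying on such a record rather than on the file name would make the decision independent of the thumbnail's extension.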

@alanorth
Contributor Author

alanorth commented May 19, 2023

Yeah it's complex. I think the vast majority of sites don't customize their thumbnail setup at all, so we need to try to do the most sane thing by default and leave crazy stuff to over-involved admins like us 😝. The current system might be the sanest in that regard, but the comments in dspace.cfg could be better.

I think it is reasonable that a bitstream x.pdf in the "ORIGINAL" bundle automatically yields a bitstream x.pdf.jpg in the "THUMBNAIL" bundle. And yes, when the filters create the thumbnail they add a bitstream description and a format.

There's an "obliterate all matching Bitstreams" option (aka force), but the keyword there is "matching". In the case of the ImageMagick Thumbnail Filter, a thumbnail will only be replaced in "force" mode if:

  • The thumbnail name is: x.pdf.jpg
  • The thumbnail description is: IM Thumbnail

Otherwise, the filter-media script assumes this is a manually uploaded thumbnail. And this does nothing for the case where we want to switch the default thumbnail format to WebP, because we blindly create x.pdf.webp and leave behind the x.pdf.jpg. In our repository at least this would be close to 50,000 old JPEGs!
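The two conditions above amount to something like the following Python sketch. It models only the rule as described in this comment; `replaced_in_force_mode` is a hypothetical name and the real logic lives in the Java filter classes.

```python
def replaced_in_force_mode(thumb_name, thumb_description, source_name):
    """True if force mode would replace this thumbnail for source_name.

    Both conditions must hold: the name is "<source>.jpg" and the
    description is exactly "IM Thumbnail". Anything else is treated as a
    manually uploaded thumbnail and left alone.
    """
    return (thumb_name == source_name + ".jpg"
            and thumb_description == "IM Thumbnail")


print(replaced_in_force_mode("x.pdf.jpg", "IM Thumbnail", "x.pdf"))
print(replaced_in_force_mode("x.pdf.jpg", "Cover scan", "x.pdf"))
print(replaced_in_force_mode("cover.jpg", "IM Thumbnail", "x.pdf"))
```

The rule only decides whether an existing x.pdf.jpg is overwritten. It says nothing about removing that file once the filter starts writing x.pdf.webp instead, which is why the old JPEGs would pile up.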
