Increase maximum EOF offset for all JPG signatures #53

Open
sbshep opened this issue Apr 24, 2024 · 5 comments

@sbshep

sbshep commented Apr 24, 2024

We are getting a lot of .JPG files from modern camera phones that append so many zero bytes after the final FFD9 that the padding exceeds the maximum EOF offset. As a result, DROID doesn't identify those files.

The existing JPG signature files have a maximum EOF offset of 16000, 65536, or 131072. Were those offsets chosen for a specific reason? Is there any reason the offset couldn't be extended much higher to account for the extra padding in these modern files?

As a test, I created a signature that matches fmt/645 but I increased the maximum EOF offset to 999999999 (note: I found that going higher by adding even one more 9 resulted in DROID failing to load the profile). I ran a sample file through and it identified it correctly. There is apparently a ceiling past which the profile won't load, but even 999999999 should be sufficient, I think.

Any thoughts or experience with increasing the maximum EOF offset? I've attached a sample .jpg that identifies as fmt/645 if you either remove the padding or increase the maximum EOF offset.
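To check how much padding a given file carries, a minimal Python sketch like the following can help. It only measures the bytes after the last FFD9 and is not part of DROID; the function name `trailing_after_eoi` and the synthetic sample are made up for illustration.

```python
EOI = b"\xff\xd9"  # JPEG End Of Image marker


def trailing_after_eoi(data: bytes) -> tuple[int, bool]:
    """Return (byte count after the last FFD9, whether those bytes are all zero)."""
    pos = data.rfind(EOI)
    if pos == -1:
        raise ValueError("no FFD9 marker found")
    tail = data[pos + len(EOI):]
    return len(tail), tail.count(0) == len(tail)


# Synthetic stand-in for a padded camera file (not a real JPEG).
sample = b"\xff\xd8" + b"\x01\x02\x03" + EOI + b"\x00" * 200_000
pad_len, all_zero = trailing_after_eoi(sample)
print(pad_len, all_zero)  # 200000 True
```

Any result above 131072 means the final FFD9 sits outside the largest EOF offset mentioned above.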

fmt645 if you remove padding.JPG.zip

@thorsted
Contributor

Hi Scott, this has come up a few times in the past. Adding more range to the EOF offset is an option, but many folks keep the default maximum byte scan setting of 65536 in DROID. This definitely needs to be addressed, as we are seeing it more and more. The change would affect multiple PUIDs, some of which might never have this padding.

The EXIF specs state that any JPG renderer should ignore anything after the last FFD9 marker, so we should find something similar for PRONOM.

I discussed a bit and linked to samples here: https://preservation.tylerthorsted.com/2023/06/23/jpg-structure/

@richardlehane

richardlehane commented Apr 25, 2024

If you aggregate all the PRONOM signatures, the ceiling for EOF scanning is currently 131084.

Making this offset larger than that would raise that ceiling and mean doing much larger end-of-file reads for all file types.

You've got another option, which is to define additional signatures where the FFD9 marker is anchored to the beginning of the file but at a wildcard offset, e.g. FFD8FFE1{2}4578696600004D4D002A*900000070000000430323231*FFD9
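To get a feel for what such a BOF-anchored signature accepts, here is a rough Python regex rendering of the example pattern. It is an approximation for experimenting outside DROID, not how DROID or Siegfried implement matching; the name `BOF_JPEG` and the synthetic sample bytes are assumptions for illustration.

```python
import re

# Rough regex rendering of the example signature: PRONOM's {2} becomes .{2}
# and each * becomes a non-greedy .*? (DOTALL so "." matches every byte value).
BOF_JPEG = re.compile(
    rb"\xff\xd8\xff\xe1"                   # FFD8FFE1
    rb".{2}"                               # {2}: any two bytes
    rb"Exif\x00\x00MM\x00\x2a"             # 4578696600004D4D002A
    rb".*?"                                # *
    rb"\x90\x00\x00\x07\x00\x00\x00\x04"   # 9000000700000004
    rb"0221"                               # 30323231
    rb".*?"                                # *
    rb"\xff\xd9",                          # FFD9
    re.DOTALL,
)

# Synthetic byte string shaped like the signature (not a real JPEG).
body = (
    b"\xff\xd8\xff\xe1\x12\x34"
    + b"Exif\x00\x00MM\x00\x2a"
    + b"\x00" * 8
    + b"\x90\x00\x00\x07\x00\x00\x00\x04"
    + b"0221"
    + b"\x11" * 16
    + b"\xff\xd9"
)
padded = body + b"\x00" * 1_000_000  # a megabyte of trailing zeroes

m = BOF_JPEG.match(padded)  # match() anchors at offset 0, like a BOF signature
print(m is not None, m.end() == len(body))  # True True
```

Because the pattern never looks at the end of the file, the amount of trailing padding has no effect on the result in this sketch.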

In terms of changing the default setting in DROID, I definitely agree. The benchmarks I run show that the impact should be tolerable (DROID still works really well with a -1 setting): https://www.itforarchivists.com/siegfried/benchmarks

@thorsted
Contributor

Richard's suggestion of anchoring FFD9 at the end of the BOF signature might work best. Two concerns:

  1. To which JPG-related PUIDs does this update need to be applied? All of them?
  2. If someone doesn't have the Max Byte Scan size set to -1, will this signature change the outcome for some?
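The second concern can be explored with a small Python model. It assumes, as a simplification that has not been verified against DROID's implementation, that a positive Max Byte Scan value means only the first N bytes are available to a BOF-anchored pattern. The pattern `SOI_TO_EOI` and the helper `matches_within_scan` are hypothetical stand-ins.

```python
import re

# Simplified stand-in for a BOF-anchored wildcard signature: FFD8 ... FFD9.
SOI_TO_EOI = re.compile(rb"\xff\xd8.*?\xff\xd9", re.DOTALL)


def matches_within_scan(data: bytes, max_byte_scan: int) -> bool:
    """Match against only the bytes a limited scan would see (-1 = whole file)."""
    window = data if max_byte_scan < 0 else data[:max_byte_scan]
    return SOI_TO_EOI.match(window) is not None


# Synthetic file: FFD9 sits about 100,000 bytes in, followed by zero padding.
sample = b"\xff\xd8" + b"\x11" * 100_000 + b"\xff\xd9" + b"\x00" * 500_000

limited = matches_within_scan(sample, 65536)   # FFD9 lies beyond the window
unlimited = matches_within_scan(sample, -1)    # whole file is visible
print(limited, unlimited)  # False True
```

Under that assumption, the new signature would only help users on the default setting when the final FFD9 falls within the first 65536 bytes.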

@sbshep
Author

sbshep commented Apr 26, 2024

I echo Tyler's questions. I like the idea of anchoring FFD9 from the BOF, but what happens if there is more than one FFD9 in a file? Does that pose a problem?
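For anyone wanting to see how common multiple FFD9 markers are in their own files, a short Python sketch can list every occurrence. The helper `eoi_offsets` and the synthetic layout (an embedded thumbnail with its own FFD9) are illustrative assumptions, not taken from a real sample.

```python
EOI = b"\xff\xd9"  # JPEG End Of Image marker


def eoi_offsets(data: bytes) -> list[int]:
    """Return the offset of every FFD9 byte pair in the data."""
    offsets, start = [], 0
    while (pos := data.find(EOI, start)) != -1:
        offsets.append(pos)
        start = pos + 1
    return offsets


# Synthetic layout: an embedded thumbnail with its own FFD9, then the main image.
thumb = b"\xff\xd8" + b"\x22" * 10 + EOI
sample = b"\xff\xd8" + b"\x11" * 4 + thumb + b"\x33" * 20 + EOI + b"\x00" * 50

offsets = eoi_offsets(sample)
print(offsets)  # [18, 40]
```

With a wildcard in front of FFD9, a regex-style matcher is satisfied by the first marker it reaches (offset 18 here), so extra markers should not block a match; the trade-off is that the match no longer says where the image really ends. Whether DROID resolves the wildcard the same way is an assumption worth testing.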

@sbshep
Author

sbshep commented Jun 17, 2024

Any further thoughts on how to handle these JPG files with extra data after the final FFD9? As I've experimented with anchoring to the BOF, I've found that it may identify files that wouldn't otherwise be identified, but since we're using JHOVE to validate these files afterward, I'm not so worried about missing a file that has a problem. Is anchoring to the BOF a viable option for everyone?
