Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exiftool based image exif cleaner #263

Merged
merged 1 commit into from
Nov 26, 2023

Conversation

asdfzdfj
Copy link
Contributor

@asdfzdfj asdfzdfj commented Nov 14, 2023

something I mentioned to put up weeks ago

add an image metadata sanitizer/cleaner based on exiftool, comes with following modes, configurable via env variables:

  • none: disable image metadata cleaning. default for external images.
  • sanitize: strips just GPS data and what might looks like serial numbers. default for uploaded images.
  • scrub: strip most of metadata save for those needed for proper image rendition and some attribution tags.

I tried to rig this up as an optional feature, as in it would only work when a valid exiftool binary is found and is executable, and it's effectively a no op when that's not the case without impeding the entire image processing flow.


concerns about where to plug the processor

now the hard part that I could use some suggestion: right now I plugged the cleaner inside ImageRepository::findOrCreateFromPath, right just before storing the image, and I'm not sure if there's more appropiate place to put this:

  • putting it here means all image introduced would be cleaned, including federated images, which may not be desired for busy instances
  • putting it before findOrCreateFrom{Path,Upload} could results in image lookup by hash being less effective because the input image will have to be cleaned before checking for hash although the same, cleaned image might already exists, which means what's supposedly fast path is now slower and the cleaning job is practically wasted
  • right now, the function now looks like at least 3 functions stacked together and looks like it could use some refactoring, but I'm not sure how to go with it (and for some inexplicable reasons I feel like event dispatcher could be used here)

ended up refactored/rearranged the findOrCreateFrom{Path,Upload} functions, reviewed separately at #287

@nobodyatroot nobodyatroot added enhancement New feature or request backend Backend related issues and pull requests labels Nov 14, 2023
@e-five256
Copy link
Member

e-five256 commented Nov 14, 2023

putting it here means all image introduced would be cleaned, including federated images, which may not be desired for busy instances

I think it might be best to only handle local images if possible and not make it the job of the instance owner to scrub all remote media. I don't think anything legally harmful can come from this metadata, it's more of a personal data / security thing? I would expect most incoming to already be scrubbed, it looks like pict-rs does for lemmy and I think imgur did unless they changed something

@e-five256 e-five256 added the needs feedback Requires a greater consensus to make an informed decision label Nov 14, 2023
@asdfzdfj
Copy link
Contributor Author

putting it here means all image introduced would be cleaned, including federated images, which may not be desired for busy instances

I think it might be best to only handle local images if possible and not make it the job of the instance owner to scrub all remote media. I don't think anything legally harmful can come from this metadata, it's more of a personal data / security thing? I would expect most incoming to already be scrubbed, it looks like pict-rs does for lemmy and I think imgur did unless they changed something

cleaning only local images does make sense, but then I'll either go back into the 2nd point where reusing existing image is less effective, or having to duplicate the findOrCreateFromPath into findOrCreateFromUpload just to add the cleaner command, and I'd like to not duplicate the code

also choosing to clean remote images may not be entirely pointless: it reduces amount of image copies that contains sensitive information -- so ideally there would be only one copy with sensitive information unintentionally added on origin instance/server, instead of who knows how many instances that the same image was federated out and mirrored. if anything maybe make it an opt in options to enable it

.env.example Outdated
@@ -72,6 +72,12 @@ OAUTH_KEYCLOAK_VERSION=
# If true, sign ins and sign ups will only be possible through the OAuth providers configured above
SSO_ONLY_MODE=

# image exif cleaning options
# available value: none, sanitize, scrub
EXIF_CLEAN_MODE=sanitize
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep in mind, to also set some defaults. To avoid regression during updating for server owners who do not yet have this option in the config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll see what I can do

@asdfzdfj asdfzdfj force-pushed the new/image-exif-scrubber branch 3 times, most recently from 23e65d0 to 7294902 Compare November 21, 2023 12:16
@asdfzdfj
Copy link
Contributor Author

updates:

  • refactored/rearranged the findOrCreateFrom{Path,Upload} functions (reviewed separately at add image post processing event hook #287, get that merged first)
  • added option to specify different cleaning modes for uploaded and external images,
    defaults to sanitize for uploaded images and none (do nothing) for external images
  • set defaults for configuration such that it shouldn't crash and still functions if config env vars are not set

@asdfzdfj asdfzdfj marked this pull request as ready for review November 22, 2023 17:10
@asdfzdfj asdfzdfj force-pushed the new/image-exif-scrubber branch 4 times, most recently from 5baf07b to d795f48 Compare November 25, 2023 04:51
@e-five256 e-five256 removed the needs feedback Requires a greater consensus to make an informed decision label Nov 25, 2023
Copy link
Member

@e-five256 e-five256 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm to me, ran with both sanitize and scrub on local uploads and it correctly removed gps data in the first test and all the shutter/focal data in the second test

one thing to note is that uploading an image that already was uploaded pre-cleaned seemed to just reuse the same image and not clean it, which I think is expected, I imagine it detects it before running the logic which makes sense, just a note that reusing past uploaded images won't change the metadata after this

add an image metadata sanitizer/cleaner based on exiftool, comes with
following modes configurable via env variables:

- none: disable image metadata cleaning. default for external/federated images
- sanitize: strips just GPS data and what might looks like serial numbers.
  default for uploaded image
- scrub: strip most of metadata save for those needed for proper image
  rendition and some attribution tags

tried to set this up as an optional features, as in it will work
when `exiftool` binary exist and usable and not impede existing image
processing/store flow if wasn't present
@asdfzdfj asdfzdfj merged commit b86a848 into MbinOrg:main Nov 26, 2023
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend Backend related issues and pull requests enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants