Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix key_from_photo in sync, move fingerprinting code to a common function #1389

Closed
RhetTbull opened this issue Jan 15, 2024 · 9 comments
Closed
Labels
refactoring Code that should be refactored

Comments

@RhetTbull
Copy link
Owner

See #1386

Several osxphotos commands need to match photos between libraries. For example:

osxphotos sync
osxphotos import to find duplicates
osxphotos exportdb --migrate-photos-library

The code to do so is slightly different in each case. Create a common fingerprinting function that given a PhotoInfo object or a JSON dump of one, returns the right fingerprint. (Need a better name than fingerprint as this specifically means the fingerprint stored in the database.

@RhetTbull RhetTbull added the refactoring Code that should be refactored label Jan 15, 2024
@msg43
Copy link

msg43 commented Feb 24, 2024

Oh wow, this would be so cool! I don't know much about Github and how this stuff works. Is this the kind of thing that at the bottom of a long wishlist or a the top of a short honeydo list? :)

@msg43
Copy link

msg43 commented Feb 24, 2024

Obviously not meaning to try to push you or anything, I just genuinely have no idea how this works. If it were something you were planning to tackle I would certainly wait for it. Again happy to offer a token of appreciation in whatever way I can. I wish I could help code it....

@RhetTbull
Copy link
Owner Author

For this particular one, it's probably top of the list. No promises on timeframe but maybe a week or two.

@msg43
Copy link

msg43 commented Feb 24, 2024 via email

@RhetTbull
Copy link
Owner Author

See also #939 which would need this

@RhetTbull
Copy link
Owner Author

To implement this, create a new comparison function or class that given a PhotoInfo returns a signature class or tuple. The signature should return (original_filename, fingerprint, cloud GUID, size). Also include a comparison function that compares two PhotoInfo objects using the signature.

  1. If fingerprint matches, the photos are the same
  2. If no fingerprint, check:
  3. filename+cloud GUID then
  4. filename + size

@RhetTbull
Copy link
Owner Author

See also #1374, #1375

@RhetTbull
Copy link
Owner Author

RhetTbull commented Feb 24, 2024

The following osxphotos features need the ability to compare two photos (either an asset in the database and a file on disk or two database assets that might be in two different databases) and determine if they are the same file.

osxphotos sync to match metadata
osxphotos import to find duplicates
osxphotos exportdb --migrate-photos-library
osxphotos compare to implement this feature

The easiest way to do this would be the use the PhotoInfo.fingerprint and fingerprint() to compute the fingerprint of the photo. There are a couple problems with this approach:

  1. Photos computes fingerprints upon import only for Photos and not Videos (edit: this is true on older versions of macOS. On Ventura, it appears videos do contain a fingerprint) My guess is this is to make the UI more responsive on import as the fingerprint (which is actually a hash) would require reading the entire file from disk. However, this only happens once so it would've been nice if Apple implemented this.
  2. I have seen a few instances where a photo asset had a null fingerprint.
  3. Shared photos do not contain a fingerprint.
  4. The fingerprint() function could be used to get the fingerprint for those assets that didn't have one but it requires the original be on the disk which cannot be guaranteed for iCloud libraries that don't use "store originals on this Mac"

Given above, osxphotos needs a unique signature/fingerprint function that can be used across at least the sync, import, exportdb, and compare commands.

The current implementation is:

The exportdb command migration tool does this which is the most robust implementation so far:

    for photo in photosdb.photos():
        photosdb_fingerprint[
            f"{photo.original_filename}:{photo.fingerprint}"
        ] = photo.uuid
        photosdb_cloud_guid[
            f"{photo.original_filename}:{photo.cloud_guid}"
        ] = photo.uuid
        photosdb_name_size[
            f"{photo.original_filename}:{photo.original_filesize}"
        ] = photo.uuid
        if photo.shared:
            photosdb_shared[_shared_photo_key(photo)] = photo.uuid

and the _shared_photo_key() is:

def _shared_photo_key(photo: PhotoInfo | dict[str, Any]) -> str:
    """return a key for matching a shared photo between libraries"""
    photoinfo = photo.asdict() if isinstance(photo, PhotoInfo) else photo
    date = photoinfo.get("date")
    if isinstance(date, datetime.datetime):
        date = date.isoformat()
    return (
        f"{photoinfo.get('cloud_owner_hashed_id')}:"
        f"{photoinfo.get('original_height')}:"
        f"{photoinfo.get('original_width')}:"
        f"{photoinfo.get('isphoto')}:"
        f"{photoinfo.get('ismovie')}:"
        f"{date}"
    )

The sync command does this which is wrong because videos won't have a fingerprint:

def key_from_photo(photo: PhotoInfo) -> str:
    """Return key for photo used to correlate photos between libraries"""
    return f"{photo.fingerprint}:{photo.original_filename}"

The import command uses a FingerprintQuery class that allows lookup in the database by fingerprint (because every photo being imported needs to be checked against every other photo in the library). This uses fingerprint if there is one and if not, defaults to filename:size:

       Note: returns empty list if no possible duplicates found
        """
        # first check by fingerprint
        # Photos stores fingerprint for photos but not videos
        if results := self.photos_by_fingerprint(fingerprint(filepath), in_trash):
            return results

        # if not fingerprint matches, could still be a match based on filename/size
        # this is not 100% perfect but close enough to find likely duplicates
        filename = pathlib.Path(filepath).name
        size = pathlib.Path(filepath).stat().st_size
        if results := self.photos_by_filename_size(filename, size, in_trash):
            return results

        return []

For the import implementation, whatever is chosen for fingerprint needs to be something that can be easily queried against the library because this is done as a just in time SQL read so that the entire library doesn't have to be parsed as in the case of export and other commands. It also needs to be something that can be computed from an asset on disk that's not yet been imported into the library, hence there won't by a cloud GUID, etc. If using "internal" metadata such as width / height, that requires reading the metadata from the file which introduces a dependency on some means to do this (like CGMetaData or exiftool. Only import currently needs this feature and that only works on macOS so CGMetaData might be suitable if support for reading AVAsset metadata were added

Given all the above constraints, the approach used for exportdb might be the simplest:

If photo has fingerprint, use filename:fingerprint
If not, use filename:size
If shared, use shared key.

filename:fingerprint will always work.
filename:size may result in false positives. How likely is this that an asset with

I tested my personal library of 30,000 photos and found 4 cases where filename:size matched but photos had a different fingerprint. All 4 were duplicates that had been shared by others via AirDrop or text message. Not sure why the signatures were different but the images were the same. This is a small data sample but might be "good enough" for this approach. The 4 actually all had fingerprints so these would not have had conflicts with this implementation.

On Ventura it appears that Videos do have a fingerprint so this may be moot. I'll check Sonoma as well.

@RhetTbull
Copy link
Owner Author

It appears Ventura 13.5+ does compute a fingerprint for every asset that's imported to the library. In my library there 28 images without fingerprint and all were syndicated / not saved to the library. This vastly simplifies things for Ventura+ because we can rely on fingerprint alone.

RhetTbull added a commit that referenced this issue Feb 27, 2024
RhetTbull added a commit that referenced this issue Feb 27, 2024
* Implemented signature, #1389 for sync and exportdb

* Fix for linux

* Fix for linux

* Add test for #1389
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactoring Code that should be refactored
Projects
None yet
Development

No branches or pull requests

2 participants