
Duplicate hashing #9

Closed
bitsondatadev opened this issue Nov 28, 2020 · 13 comments · Fixed by #11

Comments

@bitsondatadev
Contributor

bitsondatadev commented Nov 28, 2020

I would like to suggest using a hash to determine whether a file is a duplicate, rather than the file size, as size alone can produce false positives, especially given the number of photos many people typically store in these services.

MD5 is faster than SHA in most cases, so I would recommend we use it.

>>> import hashlib
>>> image_file = open('2017-11-11/20171111_170331.jpg', 'rb').read()
>>> image_file2 = open('Saturday in Rockford/20171111_170331.jpg', 'rb').read()
>>> hashlib.md5(image_file).hexdigest() == hashlib.md5(image_file2).hexdigest()
True

Thoughts?

@bitsondatadev
Contributor Author

Added a simple change. I haven't tested it on a large number of images yet, so we may need to consider performance, especially around larger files like videos.

One alternative could be to set a strict limit: for any file >10 MB or so, we keep the original size comparison, and otherwise we read the file and run MD5. I think the chances of false positives will be lower there, since there are far fewer files in that range.
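
A minimal sketch of that threshold idea, assuming a hypothetical are_duplicates helper and a 10 MB cutoff (both illustrative, not part of the script):

    import hashlib
    import os

    # Hypothetical cutoff - purely for illustration
    SIZE_ONLY_THRESHOLD = 10 * 1024 * 1024  # 10 MB

    def are_duplicates(path_a, path_b):
        """Illustrative helper: size-only check for big files, MD5 for the rest."""
        size_a = os.path.getsize(path_a)
        if size_a != os.path.getsize(path_b):
            return False
        if size_a > SIZE_ONLY_THRESHOLD:
            # large files (e.g. videos): fall back to the original size-only check
            return True
        with open(path_a, 'rb') as a, open(path_b, 'rb') as b:
            return hashlib.md5(a.read()).hexdigest() == hashlib.md5(b.read()).hexdigest()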

@TheLastGimbus
Owner

Generally, I used size comparison because I think the probability of two different photos being identical in size, down to the single byte, is near zero. If you just change one value in the EXIF, say lat: 49.399999, lng: 22.133333 to lat: 49.4, lng: 22.12, it would no longer match. If you find me a real-life example, I will be surprised and instantly add hashes.

The only case I can imagine is where someone drew two pictures of identical lines in Paint, but on different sides, and they happened to have the same size :P

set a strict limit: for any file >10 MB

This seems interesting... but I would say 5 or 2.5 MB.

@bitsondatadev
Contributor Author

I've never seen this with images per se, but I've seen it with files in general. It's very unlikely overall until you start getting more and more data. I have 100 GB of data from the takeout I just performed. Let me run a quick proof of concept on it.

@TheLastGimbus
Owner

If I remember correctly, de-duplicating is done inside "date folders" in the takeout - that's why I don't worry about it as much.

@bitsondatadev
Contributor Author

bitsondatadev commented Nov 28, 2020

Yeah, for this case it is probably okay to leave it as size, but I'd like to generalize this duplicate detection for the album use case as well, which will introduce a lot more usage. There are definitely performance implications to consider, though. If this isn't something you want to add to your version, that is fine too - I can keep it in my fork. I'm already happy you solved the EXIF and general takeout problem to begin with 😁😁.

@TheLastGimbus
Owner

By the way, this is the Stack Overflow answer I based mine on (I should probably include this in the comments):

https://stackoverflow.com/questions/748675/finding-duplicate-files-and-removing-them

The answer there first checks the size, then the first 1 KB, and only then the full hash 👍 (see the sketch after this comment).

If this isn't something you want to add

Oh, I totally do - I'm just afraid it will be too slow (this script already kinda is) - but I'm sure the solution above will do.
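
For reference, here is a minimal sketch of that staged comparison (size first, then the first 1 KB, then the full hash); the helper names are illustrative, not the Stack Overflow code verbatim:

    import hashlib
    import os
    from collections import defaultdict

    def _groups(paths, key_fun):
        """Group paths by key_fun and keep only groups that can hold duplicates."""
        grouped = defaultdict(list)
        for p in paths:
            grouped[key_fun(p)].append(p)
        return [g for g in grouped.values() if len(g) > 1]

    def _hash(path, first_chunk_only=False):
        hashobj = hashlib.sha1()
        with open(path, 'rb') as f:
            if first_chunk_only:
                hashobj.update(f.read(1024))
            else:
                for chunk in iter(lambda: f.read(1024), b''):
                    hashobj.update(chunk)
        return hashobj.digest()

    def find_duplicate_groups(paths):
        # stage 1: cheap size comparison
        candidates = _groups(paths, os.path.getsize)
        # stage 2: hash of the first 1 KB, only within same-size groups
        candidates = [g for grp in candidates
                      for g in _groups(grp, lambda p: _hash(p, True))]
        # stage 3: full-file hash, only for the survivors
        return [g for grp in candidates for g in _groups(grp, _hash)]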

@bitsondatadev
Contributor Author

Oh that's brilliant! I really like that solution. We did something similar to this for our checksums when moving files into Glacier from HDFS.

Yeah, I think since this is a free tool it's okay that it's a little slow initially, as long as it works even on a larger amount of data. We can take a second look at where the performance can be improved. I'm working first on a script to quickly merge my takeout files; then I'll run a POC to determine whether I run into any instance of same-size duplicates that are actually different photos.

@bitsondatadev
Contributor Author

Also there's this little gem.

You increase the risk of false positives because dictionaries hashes_on_1k and hashes_full compare hashes for files which may have different sizes. You should use tuples (size, hash) as keys, not the hash alone. If there are millions of files, the risk is not negligible. – Futal Mar 27 at 10:28

And then someone actually posted a Python 3 version with this optimization:
https://gist.github.com/tfeldmann/fc875e6630d11f2256e746f67a09c1ae

Why not both?
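
To illustrate the (size, hash) point: when the lookup dictionary is global rather than scoped to a single size group, the key should include the size, roughly like this (illustrative names and placeholder paths, not the gist verbatim):

    import hashlib
    import os

    def first_kb_hash(path):
        with open(path, 'rb') as f:
            return hashlib.sha1(f.read(1024)).digest()

    # Keyed by (size, small_hash) so two files of different sizes that happen to
    # share a first-1 KB hash can never be lumped together.
    hashes_on_1k = {}
    for path in ['a.jpg', 'b.jpg']:  # placeholder paths
        key = (os.path.getsize(path), first_kb_hash(path))
        hashes_on_1k.setdefault(key, []).append(path)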

@TheLastGimbus
Owner

If you implement all of this nicely in #11, I will be very happy to merge it ^_^

@bitsondatadev
Contributor Author

bitsondatadev commented Nov 29, 2020

Okay! The proof of concept is complete enough by my standards to warrant this hashing function.

Method

I ran a check that performed the size-based duplicate detection, then ran a hash check on the size duplicates to see whether any of them were false positives. I inserted these entries into a dictionary using the hash as the key (to collapse real duplicates) and the file path as the value. If more than one record remains in the dictionary, the same-size photos are not actually duplicates.

    # Assumes the script's aliased imports: `import os as _os` and
    # `import hashlib as _hashlib`, plus the get_hash helper below
    def find_duplicates(path, filter_fun=lambda file: True):
        hashes_by_size = {}
        # Groups of same-size files that turn out to be different photos
        duplicates = []
        totalFiles = 0

        for dirpath, dirnames, filenames in _os.walk(path):
            for filename in filenames:
                totalFiles += 1
                if not filter_fun(filename):
                    continue
                full_path = _os.path.join(dirpath, filename)
                try:
                    # if the target is a symlink (soft one), this will
                    # dereference it - change the value to the actual target file
                    full_path = _os.path.realpath(full_path)
                    file_size = _os.path.getsize(full_path)
                except OSError:
                    # not accessible (permissions, etc) - skip it
                    continue

                # group files by size; only groups with more than one entry
                # can contain duplicates (real or false positive)
                hashes_by_size.setdefault(file_size, []).append(full_path)

        for size, paths in hashes_by_size.items():
            if len(paths) > 1:
                # hash the first 1 KB of each same-size file; identical hashes
                # collapse into a single entry (real duplicates), so a group that
                # still holds more than one entry is a false positive of the size check
                dupDict = {get_hash(file_path, True): file_path for file_path in paths}

                if len(dupDict) > 1:
                    duplicates.append(" == ".join(dupDict.values()))

        print("Total photo/video files: ", totalFiles)
        print("Number of false positive dups: ", len(duplicates))

        return duplicates

Here is the get_hash function from the SO post

    def get_hash(filename, first_chunk_only=False, hash_algo=_hashlib.sha1):
        hashobj = hash_algo()
        with open(filename, "rb") as f:
            if first_chunk_only:
                hashobj.update(f.read(1024))
            else:
                for chunk in chunk_reader(f):
                    hashobj.update(chunk)
        return hashobj.digest()
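
For completeness, get_hash relies on a chunk_reader helper defined in the same Stack Overflow answer; it is essentially a generator along these lines:

    def chunk_reader(fobj, chunk_size=1024):
        # Yield the file's contents in fixed-size chunks until EOF
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                return
            yield chunk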

Results

My Google Takeout folder, once combined, is 99.16 GB for all of my photos.

Total photo/video files: 62262
Number of false positive dups: 107

That means 107 of 62,262 files, about 0.17%, have a false-positive duplicate. NOTE: This does not double-count the false positives against the actual duplicates, so this is the exact number of photos that are different yet have identical sizes!

Below is an example of two photos that are both 675,952 bytes in size but are different pictures.

IMG_0504
IMG_0214

@bitsondatadev
Contributor Author

Okay, I implemented the multi-stage hashing. Will leave some comments on PR #11.

@TheLastGimbus
Owner

OoOokayy - I'm convinced now 😳 - I'll take a look at PR soon 👍
