
Duplicate hashing #9

Closed
bitsondatadev opened this issue Nov 28, 2020 · 13 comments · Fixed by #11

Comments

@bitsondatadev
Contributor

bitsondatadev commented Nov 28, 2020

I would like to suggest using a hash to determine whether a file is a duplicate, rather than the file size, as size alone can produce false positives, especially given the number of photos many people typically store in these services.

MD5 is faster than SHA in most cases, so I would recommend we use it.

>>> import hashlib
>>> image_file = open('2017-11-11/20171111_170331.jpg', 'rb').read()
>>> image_file2 = open('Saturday in Rockford/20171111_170331.jpg', 'rb').read()
>>> hashlib.md5(image_file).hexdigest() == hashlib.md5(image_file2).hexdigest()
True

Thoughts?

@bitsondatadev
Contributor Author

Added a simple change. I haven't tested it on a large number of images yet, so we may need to consider performance, especially around larger files like videos.

One alternative could be to set a strict limit: for any file >10 MB or so, we keep the original size comparison, and otherwise we read the file and run MD5. I think the chances of false positives will be lower there, since there are far fewer files in that range.
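
A minimal sketch of that threshold idea, assuming a hypothetical are_duplicates helper and a 10 MB cutoff (both illustrative, not part of the script):

    import hashlib
    import os

    # Hypothetical cutoff - purely for illustration
    SIZE_ONLY_THRESHOLD = 10 * 1024 * 1024  # 10 MB

    def are_duplicates(path_a, path_b):
        """Illustrative helper: size-only check for big files, MD5 for the rest."""
        size_a = os.path.getsize(path_a)
        if size_a != os.path.getsize(path_b):
            return False
        if size_a > SIZE_ONLY_THRESHOLD:
            # large files (e.g. videos): fall back to the original size-only check
            return True
        with open(path_a, 'rb') as a, open(path_b, 'rb') as b:
            return hashlib.md5(a.read()).hexdigest() == hashlib.md5(b.read()).hexdigest()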

@TheLastGimbus
Owner

Generally, I used size comparison because I think the probability of two different photos being identical in size, down to the single byte, is near zero. If you just change one value in the EXIF, say lat: 49.399999, lng: 22.133333 to lat: 49.4, lng: 22.12, it would no longer match. If you find me a real-life example, I will be surprised and instantly add hashes.

The only case I can imagine is where someone drew two pictures of identical lines in Paint, but on different sides, and they happened to have the same size :P

set a strict limit: for any file >10 MB

This seems interesting... but I would say 5 or 2.5 MB.

@bitsondatadev
Contributor Author

I've never seen this with images per se, but I've seen it with files in general. It's very unlikely overall until you start getting more and more data. I have 100 GB of data from the takeout I just performed. Let me run a quick proof of concept on it.

@TheLastGimbus
Owner

If I remember correctly, de-duplicating is done inside "date folders" in the takeout - that's why I don't worry about it as much.

@bitsondatadev
Contributor Author

bitsondatadev commented Nov 28, 2020

Yeah, for this case it is probably okay to leave it as size, but I'd like to generalize this duplicate detection for the album use case as well, which will introduce a lot more usage. There are definitely performance implications to consider, though. If this isn't something you want to add to your version, that is fine too - I can keep it in my fork. I'm already happy you solved the EXIF and general takeout problem to begin with 😁😁.

@TheLastGimbus
Owner

By the way, this is the Stack Overflow answer I based mine on (I should probably include this in the comments):

https://stackoverflow.com/questions/748675/finding-duplicate-files-and-removing-them

The answer there first checks the size, then the first 1 KB, and only then the full hash 👍 (see the sketch after this comment).

If this isn't something you want to add

Oh, I totally do - I'm just afraid it will be too slow (this script already kinda is) - but I'm sure the solution above will do.
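
For reference, here is a minimal sketch of that staged comparison (size first, then the first 1 KB, then the full hash); the helper names are illustrative, not the Stack Overflow code verbatim:

    import hashlib
    import os
    from collections import defaultdict

    def _groups(paths, key_fun):
        """Group paths by key_fun and keep only groups that can hold duplicates."""
        grouped = defaultdict(list)
        for p in paths:
            grouped[key_fun(p)].append(p)
        return [g for g in grouped.values() if len(g) > 1]

    def _hash(path, first_chunk_only=False):
        hashobj = hashlib.sha1()
        with open(path, 'rb') as f:
            if first_chunk_only:
                hashobj.update(f.read(1024))
            else:
                for chunk in iter(lambda: f.read(1024), b''):
                    hashobj.update(chunk)
        return hashobj.digest()

    def find_duplicate_groups(paths):
        # stage 1: cheap size comparison
        candidates = _groups(paths, os.path.getsize)
        # stage 2: hash of the first 1 KB, only within same-size groups
        candidates = [g for grp in candidates
                      for g in _groups(grp, lambda p: _hash(p, True))]
        # stage 3: full-file hash, only for the survivors
        return [g for grp in candidates for g in _groups(grp, _hash)]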

@bitsondatadev
Contributor Author

Oh that's brilliant! I really like that solution. We did something similar to this for our checksums when moving files into Glacier from HDFS.

Yeah, I think since this is a free tool it's okay that it's a little slow initially, as long as it works even on a larger amount of data. We can take a second look at where the performance can be improved. I'm working first on a script to quickly merge my takeout files; then I'll run a POC to determine whether I run into any instance of same-size duplicates that are actually different photos.

@bitsondatadev
Contributor Author

Also there's this little gem.

You increase the risk of false positives because dictionaries hashes_on_1k and hashes_full compare hashes for files which may have different sizes. You should use tuples (size, hash) as keys, not the hash alone. If there are millions of files, the risk is not negligible. – Futal Mar 27 at 10:28

And then someone actually posted a Python 3 version with this optimization:
https://gist.github.com/tfeldmann/fc875e6630d11f2256e746f67a09c1ae

Why not both?
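
To illustrate the (size, hash) point: when the lookup dictionary is global rather than scoped to a single size group, the key should include the size, roughly like this (illustrative names and placeholder paths, not the gist verbatim):

    import hashlib
    import os

    def first_kb_hash(path):
        with open(path, 'rb') as f:
            return hashlib.sha1(f.read(1024)).digest()

    # Keyed by (size, small_hash) so two files of different sizes that happen to
    # share a first-1 KB hash can never be lumped together.
    hashes_on_1k = {}
    for path in ['a.jpg', 'b.jpg']:  # placeholder paths
        key = (os.path.getsize(path), first_kb_hash(path))
        hashes_on_1k.setdefault(key, []).append(path)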

@TheLastGimbus
Owner

If you implement all of this nicely in #11, I will be very happy to merge it ^_^

@bitsondatadev
Contributor Author

bitsondatadev commented Nov 29, 2020

Okay! The proof of concept is complete enough by my standards to warrant this hashing function.

Method

I ran a check that performed the size-based duplicate detection, then ran a hash check on the size duplicates to see whether any of them were false positives. I inserted these entries into a dictionary using the hash as the key (to collapse real duplicates) and the file path as the value. If more than one record remains in the dictionary, the same-size photos are not actually duplicates.

    # Assumes the script's aliased imports: `import os as _os` and
    # `import hashlib as _hashlib`, plus the get_hash helper below
    def find_duplicates(path, filter_fun=lambda file: True):
        hashes_by_size = {}
        # Groups of same-size files that turn out to be different photos
        duplicates = []
        totalFiles = 0

        for dirpath, dirnames, filenames in _os.walk(path):
            for filename in filenames:
                totalFiles += 1
                if not filter_fun(filename):
                    continue
                full_path = _os.path.join(dirpath, filename)
                try:
                    # if the target is a symlink (soft one), this will
                    # dereference it - change the value to the actual target file
                    full_path = _os.path.realpath(full_path)
                    file_size = _os.path.getsize(full_path)
                except OSError:
                    # not accessible (permissions, etc) - skip it
                    continue

                # group files by size; only groups with more than one entry
                # can contain duplicates (real or false positive)
                hashes_by_size.setdefault(file_size, []).append(full_path)

        for size, paths in hashes_by_size.items():
            if len(paths) > 1:
                # hash the first 1 KB of each same-size file; identical hashes
                # collapse into a single entry (real duplicates), so a group that
                # still holds more than one entry is a false positive of the size check
                dupDict = {get_hash(file_path, True): file_path for file_path in paths}

                if len(dupDict) > 1:
                    duplicates.append(" == ".join(dupDict.values()))

        print("Total photo/video files: ", totalFiles)
        print("Number of false positive dups: ", len(duplicates))

        return duplicates

Here is the get_hash function from the SO post

    def get_hash(filename, first_chunk_only=False, hash_algo=_hashlib.sha1):
        hashobj = hash_algo()
        with open(filename, "rb") as f:
            if first_chunk_only:
                hashobj.update(f.read(1024))
            else:
                for chunk in chunk_reader(f):
                    hashobj.update(chunk)
        return hashobj.digest()
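
For completeness, get_hash relies on a chunk_reader helper defined in the same Stack Overflow answer; it is essentially a generator along these lines:

    def chunk_reader(fobj, chunk_size=1024):
        # Yield the file's contents in fixed-size chunks until EOF
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:
                return
            yield chunk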

Results

My Google Takeout folder, once combined, is 99.16 GB for all of my photos.

Total photo/video files: 62262
Number of false positive dups: 107

That means 107 of 62,262 files, about 0.17%, have a false-positive duplicate. NOTE: This does not double-count the false positives against the actual duplicates, so this is the exact number of photos that are different yet have identical sizes!

Below is an example of two photos that are both 675,952 bytes in size but are different pictures.

IMG_0504
IMG_0214

@bitsondatadev
Contributor Author

Okay, I implemented the multi-stage hashing. Will leave some comments on PR #11.

@TheLastGimbus
Owner

OoOokayy - I'm convinced now 😳 - I'll take a look at PR soon 👍
