Duplicate hashing #9
I would like to suggest using a hash to determine whether files are duplicates, rather than the file size, as size alone can produce false positives, especially given the number of photos many people typically store in these services.
MD5 is faster than SHA in most cases, so I would recommend we use it.
Thoughts?
Comments
Added a simple change. I haven't tested it on a large number of images yet, so we may need to consider performance, especially around larger files like videos.
One alternative could be to set a strict limit: for any file over 10 MB or so we keep using the size as the hash, and otherwise we read the file and run MD5. I think the chances of false positives are lower there, since there are far fewer files in that range.
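A minimal sketch of that hybrid idea (the cutoff, chunk size, and function name here are illustrative assumptions, not settled values):

```python
import hashlib
import os

SIZE_ONLY_THRESHOLD = 10 * 1024 * 1024  # assumed 10 MB cutoff; tune as needed

def duplicate_key(path):
    """Key used to group candidate duplicates: size alone for large files,
    a full MD5 (read in chunks to bound memory) for everything else."""
    size = os.path.getsize(path)
    if size > SIZE_ONLY_THRESHOLD:
        return ("size", size)
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            md5.update(chunk)
    return ("md5", md5.hexdigest())
```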
Generally, I used size comparison because I think the probability of two different photos being identical in size, down to the single byte, is near zero. Like, if you just change one value in EXIF, let's say, the size already changes. The only case I can imagine is someone drawing two pictures of identical lines in Paint, but on different sides, that happen to come out the same size :P
This seems interesting... but I would set the limit at 5 MB or 2.5 MB.
I've never seen this with images per se, but I've seen it with files in general. It's very unlikely overall until you start accumulating more and more data. I have 100 GB of data from the takeout I just performed. Let me run a quick proof of concept there.
If I remember correctly, de-duplication is done inside "date folders" in takeout, which is why I don't worry about it as much.
Yeah, for this case it is probably okay to leave it as size, but I'd like to generalize this duplicate detection for the album use case as well, which will introduce a lot more usage. There are definitely performance implications to be considered, though. If this isn't something you want to add to your version, that is fine too; I can keep it in my fork. I'm already happy you solved the EXIF and general takeout problems to begin with 😁😁
By the way, this is the Stack Overflow answer I based this on (I should probably reference it in the comments): https://stackoverflow.com/questions/748675/finding-duplicate-files-and-removing-them The answer there first checks the size, then the first 1 KB, and only then the whole hash 👍
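Roughly, that staged filtering looks like the sketch below (MD5, the 1 KB first chunk, and the function names are my assumptions here, not the exact code from the answer):

```python
import hashlib
import os
from collections import defaultdict

def first_chunk_hash(path, chunk_size=1024):
    # Cheap second-stage filter: hash only the first 1 KB.
    with open(path, "rb") as f:
        return hashlib.md5(f.read(chunk_size)).digest()

def full_hash(path):
    # Final stage: hash the whole file, reading in chunks.
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            md5.update(chunk)
    return md5.digest()

def find_duplicates(paths):
    # Stage 1: group by file size; unique sizes cannot be duplicates.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    # Stage 2: among same-size files, group by hash of the first 1 KB.
    by_chunk = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue
        for p in group:
            by_chunk[(size, first_chunk_hash(p))].append(p)

    # Stage 3: only survivors of both filters pay for a full-file hash.
    by_full = defaultdict(list)
    for group in by_chunk.values():
        if len(group) < 2:
            continue
        for p in group:
            by_full[full_hash(p)].append(p)

    return [group for group in by_full.values() if len(group) > 1]
```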
Oh I totally do - I'm just afraid it might be too slow (this script already kind of is) - but I'm sure the solution above will do.
Oh, that's brilliant! I really like that solution. We did something similar for our checksums when moving files into Glacier from HDFS. And yeah, for a free option I think it's okay that it's a little slow initially, as long as it works even on a larger amount of data; we can take a second look at where the performance can be addressed. I'm working first on a script to quickly merge my takeout files, then I'll run a POC to determine whether I hit any instances of size duplicates that are actually different photos.
Also there's this little gem.
And then someone actually posted the Python 3 version with this optimization. Why not both?
If you implement all of this nicely in #11, I will be very happy to merge it ^_^
Okay! The proof of concept is complete enough by my standards to warrant this hashing function.

Method

I ran a check that did the size-duplicate comparison, then performed a hash check on the size duplicates to see whether any photos were false-positive duplicates. I inserted these entries into a dictionary using the hash as the key (to collapse real duplicates) and the file path as the value. If more than one record remains in the dictionary for a given size, those photos are not actual duplicates.
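For the curious, the check was along these lines (a sketch of what I describe above; MD5 and the names are my own choices, not the exact POC code):

```python
import hashlib
import os
from collections import defaultdict

def find_false_positives(paths):
    # Group files by size, i.e. the current duplicate criterion.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    false_positives = {}
    for size, group in by_size.items():
        if len(group) < 2:
            continue
        # Key by full-file hash so real duplicates collapse to one entry.
        by_hash = {}
        for p in group:
            with open(p, "rb") as f:
                by_hash[hashlib.md5(f.read()).hexdigest()] = p
        # More than one distinct hash left => same size, different photos.
        if len(by_hash) > 1:
            false_positives[size] = list(by_hash.values())
    return false_positives
```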
Here is the get_hash function from the SO post
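(Reproduced from memory of the linked answer rather than verbatim, so treat it as a sketch; I believe the original defaults to SHA-1 and a 1 KB chunk size.)

```python
import hashlib

def chunk_reader(fobj, chunk_size=1024):
    """Generator that reads a file in chunks of bytes."""
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            return
        yield chunk

def get_hash(filename, first_chunk_only=False, hash_algo=hashlib.sha1):
    # Hash either just the first 1 KB (cheap pre-filter) or the whole file.
    hashobj = hash_algo()
    with open(filename, "rb") as f:
        if first_chunk_only:
            hashobj.update(f.read(1024))
        else:
            for chunk in chunk_reader(f):
                hashobj.update(chunk)
    return hashobj.digest()
```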
Results

My Google takeout folder, once combined, is 99.16 GB for all of my photos.
That works out to 0.17% of files having a false-positive duplicate. NOTE: this does not double-count the false positives against the actual duplicates, so this is the exact number of photos that are different despite having equivalent sizes! Below is an example of two photos that are both 675,952 bytes but are different pictures.
Okay, I implemented the multi-stage hashing. I will leave some comments on PR #11.
OoOokayy - I'm convinced now 😳 - I'll take a look at the PR soon 👍