Refactor duplication and Add Albums #36
Conversation
😍 💖 🥇 Let me know if you need any review/brainstorming, or have any parts of the code I could help with without interrupting you.
```diff
-print(file)
+#print(file)
```
We probably need to wrap the whole function in a try, and print the file only in the catch. I will do this when resolving #25
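The suggested wrapping might look like the sketch below. This is my reading of the comment, not the PR's actual code; `run_logged` and the `worker` callable are hypothetical names standing in for whatever per-file fixing the script does.

```python
from pathlib import Path

def run_logged(file: Path, worker) -> None:
    """Wrap the whole per-file operation in a try, printing the file
    only in the catch, so logs stay quiet on success."""
    try:
        worker(file)
    except Exception as e:
        print(f'Failed on {file}: {e}')
        raise  # re-raise so the failure is still visible to callers
```

A failing worker still raises, but now the offending file name is in the log instead of being printed for every file.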
Haha, yeah, I will leave this alone since it isn't related; I just need the logs to be a little less spammy 😜.
Force-pushed from 69cf077 to 8fc32c8
@TheLastGimbus I'm testing this now on a bigger takeout, but you're free to start looking over the changes. They are updated for the latest takeout format.
Force-pushed from 8fc32c8 to fbddfba
Final statistics:
Ran this on my takeout and it took about 30 min over an 85GB archive. Though it says the files were copied, my folder is showing 17,461 items of the original 52,579 items. So it seems like some files weren't copied correctly, maybe? I'm seeing 40.81GB in the output folder. Seems a bit small... I'll investigate a bit more tomorrow.
I noticed that the "file copied count" was a bit off too, but for me it said 1578 instead of 1576, so I didn't care about it 😛
Okay, so there are 3046 pictures/videos missing in my output folder that were in the original folder. It seems to be due to a name clash that somehow didn't get resolved. EDIT: BTW, I think this is something I've just introduced with my dedup stuff, not something that was preexisting, so nobody else should worry :).
Okay, I figured out the issue. It was related to my suspicions of:

```python
if watch_for_duplicates:
    if new_name.stat().st_size == file.stat().st_size:
        return file
```

The above code is now unnecessary with the new duplication checks that we do globally, and it was actually throwing off the rename. Removed that and now it works as expected. Going to run again on my takeout folder and compare again.
As I understand, I can start reviewing this and it's almost ready to merge? 👀 PS. Happy New Year 🎉
Happy New Year!! 🎊 Yes, I am running now and will compare the results shortly. It worked on a small dataset when I removed this! Here's hoping for no more artifacts. 🤞
Force-pushed from fbddfba to 9a7f34f
Okay, this is looking a little better. Final statistics:
All previous files that had been missing are now there. I don't see any other artifacts. It looks like the number of photos is a little under half, which makes sense due to the copies that were maintained in various albums. I will keep searching through, but I think it's safe to say that you can start reviewing, @TheLastGimbus.
Force-pushed from 03104a3 to d863bd3
So we still have that issue to solve?
```python
# TODO reconsider now that we're searching globally
# check which duplicate has best exif?
```
Um, in `for files in files_by_full_hash`: if two files have an identical full hash, they also have identical exif, don't they? We can select the ones that have a corresponding json found, if that's what you meant.
That's correct, I'll remove this comment.
I was trying to make an edge case where there was none 🤪.
We can probably still select the ones that have a corresponding json. Just select the one with the shortest name (so no "(1)" stuff).
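That heuristic could be sketched like this. `pick_original` and the `<name>.json` sidecar convention are my assumptions for illustration, not code from this PR:

```python
from pathlib import Path

def pick_original(duplicates: list) -> Path:
    """Among byte-identical duplicates, prefer a file whose '<name>.json'
    sidecar exists; break ties with the shortest name, so 'photo.jpg'
    wins over 'photo(1).jpg'."""
    with_json = [f for f in duplicates
                 if f.with_name(f.name + '.json').exists()]
    candidates = with_json or duplicates
    return min(candidates, key=lambda f: (len(f.name), f.name))
```

Since the files are byte-identical, any choice is safe; this just keeps the tidiest name and the one most likely to carry Google's metadata.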
I'll just keep it a generic "optimize in some way" lol
> Files where date was set from name of the folder: 315

So we still have that issue to solve?
So that's not a bug. There are a lot of edge cases that can cause this; one example is stupid Google json naming:

```
ls "/Users/brian/Downloads/Google Exports/Photos/Takeout 2/Google Photos/Photos from 2013/EntityRelationshipDiagram"*
/Users/brian/Downloads/Google Exports/Photos/Takeout 2/Google Photos/Photos from 2013/EntityRelationshipDiagram.jpeg.json
/Users/brian/Downloads/Google Exports/Photos/Takeout 2/Google Photos/Photos from 2013/EntityRelationshipDiagram.jpg
```

Notice that, for some reason, Google didn't use the same title: jpg vs jpeg. We'll probably want to tackle these one by one, and it may be a little out of scope for this PR.
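A lookup tolerant to that jpg/jpeg mismatch could be sketched as follows. This is a guess at a future fix, not this PR's implementation; the function name is made up:

```python
from pathlib import Path
from typing import Optional

def find_sidecar_json(photo: Path) -> Optional[Path]:
    """Find Google's metadata json for a photo: try the exact
    '<name>.json' first, then fall back to any json sharing the stem,
    which catches 'IMG.jpeg.json' sitting next to 'IMG.jpg'."""
    exact = photo.with_name(photo.name + '.json')
    if exact.exists():
        return exact
    for candidate in sorted(photo.parent.glob(photo.stem + '.*.json')):
        return candidate  # first match wins
    return None
```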
Force-pushed from d863bd3 to f4d4578
@TheLastGimbus I made one last fix to how albums.json was getting generated. There was an issue where we would write the name of a duplicate we deleted rather than the file that actually existed in the final "cut". I had to add one more dictionary to hold rename info, but it has a relatively small footprint. There are still issues, as I pointed out above, due to Google namings that should get addressed in other PRs. Let me know if you see anything else; otherwise I would say it's ready to merge.
Force-pushed from f4d4578 to c5de860
Almost forgot to update the README. :) Now it's really ready!
By the way: this is outstanding 0_o 🔥
Force-pushed from 643a682 to 63dad25
```python
if len(full_hash_files) != 1:
    print(
        "full_hash_files list should only be one after duplication removal, bad state")
    exit()
```
You've inserted `exit()` here, but because there is no exit code number, it doesn't (I think) report a failure, and it was in a dummy "catch all". I'm already heavily editing this part; I'll want you to review after I'm done.
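If the intent is that the script should visibly fail here (my reading of the comment, not the PR's code), a non-zero exit status would do it; a bare `exit()` returns status 0, which looks like success to shells and CI:

```python
import sys

def bail(message: str) -> None:
    """Print an error and quit with a non-zero exit code so callers
    (shells, CI) can tell the run failed."""
    print(message, file=sys.stderr)
    sys.exit(1)  # sys.exit(1) raises SystemExit with code 1
```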
```diff
@@ -233,38 +268,30 @@ def find_duplicates(path: Path, filter_fun=lambda file: True):
     files_by_full_hash[full_hash].append(file)
```
`files_by_full_hash` is appended to only in `find_duplicates`... So `populate_album_map` can only be run when `--keep-duplicates` wasn't set, right? We can also make some logic to externally run `find_duplicates`... I know, all of these options and flags complicate everything 😕 I wanted to keep them in place in case some functions break, to be able to just disable them, but now they just break existing options 😆
I'm gonna make a suggestion, since I don't have much time recently (due to work and my newborn) to look through all the if/else scenarios:

- We get rid of the --keep-duplicates option (always remove duplicates), --dont-copy (always copy), and --dont-fix (always fix) options. I kind of made those assumptions when writing this code, tbh.
- I explain the rest of my PR here and hand off the torch to someone else to take over and handle these options.
So I:

- added a flag to disable albums
- removed unnecessary indentations and "try-catch all"s in:
  - `get_date_from_folder_meta` - all of these "catch all"s make the code unsafe; I prefer people jamming me with printed errors rather than some super-hard-to-find bugs
  - `find_album_meta_json_file` - it already makes sure that it gives us a valid album file 🙆
  - `populate_album_map` - again, too-broad catches could go really bad

Review these to check if everything is fine with them.
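The spirit of those changes, catching only the errors we expect and letting the rest crash loudly, might look like this sketch (the function and file layout are my assumptions, not the PR's actual code):

```python
import json
from pathlib import Path
from typing import Optional

def read_album_meta(meta_file: Path) -> Optional[dict]:
    """Parse an album metadata json, catching only the failures we can
    anticipate; anything else propagates as a loud, debuggable error
    instead of being swallowed by a bare 'except'."""
    try:
        return json.loads(meta_file.read_text())
    except FileNotFoundError:
        print(f'No album metadata at {meta_file}')
        return None
    except json.JSONDecodeError as e:
        print(f'Malformed album metadata in {meta_file}: {e}')
        return None
```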
Before I did it, I got such logs at the end:

```
full_hash_files list should only be one after duplication removal, bad state
full_hash_files list should only be one after duplication removal, bad state
full_hash_files list should only be one after duplication removal, bad state
full_hash_files list should only be one after duplication removal, bad state
[ and so on for 41 lines ]
```

I fixed `exit()`, and now there is one, and it quits. I don't really know why this happens; maybe you know better, or I'll see it tomorrow.
```python
if full_hash is not None and full_hash in files_by_full_hash:
    full_hash_files = files_by_full_hash[full_hash]
    if len(full_hash_files) != 1:
        print("full_hash_files list should only be one after duplication removal, bad state")
```
This is always printed for me 😕
Was `--keep-duplicates` set to ~~false~~ true? EDIT: I said false; I meant true.
Nope, just standard `-i -o`...
Okay, well, this means that this isn't getting called:

```python
# Removes all duplicates in folder
def remove_duplicates(dir: Path):
    find_duplicates(dir, lambda f: (is_photo(f) or is_video(f)))
    nonlocal s_removed_duplicates_count
    # Now we have populated the final multimap of absolute dups, we now can attempt to find the original file
    # and remove all the other duplicates
    for files in files_by_full_hash.values():
        if len(files) < 2:
            continue  # this file size is unique, no need to spend cpu cycles on it
        s_removed_duplicates_count += len(files) - 1
        for file in files:
            # TODO reconsider which dup we delete these now that we're searching globally?
            if len(files) > 1:
                file.unlink()
                files.remove(file)
    return True
```
Notice, we loop through the files and only keep the last one, and that's why I make that assertion. So somehow we're avoiding that call. It gets called with just `-i -o` on my end and I don't get those messages, so I'm not sure how our code is different. I pulled the latest with no local changes :/
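One possible culprit (my speculation, not something confirmed in this thread): the loop above removes items from `files` while iterating over it, and in Python that silently skips every other element, so more than one duplicate can survive. A safer "keep one, delete the rest" sketch, with hypothetical names:

```python
from pathlib import Path

def remove_all_but_one(files: list) -> Path:
    """Unlink every duplicate except the last and shrink the list to
    match. Iterating over the slice copy files[:-1] avoids mutating
    the list we are looping over."""
    for f in files[:-1]:
        f.unlink()
    del files[:-1]  # the list now holds exactly the surviving Path
    return files[0]
```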
```python
if not args.keep_duplicates:
    print('=====================')
    print('Removing duplicates...')
    print('=====================')
    remove_duplicates(
        dir=FIXED_DIR
    )
```
AFAIK, it would be skipped only if `--keep-duplicates` was true. See, I edited the above comment.
```python
else:
    print('ERROR! There was literally no option to set date!!!')
    # TODO
    print('TODO: We should do something about this - move it to some separate folder, or write it down in '
          'another .txt file...')
```
In such a tragic scenario, I think we should do something different; the .txt and a BIG disclaimer at the end will be enough.
This solves that if we want to add it with one artifact. If I have two identical files with different extensions in the same folder, then this is an issue.
Oh... it was me...
@TheLastGimbus you crack me up XD
Uhhh, @TheLastGimbus, for some reason I no longer see Photos as an option from takeout.google.com. WTF? Update: sorry, I was on my corporate Google account. XD
You know what, when I disabled albums, the script works fine. Probably no one is gonna use albums in... I just removed unnecessary options (you were pretty much right about them) and made albums optional 🎉 Thank you again for fixing this! I will do the rest of the PRs (fix finding json) myself... probably...
My pleasure!! Thanks for your continued work on this!! If I get more time, I'll try to circle back and help out where I can!
Last thing I'd like to ask here: could anyone who sees this give a star to the GitHub project for Trino? https://github.com/trinodb/trino This project was originally called PrestoSQL and is maintained by the original creators of Presto. They had to rename their project due to Facebook, with the backing of the Linux Foundation, enforcing the trademark. Read more here: https://trino.io/blog/2020/12/27/announcing-trino.html The project was formed to maintain a pure and healthy open source project and needs support to get more awareness than the Presto predecessor. Thank you!
`!time` Edit: just curiosity. Huh, so it is pretty much the same... so my CI is useless 😆
30 times 21 seconds, 1 time .7000 seconds, 11.000 ms per file
Fixes: #10 #22 #30
This is still in draft for now, just to get my current thoughts/progress out there so we can start discussions. Testing hasn't been done on a big takeout folder, only a smaller subset.
Changes/Method: