
Refactor duplication and Add Albums #36

Merged

Conversation

Contributor

@bitsondatadev bitsondatadev commented Dec 13, 2020

Fixes: #10 #22 #30

This is still in draft for now, just to get my current thoughts/progress out there so we can start discussions. Testing hasn't been done on a big takeout folder, only a smaller subset.

Changes/Method:

  • I've updated deduplication to happen after the exif-fix phase and the file-moving phase.
  • Deduplication now scans globally instead of per date/album folder.
  • You no longer have to remove album folders.
  • Album folders are scanned as well, in case they contain photos that don't exist in the date folders.
  • Once all files are in the output folder, we scan that location for duplicates.
  • Once the duplicates are removed from the output folder, album folders are scanned once again and matched against the file that already exists in the output folder if it is a duplicate. (This part still doesn't check for duplicates yet; I'm still thinking it through.)
  • Albums are currently exported via a json file. This will be easy to adapt to other variations once this code is reviewed, tested, and in place. (A rough sketch of the dedup flow follows this list.)
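A rough sketch of the global dedup step described above, hashing entire files so copies from album folders collapse onto the file already in the output folder (function names here are placeholders, not the PR's actual code):

    import hashlib
    from pathlib import Path

    def full_hash(file: Path) -> str:
        # Hash every byte, so two files count as duplicates only if they are byte-identical.
        return hashlib.sha256(file.read_bytes()).hexdigest()

    def dedup_output_folder(output_dir: Path) -> int:
        """Keep one file per content hash in the output folder; return how many were removed."""
        seen, removed = {}, 0
        for file in sorted(p for p in output_dir.rglob('*') if p.is_file()):
            h = full_hash(file)
            if h in seen:
                file.unlink()
                removed += 1
            else:
                seen[h] = file
        return removed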

@TheLastGimbus
Owner

😍 💖 🥇

Let me know if you need any review/brainstorming, or if there are any parts of the code I could help with without interrupting you.

Comment on lines 465 to 488
print(file)
#print(file)
Owner

We probably need to wrap the whole function in a try, and print the file only in the catch. I will do this when resolving #25
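A minimal sketch of that idea (the function name and body are placeholders, not the project's actual code):

    from pathlib import Path

    def process_file(file: Path) -> bool:
        # Wrap the whole per-file step so the file name is only printed when something fails,
        # keeping the normal log quiet.
        try:
            file.stat()  # stand-in for the real exif-fixing work
            return True
        except Exception as e:
            print(f'Failed on {file}: {e}')
            return False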

Contributor Author

Haha, yeah, I'll leave this alone since it isn't related; I just need the logs to be a little less spammy 😜.

@bitsondatadev bitsondatadev marked this pull request as ready for review December 27, 2020 05:17
@bitsondatadev
Contributor Author

@TheLastGimbus I'm testing this now on a bigger takeout, but you're free to start looking over the changes. They are updated for the latest takeout format.

@bitsondatadev
Contributor Author

Final statistics:
Files copied to target folder: 25277
Removed duplicates: 31
Files where inserting correct exif failed: 1527

  • you have full list in /Users/brian/Downloads/Google Exports/Photos/Takeout 2/Google Photos/failed_inserting_exif.txt
Files where date was set from name of the folder: 367
  • you have full list in /Users/brian/Downloads/Google Exports/Photos/Takeout 2/Google Photos/date_from_folder_name.txt

Ran this on my takeout and it took about 30 min over an 85GB export. Though the stats say 25277 files were copied, my folder is showing 17,461 items of the original 52,579 items, so it seems like some files weren't copied correctly, maybe? I'm seeing 40.81GB in the output folder, which seems a bit small. I'll investigate a bit more tomorrow.

@TheLastGimbus
Owner

TheLastGimbus commented Dec 27, 2020

I noticed that the "files copied" count was a bit incorrect - but for me, it said 1578 instead of 1576, so I didn't care about it 😛
Edit: in v1.2.0

@bitsondatadev
Contributor Author

bitsondatadev commented Dec 28, 2020

Okay, so there are 3046 pictures/videos missing from my output folder that were in the original folder. It seems to be due to a name clash that somehow didn't get resolved by new_name_if_exists (i.e. I didn't find the <name>(#).<suffix> image either). I also thought perhaps I had deduped them under another name, so I checked the remaining 3046 by size, and that was also a miss. The good news is that all the missing files so far seem to be due to this name collision, so I have a feeling they're all caused by the same bug. Will hopefully have something later this week!

EDIT: BTW, I think this is something I've just introduced with my dedup changes, not something that was preexisting, so nobody else needs to worry :).

@bitsondatadev
Contributor Author

Okay, I figured out the issue. It was related to my suspicions of new_name_if_exists.

            if watch_for_duplicates:
                if new_name.stat().st_size == file.stat().st_size:
                    return file

The above code is now unnecessary with the new duplication checks that we do globally, and it was actually throwing off the rename.

https://github.com/bitsondatadev/GooglePhotosTakeoutHelper/blob/a36347fd1ea292065a488ae3c0ce00254d23d80c/google_photos_takeout_helper/__main__.py#L511-L513

Removed that and now it works as expected. Going to run this again on my takeout folder and compare the results.
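For context, a minimal sketch of what a collision-resolving rename helper looks like without that early return (the name follows the snippet above, but this is not the project's exact code):

    from pathlib import Path

    def new_name_if_exists(file: Path, target_dir: Path) -> Path:
        """Return a non-clashing destination path like name(1).jpg, name(2).jpg, ..."""
        candidate = target_dir / file.name
        i = 1
        while candidate.exists():
            # Keep incrementing instead of assuming an equal-sized file is a duplicate;
            # the global hash-based dedup pass handles real duplicates later.
            candidate = target_dir / f'{file.stem}({i}){file.suffix}'
            i += 1
        return candidate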

@TheLastGimbus
Owner

TheLastGimbus commented Jan 1, 2021

As I understand, I can start reviewing this and it's almost ready to merge? 👀

Ps. Happy new year 🎉

@bitsondatadev
Contributor Author

Happy New Year!! 🎊 Yes I am running now and will compare the results shortly. It worked on a small dataset when I removed this! Here's hoping for no more artifacts.🤞

@bitsondatadev
Contributor Author

Okay, this is looking a little better:

Final statistics:
Files copied to target folder: 25277
Removed duplicates: 4720
Files where inserting correct exif failed: 1527

  • you have full list in /Users/brian/Downloads/Google Exports/Photos/Takeout 2/Google Photos/failed_inserting_exif.txt
Files where date was set from name of the folder: 315
  • you have full list in /Users/brian/Downloads/Google Exports/Photos/Takeout 2/Google Photos/date_from_folder_name.txt

All the files that had been missing before are now there. I don't see any other artifacts. It looks like the number of photos is a little under half, which makes sense given the copies that were maintained in various albums. I will keep searching through, but I think it's safe to say that you can start reviewing, @TheLastGimbus.

@bitsondatadev bitsondatadev force-pushed the handle-albums-with-hash branch 3 times, most recently from 03104a3 to d863bd3 Compare January 1, 2021 21:24
@TheLastGimbus
Owner

Files where date was set from name of the folder: 315

So we still have that issue to solve?

Comment on lines 267 to 268
#TODO reconsider now that we're searching globally
#check which duplicate has best exif?
Owner

Um

for files in files_by_full_hash

If two files have an identical full hash, they also have identical exif - do they?

We can select the ones that have a corresponding json, if that's what you meant

Contributor Author

That's correct, I'll remove this comment.

Contributor Author

I was trying to cover an edge case where there was none 🤪.

Owner

We can probably still

select ones that have corresponding json

Just select the one with the shortest name (so no "(1)" stuff)
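A minimal sketch of that selection rule (an illustration only; the helper name is an assumption):

    from pathlib import Path

    def pick_best_duplicate(files):
        """From a group of byte-identical copies, prefer one with a .json sidecar next to it."""
        for file in files:
            if list(Path(file).parent.glob(Path(file).stem + '*.json')):
                return file
        # Otherwise fall back to the shortest name, so 'photo.jpg' beats 'photo(1).jpg'.
        return min(files, key=lambda f: len(Path(f).name))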

Contributor Author

I'll just keep it a generic "optimize in some way" lol

Contributor Author

@bitsondatadev bitsondatadev left a comment

Files where date was set from name of the folder: 315

So we still have that issue to solve?

So that's not a bug. There are a lot of edge cases that can cause this; one example is stupid Google json naming:

ls "/Users/brian/Downloads/Google Exports/Photos/Takeout 2/Google Photos/Photos from 2013/EntityRelationshipDiagram"*
/Users/brian/Downloads/Google Exports/Photos/Takeout 2/Google Photos/Photos from 2013/EntityRelationshipDiagram.jpeg.json
/Users/brian/Downloads/Google Exports/Photos/Takeout 2/Google Photos/Photos from 2013/EntityRelationshipDiagram.jpg

Notice that, for some reason, Google didn't use the same extension in the json name: .jpeg vs the image's .jpg.

We'll probably want to tackle these one by one, and that may be a little out of scope for this PR.

Comment on lines 267 to 268
#TODO reconsider now that we're searching globally
#check which duplicate has best exif?
Contributor Author

I'll just keep it a generic "optimize in some way" lol

@bitsondatadev
Contributor Author

@TheLastGimbus I made one last fix to how albums.json was getting generated. There was an issue where we would write the name of a duplicate we deleted rather than the file that actually existed in the final "cut". I had to add one more dictionary to hold rename info, but it has a relatively small footprint.

There are still issues, as I pointed out above, due to Google's naming, and those should get addressed in other PRs. Let me know if you see anything else; otherwise I would say it's ready to merge.
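A minimal sketch of that idea (the dictionary and function names are illustrative, not the PR's actual identifiers):

    # Maps a file's original name to the name it ended up with in the output folder
    # after renames and dedup, so albums.json always references the surviving file.
    renames = {}

    def record_final_name(original_name, final_name):
        renames[original_name] = final_name

    def name_for_album(original_name):
        return renames.get(original_name, original_name)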

README.md (review thread resolved)
@bitsondatadev
Contributor Author

Almost forgot to update the README. :) Now it's really ready!

@TheLastGimbus
Owner

By the way:

Diagram.jpeg.json
Diagram.jpg

This is outstanding 0_o 🔥

if len(full_hash_files) != 1:
    print(
        "full_hash_files list should only be one after duplication removal, bad state")
    exit()
Owner

You've inserted exit() here, but because there is no exit code and it sits inside a dummy "catch all", it doesn't (I think) actually stop the script.

I'm already heavily editing this part; I'll want you to review it after I'm done.
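For reference, a bare except also swallows the SystemExit that exit() raises, which may be why the script kept going; a small standalone illustration (not the project's code):

    import sys

    try:
        sys.exit(1)              # raises SystemExit(1)
    except:                      # a bare except catches SystemExit too,
        print('still running')   # so execution continues here

    try:
        sys.exit(1)
    except Exception:            # Exception does not cover SystemExit,
        pass                     # so this time the script really exits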

@@ -233,38 +268,30 @@ def find_duplicates(path: Path, filter_fun=lambda file: True):

files_by_full_hash[full_hash].append(file)
Owner

files_by_full_hash is only appended to in find_duplicates...

So populate_album_map can only be run when --keep-duplicates wasn't set, right?

We could also add some logic to run find_duplicates externally...

I know, all of these options and flags complicate everything 😕 I wanted to keep them in place in case some functions break, so I could just disable them, but now they just break existing options 😆

Contributor Author

I'm gonna make a suggestion, since I don't have much time recently (due to work and my newborn) to look through all the if/else scenarios:

  1. We get rid of the --keep-duplicates option (always remove duplicates), --dont-copy (always copy), and --dont-fix (always fix). I kind of made those assumptions when writing this code, tbh.

  2. I explain the rest of my PR here and pass the torch to someone else to handle these options.

Owner

@TheLastGimbus TheLastGimbus left a comment

So I:

  • added a flag to disable albums
  • removed unnecessary indentations and try-catch-alls - in:
    • get_date_from_folder_meta - all of these catch-alls make the code unsafe - I'd rather people spam me with printed errors than have some super-hard-to-find bugs. find_album_meta_json_file already makes sure that it gives us a valid album file 🙆
    • populate_album_map - again, overly broad catches could go really wrong

Please review these and check that everything is fine with them.

Before I did it, I got such logs at the end:

full_hash_files list should only be one after duplication removal, bad state
full_hash_files list should only be one after duplication removal, bad state
full_hash_files list should only be one after duplication removal, bad state
full_hash_files list should only be one after duplication removal, bad state
[ and so on for 41 lines ]

I fixed exit(), and now there is only one such message and the script quits.

I don't really know why this happens - maybe you know better, or I'll look at it tomorrow.

if full_hash is not None and full_hash in files_by_full_hash:
    full_hash_files = files_by_full_hash[full_hash]
    if len(full_hash_files) != 1:
        print("full_hash_files list should only be one after duplication removal, bad state")
Owner

This is always printed for me 😕

Contributor Author

was --keep-duplicates set to true?

EDIT: I said false at first; I meant true.

Owner

Nope, just standard -i -o...

Contributor Author

Okay, well, this means that this isn't getting called:

 # Removes all duplicates in folder
    def remove_duplicates(dir: Path):
        find_duplicates(dir, lambda f: (is_photo(f) or is_video(f)))
        nonlocal s_removed_duplicates_count

        # Now we have populated the final multimap of absolute dups, We now can attempt to find the original file
        # and remove all the other duplicates
        for files in files_by_full_hash.values():
            if len(files) < 2:
                continue  # this file size is unique, no need to spend cpu cycles on it

            s_removed_duplicates_count += len(files) - 1
            for file in files:
                # TODO reconsider which dup we delete these now that we're searching globally?
                if len(files) > 1:
                    file.unlink()
                    files.remove(file)
        return True

Notice that we loop through the files and only keep the last one; that's why I make that assertion. So somehow we're avoiding that call. It gets called with just -i -o on my end and I don't get those messages, so I'm not sure how our code is different. I pulled the latest with no local changes :/

    if not args.keep_duplicates:
        print('=====================')
        print('Removing duplicates...')
        print('=====================')
        remove_duplicates(
            dir=FIXED_DIR
        )

AFAIK, it would only be skipped if --keep-duplicates was true. See, I edited the comment above.
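For reference, a minimal sketch of a dedup pass that keeps exactly one file per hash group (an illustration of the idea being discussed, not the PR's code):

    from pathlib import Path

    def remove_duplicate_files(files_by_full_hash):
        """For each group of byte-identical files, keep the first and delete the rest."""
        removed = 0
        for files in files_by_full_hash.values():
            if len(files) < 2:
                continue
            keep, duplicates = files[0], files[1:]
            for dup in duplicates:
                Path(dup).unlink()
                removed += 1
            # Shrink the group in place so later passes (e.g. album matching) see only the kept file.
            files[:] = [keep]
        return removed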

Comment on lines +540 to +544
else:
    print('ERROR! There was literally no option to set date!!!')
    # TODO
    print('TODO: We should do something about this - move it to some separate folder, or write it down in '
          'another .txt file...')
Owner

In such a tragic scenario, I think we should do something different - I think a .txt file and a BIG disclaimer at the end will be enough.

@bitsondatadev
Contributor Author

By the way:

Diagram.jpeg.json
Diagram.jpg

This is outstanding 0_o 🔥

This solves that, if we want to add it, with one artifact: if I have two identical files with different extensions in the same folder, then this is still an issue.

    # Returns json dict
    def find_json_for_file(file: Path):
        potential_json = list(file.parent.rglob(file.stem + ".*json"))
        if len(potential_json) != 0:
            try:
                with open(potential_json[0], 'r') as f:
                    dict = _json.load(f)
                return dict
            except:
                raise FileNotFoundError(f"Couldn't find json for file: {file}")
        else:
            raise FileNotFoundError(f"Couldn't find json for file: {file}")

Oh... it was me...
Contributor Author

@bitsondatadev bitsondatadev left a comment

@TheLastGimbus you crack me up XD

@bitsondatadev
Contributor Author

bitsondatadev commented Jan 10, 2021

Uhhh, @TheLastGimbus, for some reason I no longer see Photos as an option on takeout.google.com. WTF?

Update: sorry, I was on my corporate google account. XD

@TheLastGimbus
Owner

Honestly, I wouldn't be surprised...

3d66cd1

@TheLastGimbus
Owner

You know what, when I disabled albums, the script works fine. Probably no one is gonna use albums in .json for now, so let's just get this closed!

I just removed the unnecessary options (you were pretty much right about them) and made albums optional 🎉

Thank you again for fixing this!

I will do the rest of the PRs (fixing json finding) myself... probably...

@TheLastGimbus TheLastGimbus merged commit d4ca298 into TheLastGimbus:master Jan 10, 2021
@bitsondatadev
Contributor Author

My pleasure!! Thanks for your continued work on this!! If I get more time I'll try to circle back and help out where I can!

@bitsondatadev
Contributor Author

Last thing I'd like to ask here: could anyone who sees this give a star to the GitHub project for Trino? https://github.com/trinodb/trino

This project was originally called PrestoSQL and is maintained by the original creators of Presto. They had to rename the project because Facebook, with the backing of the Linux Foundation, enforced the trademark. Read more here:

https://trino.io/blog/2020/12/27/announcing-trino.html

The project was formed to maintain a pure and healthy open source project and needs support to build more awareness than its Presto predecessor.

Thank you!

@TheLastGimbus
Owner

TheLastGimbus commented Feb 14, 2021

!time

Edit: Just curiosity. Huh, so it's pretty much the same... so my CI is useless 😆

@github-actions

30 times: 21 seconds, 1 time: 0.7000 seconds, 11.000 ms per file
