
Refactor duplication and Add Albums #36

Merged

Conversation

Contributor

@bitsondatadev bitsondatadev commented Dec 13, 2020

Fixes: #10 #22 #30

This is still in draft for now, just to get my current thoughts/progress out there so we can start discussions. Testing hasn't been done on a big takeout folder, only a smaller subset.

Changes/Method:

  • I've updated deduplication to happen after the exif-fix phase and the file-moving phase.
  • Deduplication now scans globally instead of per date/album folder.
  • You no longer have to remove album folders.
  • Album folders are scanned as well, in case they contain photos that don't exist in the date folders.
  • Once all files are in the output folder, we scan that location for duplicates.
  • Once the duplicates are removed from the output folder, album folders are scanned once again and matched against the file that already exists in the output folder if it is a duplicate. (This part still doesn't check for duplicates yet; I'm still thinking it through.)
  • Albums are currently exported via a json file. This will be easy to adapt to other variations once this code is reviewed, tested, and in place. (A rough sketch of the dedup flow follows this list.)
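A rough sketch of the global dedup step described above, hashing entire files so copies from album folders collapse onto the file already in the output folder (function names here are placeholders, not the PR's actual code):

    import hashlib
    from pathlib import Path

    def full_hash(file: Path) -> str:
        # Hash every byte, so two files count as duplicates only if they are byte-identical.
        return hashlib.sha256(file.read_bytes()).hexdigest()

    def dedup_output_folder(output_dir: Path) -> int:
        """Keep one file per content hash in the output folder; return how many were removed."""
        seen, removed = {}, 0
        for file in sorted(p for p in output_dir.rglob('*') if p.is_file()):
            h = full_hash(file)
            if h in seen:
                file.unlink()
                removed += 1
            else:
                seen[h] = file
        return removed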

@TheLastGimbus
Owner

😍 💖 🥇

Let me know if you need any review/brainstorming, or if there are any parts of the code I could help with without interrupting you.

Comment on lines 465 to 488
print(file)
#print(file)
Owner

We probably need to wrap the whole function in a try, and print the file only in the catch. I will do this when resolving #25
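A minimal sketch of that idea (the function name and body are placeholders, not the project's actual code):

    from pathlib import Path

    def process_file(file: Path) -> bool:
        # Wrap the whole per-file step so the file name is only printed when something fails,
        # keeping the normal log quiet.
        try:
            file.stat()  # stand-in for the real exif-fixing work
            return True
        except Exception as e:
            print(f'Failed on {file}: {e}')
            return False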

Contributor Author

Haha, yeah, I'll leave this alone since it isn't related; I just need the logs to be a little less spammy 😜.

@bitsondatadev bitsondatadev marked this pull request as ready for review December 27, 2020 05:17
@bitsondatadev
Contributor Author

@TheLastGimbus I'm testing this now on a bigger takeout, but you're free to start looking over the changes. They are updated for the latest takeout format.

@bitsondatadev
Contributor Author

Final statistics:
Files copied to target folder: 25277
Removed duplicates: 31
Files where inserting correct exif failed: 1527

  • you have full list in /Users/brian/Downloads/Google Exports/Photos/Takeout 2/Google Photos/failed_inserting_exif.txt
Files where date was set from name of the folder: 367
  • you have full list in /Users/brian/Downloads/Google Exports/Photos/Takeout 2/Google Photos/date_from_folder_name.txt

Ran this on my takeout and it took about 30 min over an 85GB export. Though the stats say 25277 files were copied, my folder is showing 17,461 items of the original 52,579 items, so it seems like some files weren't copied correctly, maybe? I'm seeing 40.81GB in the output folder, which seems a bit small. I'll investigate a bit more tomorrow.

@TheLastGimbus
Owner

TheLastGimbus commented Dec 27, 2020

I noticed that the "files copied" count was a bit incorrect - but for me, it said 1578 instead of 1576, so I didn't care about it 😛
Edit: in v1.2.0

@bitsondatadev
Contributor Author

bitsondatadev commented Dec 28, 2020

Okay, so there are 3046 pictures/videos missing from my output folder that were in the original folder. It seems to be due to a name clash that somehow didn't get resolved by new_name_if_exists (i.e. I didn't find the <name>(#).<suffix> image either). I also thought perhaps I had deduped them under another name, so I checked the remaining 3046 by size, and that was also a miss. The good news is that all the missing files so far seem to be due to this name collision, so I have a feeling they're all caused by the same bug. Will hopefully have something later this week!

EDIT: BTW, I think this is something I've just introduced with my dedup changes, not something that was preexisting, so nobody else needs to worry :).

@bitsondatadev
Contributor Author

Okay, I figured out the issue. It was related to my suspicions of new_name_if_exists.

            if watch_for_duplicates:
                if new_name.stat().st_size == file.stat().st_size:
                    return file

The above code is now unnecessary with the new duplication checks that we do globally, and it was actually throwing off the rename.

https://github.com/bitsondatadev/GooglePhotosTakeoutHelper/blob/a36347fd1ea292065a488ae3c0ce00254d23d80c/google_photos_takeout_helper/__main__.py#L511-L513

Removed that and now it works as expected. Going to run this again on my takeout folder and compare the results.
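For context, a minimal sketch of what a collision-resolving rename helper looks like without that early return (the name follows the snippet above, but this is not the project's exact code):

    from pathlib import Path

    def new_name_if_exists(file: Path, target_dir: Path) -> Path:
        """Return a non-clashing destination path like name(1).jpg, name(2).jpg, ..."""
        candidate = target_dir / file.name
        i = 1
        while candidate.exists():
            # Keep incrementing instead of assuming an equal-sized file is a duplicate;
            # the global hash-based dedup pass handles real duplicates later.
            candidate = target_dir / f'{file.stem}({i}){file.suffix}'
            i += 1
        return candidate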

@TheLastGimbus
Owner

TheLastGimbus commented Jan 1, 2021

As I understand, I can start reviewing this and it's almost ready to merge? 👀

Ps. Happy new year 🎉

@bitsondatadev
Contributor Author

Happy New Year!! 🎊 Yes I am running now and will compare the results shortly. It worked on a small dataset when I removed this! Here's hoping for no more artifacts.🤞

@bitsondatadev
Contributor Author

Okay, this is looking a little better:

Final statistics:
Files copied to target folder: 25277
Removed duplicates: 4720
Files where inserting correct exif failed: 1527

  • you have full list in /Users/brian/Downloads/Google Exports/Photos/Takeout 2/Google Photos/failed_inserting_exif.txt
Files where date was set from name of the folder: 315
  • you have full list in /Users/brian/Downloads/Google Exports/Photos/Takeout 2/Google Photos/date_from_folder_name.txt

All the files that had been missing before are now there. I don't see any other artifacts. It looks like the number of photos is a little under half, which makes sense given the copies that were maintained in various albums. I will keep searching through, but I think it's safe to say that you can start reviewing, @TheLastGimbus.

@bitsondatadev bitsondatadev force-pushed the handle-albums-with-hash branch 3 times, most recently from 03104a3 to d863bd3 Compare January 1, 2021 21:24
@TheLastGimbus
Owner

Files where date was set from name of the folder: 315

So we still have that issue to solve?

Comment on lines 267 to 268
#TODO reconsider now that we're searching globally
#check which duplicate has best exif?
Owner

Um

for files in files_by_full_hash

If two files have an identical full hash, they also have identical exif - do they?

We can select the ones that have a corresponding json, if that's what you meant

Contributor Author

That's correct, I'll remove this comment.

Contributor Author

I was trying to cover an edge case where there was none 🤪.

Owner

We can probably still

select ones that have corresponding json

Just select the one with the shortest name (so no "(1)" stuff)
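A minimal sketch of that selection rule (an illustration only; the helper name is an assumption):

    from pathlib import Path

    def pick_best_duplicate(files):
        """From a group of byte-identical copies, prefer one with a .json sidecar next to it."""
        for file in files:
            if list(Path(file).parent.glob(Path(file).stem + '*.json')):
                return file
        # Otherwise fall back to the shortest name, so 'photo.jpg' beats 'photo(1).jpg'.
        return min(files, key=lambda f: len(Path(f).name))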

Contributor Author

I'll just keep it a generic "optimize in some way" lol

Contributor Author

@bitsondatadev bitsondatadev left a comment

Files where date was set from name of the folder: 315

So we still have that issue to solve?

So that's not a bug. There are a lot of edge cases that can cause this; one example is stupid Google json naming:

ls "/Users/brian/Downloads/Google Exports/Photos/Takeout 2/Google Photos/Photos from 2013/EntityRelationshipDiagram"*
/Users/brian/Downloads/Google Exports/Photos/Takeout 2/Google Photos/Photos from 2013/EntityRelationshipDiagram.jpeg.json
/Users/brian/Downloads/Google Exports/Photos/Takeout 2/Google Photos/Photos from 2013/EntityRelationshipDiagram.jpg

Notice that, for some reason, Google didn't use the same extension in the json name: .jpeg vs the image's .jpg.

We'll probably want to tackle these one by one, and that may be a little out of scope for this PR.

Comment on lines 267 to 268
#TODO reconsider now that we're searching globally
#check which duplicate has best exif?
Contributor Author

I'll just keep it a generic "optimize in some way" lol

@bitsondatadev
Contributor Author

@TheLastGimbus I made one last fix to how albums.json was getting generated. There was an issue where we would write the name of a duplicate we deleted rather than the file that actually existed in the final "cut". I had to add one more dictionary to hold rename info, but it has a relatively small footprint.

There are still issues, as I pointed out above, due to Google's naming, and those should get addressed in other PRs. Let me know if you see anything else; otherwise I would say it's ready to merge.
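A minimal sketch of that idea (the dictionary and function names are illustrative, not the PR's actual identifiers):

    # Maps a file's original name to the name it ended up with in the output folder
    # after renames and dedup, so albums.json always references the surviving file.
    renames = {}

    def record_final_name(original_name, final_name):
        renames[original_name] = final_name

    def name_for_album(original_name):
        return renames.get(original_name, original_name)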

README.md (review thread resolved)
@bitsondatadev
Contributor Author

Almost forgot to update the README. :) Now it's really ready!

@TheLastGimbus
Owner

By the way:

Diagram.jpeg.json
Diagram.jpg

This is outstanding 0_o 🔥

if len(full_hash_files) != 1:
    print(
        "full_hash_files list should only be one after duplication removal, bad state")
    exit()
Owner

You've inserted exit() here, but because there is no exit code and it sits inside a dummy "catch all", it doesn't (I think) actually stop the script.

I'm already heavily editing this part; I'll want you to review it after I'm done.
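For reference, a bare except also swallows the SystemExit that exit() raises, which may be why the script kept going; a small standalone illustration (not the project's code):

    import sys

    try:
        sys.exit(1)              # raises SystemExit(1)
    except:                      # a bare except catches SystemExit too,
        print('still running')   # so execution continues here

    try:
        sys.exit(1)
    except Exception:            # Exception does not cover SystemExit,
        pass                     # so this time the script really exits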

@@ -233,38 +268,30 @@ def find_duplicates(path: Path, filter_fun=lambda file: True):

files_by_full_hash[full_hash].append(file)
Owner

files_by_full_hash is only appended to in find_duplicates...

So populate_album_map can only be run when --keep-duplicates wasn't set, right?

We could also add some logic to run find_duplicates externally...

I know, all of these options and flags complicate everything 😕 I wanted to keep them in place in case some functions break, so I could just disable them, but now they just break existing options 😆

Contributor Author

I'm gonna make a suggestion, since I don't have much time recently (due to work and my newborn) to look through all the if/else scenarios:

  1. We get rid of the --keep-duplicates option (always remove duplicates), --dont-copy (always copy), and --dont-fix (always fix). I kind of made those assumptions when writing this code, tbh.

  2. I explain the rest of my PR here and pass the torch to someone else to handle these options.

Owner

@TheLastGimbus TheLastGimbus left a comment

So I:

  • added a flag to disable albums
  • removed unnecessary indentations and try-catch-alls - in:
    • get_date_from_folder_meta - all of these catch-alls make the code unsafe - I'd rather people spam me with printed errors than have some super-hard-to-find bugs. find_album_meta_json_file already makes sure that it gives us a valid album file 🙆
    • populate_album_map - again, overly broad catches could go really wrong

Please review these and check that everything is fine with them.

Before I did it, I got such logs at the end:

full_hash_files list should only be one after duplication removal, bad state
full_hash_files list should only be one after duplication removal, bad state
full_hash_files list should only be one after duplication removal, bad state
full_hash_files list should only be one after duplication removal, bad state
[ and so on for 41 lines ]

I fixed exit(), and now there is only one such message and the script quits.

I don't really know why this happens - maybe you know better, or I'll look at it tomorrow.

if full_hash is not None and full_hash in files_by_full_hash:
    full_hash_files = files_by_full_hash[full_hash]
    if len(full_hash_files) != 1:
        print("full_hash_files list should only be one after duplication removal, bad state")
Owner

This is always printed for me 😕

Contributor Author

was --keep-duplicates set to true?

EDIT: I said false at first; I meant true.

Owner

Nope, just standard -i -o...

Contributor Author

Okay, well, this means that this isn't getting called:

 # Removes all duplicates in folder
    def remove_duplicates(dir: Path):
        find_duplicates(dir, lambda f: (is_photo(f) or is_video(f)))
        nonlocal s_removed_duplicates_count

        # Now we have populated the final multimap of absolute dups, We now can attempt to find the original file
        # and remove all the other duplicates
        for files in files_by_full_hash.values():
            if len(files) < 2:
                continue  # this file size is unique, no need to spend cpu cycles on it

            s_removed_duplicates_count += len(files) - 1
            for file in files:
                # TODO reconsider which dup we delete these now that we're searching globally?
                if len(files) > 1:
                    file.unlink()
                    files.remove(file)
        return True

Notice that we loop through the files and only keep the last one; that's why I make that assertion. So somehow we're avoiding that call. It gets called with just -i -o on my end and I don't get those messages, so I'm not sure how our code is different. I pulled the latest with no local changes :/

    if not args.keep_duplicates:
        print('=====================')
        print('Removing duplicates...')
        print('=====================')
        remove_duplicates(
            dir=FIXED_DIR
        )

AFAIK, it would only be skipped if --keep-duplicates was true. See, I edited the comment above.
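For reference, a minimal sketch of a dedup pass that keeps exactly one file per hash group (an illustration of the idea being discussed, not the PR's code):

    from pathlib import Path

    def remove_duplicate_files(files_by_full_hash):
        """For each group of byte-identical files, keep the first and delete the rest."""
        removed = 0
        for files in files_by_full_hash.values():
            if len(files) < 2:
                continue
            keep, duplicates = files[0], files[1:]
            for dup in duplicates:
                Path(dup).unlink()
                removed += 1
            # Shrink the group in place so later passes (e.g. album matching) see only the kept file.
            files[:] = [keep]
        return removed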

Comment on lines +540 to +544
else:
    print('ERROR! There was literally no option to set date!!!')
    # TODO
    print('TODO: We should do something about this - move it to some separate folder, or write it down in '
          'another .txt file...')
Owner

In such a tragic scenario, I think we should do something different - I think a .txt file and a BIG disclaimer at the end will be enough.

@bitsondatadev
Contributor Author

By the way:

Diagram.jpeg.json
Diagram.jpg

This is outstanding 0_o 🔥

This solves that, if we want to add it, with one artifact: if I have two identical files with different extensions in the same folder, then this is still an issue.

    # Returns json dict
    def find_json_for_file(file: Path):
        potential_json = list(file.parent.rglob(file.stem + ".*json"))
        if len(potential_json) != 0:
            try:
                with open(potential_json[0], 'r') as f:
                    dict = _json.load(f)
                return dict
            except:
                raise FileNotFoundError(f"Couldn't find json for file: {file}")
        else:
            raise FileNotFoundError(f"Couldn't find json for file: {file}")

Oh... it was me...
Contributor Author

@bitsondatadev bitsondatadev left a comment

@TheLastGimbus you crack me up XD

@bitsondatadev
Contributor Author

bitsondatadev commented Jan 10, 2021

Uhhh, @TheLastGimbus, for some reason I no longer see Photos as an option on takeout.google.com. WTF?

Update: sorry, I was on my corporate google account. XD

@TheLastGimbus
Owner

Honestly, I wouldn't be surprised...

3d66cd1

@TheLastGimbus
Owner

You know what, when I disabled albums, the script works fine. Probably no one is gonna use albums in .json for now, so let's just get this closed!

I just removed the unnecessary options (you were pretty much right about them) and made albums optional 🎉

Thank you again for fixing this!

I will do the rest of the PRs (fixing json finding) myself... probably...

@TheLastGimbus TheLastGimbus merged commit d4ca298 into TheLastGimbus:master Jan 10, 2021
@bitsondatadev
Contributor Author

My pleasure!! Thanks for your continued work on this!! If I get more time I'll try to circle back and help out where I can!

@bitsondatadev
Contributor Author

Last thing I'd like to ask here: could anyone who sees this give a star to the GitHub project for Trino? https://github.com/trinodb/trino

This project was originally called PrestoSQL and is maintained by the original creators of Presto. They had to rename the project because Facebook, with the backing of the Linux Foundation, enforced the trademark. Read more here:

https://trino.io/blog/2020/12/27/announcing-trino.html

The project was formed to maintain a pure and healthy open source project and needs support to build more awareness than its Presto predecessor.

Thank you!

@TheLastGimbus
Owner

TheLastGimbus commented Feb 14, 2021

!time

Edit: Just curiosity. Huh, so it's pretty much the same... so my CI is useless 😆

@github-actions

30 times: 21 seconds, 1 time: 0.7000 seconds, 11.000 ms per file
