Not all photos present in yyyy-mm-dd named folders #22

Closed
rtadams89 opened this issue Dec 4, 2020 · 17 comments
Labels
help wanted Extra attention is needed important Really important problem that is messing people's photos

@rtadams89
Contributor

The instructions indicate

> Before running this script, you need to cut out all folders that aren't dates
> That is, all album folders, and everything that isn't named
> 2016-06-16 (or with "#", they are good)
> See README.md or --help on why
> (Don't worry, your photos from albums are already in some date folder)

This is, however, not true in my experience. I have exported photos from two different Google Accounts, and each contains hundreds of photos that exist in custom-named album folders but do not exist in any of the folders named yyyy-mm-dd.

@TheLastGimbus
Owner

Are you sure? Try to find those as hard as possible - because if it's true, then we have a problem 😕

@rtadams89
Contributor Author

Well, here's what I did to confirm this:

  1. Downloaded the takeout zip file and extracted the contents to ~/takeout
  2. Moved all subfolders that weren't named as a date to ~/takeoutalbums
  3. Ran find -type f -name "*.jpg" -exec md5sum '{}' \; > md5sumdates.txt in the ~/takeout directory, and ran find -type f -name "*.jpg" -exec md5sum '{}' \; > md5sumalbums.txt in the ~/takeoutalbums directory
  4. Opened the two listings of file hashes in Excel to do some formatting/comparison

The date based folders contained 4151 jpg files. After deduplicating the hashes, 3374 unique hashes/jpg files remained. The album based folders contained 5268 jpg files. After deduplicating the hashes, 5147 unique hashes/jpg files remained. Already this indicates an issue as there are more unique hashes in the album folders than in the date folders.

I used the Excel MATCH function to compare the hashes found in the dates folder vs the albums folder. There were 449 hashes/jpg files that existed in the date based folders that did not exist in the album based folders. The real concern is that there are 2222 hashes/jpgs found in the album based folders that did not exist in the date based folders.
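
The same comparison can be reproduced without Excel. This is only a minimal sketch, assuming the two listings are in the usual md5sum output format (hash, whitespace, path); the function names are hypothetical and not part of the project.

```python
from pathlib import Path


def parse_md5sum_lines(lines):
    """Map each MD5 hash to the list of paths that share it."""
    by_hash = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        # md5sum prints "<hash>  <path>"; split once so paths with spaces survive.
        digest, path = line.split(maxsplit=1)
        # In binary mode md5sum prefixes the path with "*"; drop that marker.
        by_hash.setdefault(digest.lower(), []).append(path.lstrip("*"))
    return by_hash


def compare_listings(date_lines, album_lines):
    """Return (hashes only in date folders, hashes only in album folders)."""
    dates = parse_md5sum_lines(date_lines)
    albums = parse_md5sum_lines(album_lines)
    return set(dates) - set(albums), set(albums) - set(dates)


if __name__ == "__main__":
    # File names taken from the commands above; skipped if they are absent.
    date_file = Path("md5sumdates.txt")
    album_file = Path("md5sumalbums.txt")
    if date_file.exists() and album_file.exists():
        only_dates, only_albums = compare_listings(
            date_file.read_text().splitlines(),
            album_file.read_text().splitlines(),
        )
        print(f"{len(only_dates)} hashes only in date folders")
        print(f"{len(only_albums)} hashes only in album folders")
```

The reported numbers are at least internally consistent: 3374 - 449 and 5147 - 2222 both give 2925 hashes shared between the two sets of folders.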

@bitsondatadev
Contributor

This can be solved by #10 once we come to an agreement on how we handle album/dir info. I noticed that a lot of the named directories were folders I initially uploaded to Google, as opposed to photos that were synced directly to the service.

@rtadams89
Contributor Author

I would appreciate true support for albums. That said, it seems prudent to more immediately either build in support for folders named after albums now (across my 95 album folders, I only had 3 jpgs which would have required using the folder name to derive date -- maybe just skip processing those instead of aborting the whole script), or at least remove the notice

> (Don't worry, your photos from albums are already in some date folder)

and replace it with a proper warning instead.

@antimofm

antimofm commented Dec 4, 2020

Good thing I kept the original Takeout archives :P

@bitsondatadev
Contributor

bitsondatadev commented Dec 4, 2020

> Good thing I kept the original Takeout archives :P

You could also have run another takeout, unless you preemptively deleted all your Google Photos.

@bitsondatadev
Contributor

bitsondatadev commented Dec 4, 2020

> more immediately either build in support for folders named after albums now (across my 95 album folders, I only had 3 jpgs which would have required using the folder name to derive date -- maybe just skip processing those instead of aborting the whole script), or at least remove the notice

I plan to build support for both folders and albums in some capacity in #10. In #11 I added the hashing needed to compare files and determine, at the image level, whether they are the same photo. Extending this to identify which photos belong to a list of photo albums or folders should be straightforward now. The main issue is: how do we output this?

  1. Create a json file that contains a list of image names per tag. (not friendly to non-developers)
  2. Keep duplicate images in the folders as well as in the root folder. (easy to understand, but duplicates are bad)
  3. We looked into using shortcuts, but this won't work on all filesystems, including object stores.

So it's not yet clear what exactly to do. The fastest solution that everyone can understand is option 2, so I think I would start there.
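
For comparison, option 1 could look roughly like the following. It is only a sketch: the function names, the set of extensions and the idea of one output file for all albums are assumptions made for illustration, not something the project defines.

```python
import json
from pathlib import Path

# Assumed set of photo extensions; extend as needed.
PHOTO_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def build_album_index(albums_root):
    """Map each album folder name to the sorted photo file names inside it."""
    index = {}
    for album_dir in sorted(Path(albums_root).iterdir()):
        if not album_dir.is_dir():
            continue
        names = sorted(
            p.name
            for p in album_dir.iterdir()
            if p.is_file() and p.suffix.lower() in PHOTO_EXTENSIONS
        )
        if names:  # leave empty albums out of the index
            index[album_dir.name] = names
    return index


def write_album_index(albums_root, output_path):
    """Write the album index as indented, human-readable JSON."""
    index = build_album_index(albums_root)
    Path(output_path).write_text(
        json.dumps(index, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return index
```

Note that indexing by file name alone is ambiguous when two different photos share a name; a real implementation would probably key on the content hash instead.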

@rtadams89
Contributor Author

I have some thoughts on that which I can add under issue #10 / #11, but as that seems like a longer term effort, should the documentation and code be updated now to better indicate that photos in Albums may be lost with the current code?

@bitsondatadev
Contributor

I think that makes sense.

@rtadams89
Contributor Author

I've made the wording changes in the readme and code. I don't currently have permissions to the repository and, to be honest, probably won't have time to work on #10 or #11 in the future. For now, I've broken protocol and added my changes to a fork at https://github.com/rtadams89/GooglePhotosTakeoutHelper/commit/63b6e4d56dac5988ae9bd1b50c825816e73d7212

@bitsondatadev
Contributor

No protocol broken; you can just make a pull request from your fork.

TheLastGimbus added a commit that referenced this issue Dec 4, 2020
@rtadams89
Contributor Author

That should resolve this issue. I'll take a look at #10 and #11 when I get a little time.

@TheLastGimbus
Owner

Looking at all of this, and wrapping my thoughts:

> I noticed that a lot of the named directories were folders I initially uploaded to Google, as opposed to photos that were synced directly to the service.

Okay, so it looks like the case where "photos from album folders are not in date folders" affects people who uploaded whole folders or batches of photos through the desktop app or some other way. 99% of people just download the app and let it run, so I'm confident that my code didn't break the photos for a lot of people, but this still needs to be fixed.

I think we have a clear path of what needs to be done:

  1. Fix "JSON naming too long?" (#8) and all "json not found" errors by finding jsons based on their "title" tag instead of their file name. I think that should reduce the number of cases where the .json file was not found to near 0 (or even literally 0!)

With this done, we could let the script run inside the "album folders", and just not copy duplicates.

If the number of "json not found" errors is near 0, we could just move those files to some special "failed" folder, to be handled manually by the user later.
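
A rough sketch of the title-based lookup from step 1, assuming each metadata .json file carries a "title" field that holds the photo's file name, as described above. The function names and the way the "failed" folder is handled are illustrative only, not the script's actual behaviour.

```python
import json
import shutil
from pathlib import Path


def index_json_by_title(folder):
    """Map the "title" value inside each .json file to that file's path."""
    by_title = {}
    for json_path in sorted(Path(folder).glob("*.json")):
        try:
            data = json.loads(json_path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or malformed metadata, skip it
        title = data.get("title") if isinstance(data, dict) else None
        if isinstance(title, str):
            # Keep the first match if several jsons claim the same title.
            by_title.setdefault(title, json_path)
    return by_title


def match_photos(folder, failed_dir):
    """Pair photos with metadata by title; move unmatched photos to failed_dir."""
    folder = Path(folder)
    by_title = index_json_by_title(folder)
    matched = {}
    for photo in sorted(folder.iterdir()):
        if not photo.is_file() or photo.suffix.lower() == ".json":
            continue
        json_path = by_title.get(photo.name)
        if json_path is not None:
            matched[photo] = json_path
        else:
            # No metadata found: set the photo aside for manual handling.
            Path(failed_dir).mkdir(parents=True, exist_ok=True)
            shutil.move(str(photo), str(Path(failed_dir) / photo.name))
    return matched
```

This only helps when the title matches the photo's name on disk exactly; if the photo file itself was renamed or truncated in the export, a looser comparison would be needed.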

Though 99% of people do have near-fully duplicated albums, so this would generally slow things down because of the hashing. So:

  2. Add support for albums. How?

Create a json file / keep duplicate images in the folders as well as in the root folder / use shortcuts

Why not let the user decide? "You can have them by shortcuts, but that may not work on all systems, or just copied to separate folder - which of those/maybe both?"

By the way, I should probably merge #18 before making any of the above changes; it will make things easier

Oh, you closed this while I was writing this 😅 This issue is very much open, and adding a warning does not solve that 😕

@TheLastGimbus TheLastGimbus reopened this Dec 4, 2020
@TheLastGimbus TheLastGimbus pinned this issue Dec 4, 2020
@TheLastGimbus TheLastGimbus added the important Really important problem that is messing people's photos label Dec 4, 2020
@TheLastGimbus TheLastGimbus added this to the Albums milestone Dec 4, 2020
@TheLastGimbus TheLastGimbus added the help wanted Extra attention is needed label Dec 4, 2020
@bitsondatadev
Contributor

@TheLastGimbus Re: 2. I agree that letting the user decide is great, but I don't want to implement all of those in one go. I want to prioritize one method. For me, the simplest thing is to create duplicates in another folder. Once that works, implementing the variants should be very straightforward. So I will start by implementing that variation and get it merged. Then I or others can add the shortcut/json versions after.
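
To make the "duplicates in another folder" variant concrete, here is a minimal sketch. The function names, the use of SHA-256 and the known_hashes parameter (hashes of photos already found in date folders) are assumptions for illustration; this is not the hashing code from #11.

```python
import hashlib
import shutil
from pathlib import Path


def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_albums(albums_root, output_root, known_hashes=frozenset()):
    """Copy every album photo to output_root/<album>/, duplicates included.

    Returns the copied paths whose content hash is not in known_hashes,
    i.e. photos that exist only in an album folder.
    """
    album_only = []
    for album_dir in sorted(Path(albums_root).iterdir()):
        if not album_dir.is_dir():
            continue
        target_dir = Path(output_root) / album_dir.name
        target_dir.mkdir(parents=True, exist_ok=True)
        for photo in sorted(album_dir.iterdir()):
            if not photo.is_file() or photo.suffix.lower() == ".json":
                continue
            target = target_dir / photo.name
            shutil.copy2(photo, target)  # copy2 also preserves timestamps
            if file_hash(photo) not in known_hashes:
                album_only.append(target)
    return album_only
```

The returned list is the part that matters for this issue: those are the photos that would be lost if album folders were simply cut out before running the script.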

@antimofm

antimofm commented Dec 7, 2020

> You could also have run another takeout, unless you preemptively deleted all your Google Photos.

I didn't delete my Google Photos, but if possible I'll avoid running another Takeout. I'm just annoyed at the fact that you can download individual Takeout archives only once. I actually had to run 3 Takeouts before I managed to download them: the first time I canceled the download because the destination I chose wouldn't have enough space, the second time the page crashed while downloading and it still wouldn't re-download the archive. Third time lucky...

@bitsondatadev
Contributor

> > You could also have run another takeout, unless you preemptively deleted all your Google Photos.
>
> I didn't delete my Google Photos, but if possible I'll avoid running another Takeout. I'm just annoyed at the fact that you can download individual Takeout archives only once. I actually had to run 3 Takeouts before I managed to download them: the first time I canceled the download because the destination I chose wouldn't have enough space, the second time the page crashed while downloading and it still wouldn't re-download the archive. Third time lucky...

You shouldn't need to stay on the same page to download your takeout. Once it's started, you can return to https://takeout.google.com/takeout/downloads to see progress and download finished takeouts, which remain available for a few days after the takeout is completed.

@TheLastGimbus
Owner

TheLastGimbus commented Jan 10, 2021

🎉

Will push the new version to PyPI very soon

pip install google-photos-takeout-helper==2.0.0rc1

4 participants