Refactor duplication and Add Albums
bitsondatadev committed Dec 27, 2020
1 parent 686bbf9 commit fbddfba
Showing 2 changed files with 117 additions and 94 deletions.
17 changes: 3 additions & 14 deletions README.md
@@ -16,8 +16,7 @@ It will take all of your photos from those tiny folders, set their `exif` and `l
0. Get all your photos in [Google Takeout](https://takeout.google.com/) (select only Google Photos)
1. `pip3 install -U google-photos-takeout-helper`
2. Extract all contents from your Google Takeout to one folder
3. Cut out/remove all ["album folders"](#why-do-you-need-to-cut-out-albums) that aren't named "2016-06-16" or something like that
4. Run `google-photos-takeout-helper -i [INPUT TAKEOUT FOLDER] -o [OUTPUT FOLDER]`
3. Run `google-photos-takeout-helper -i [INPUT TAKEOUT FOLDER] -o [OUTPUT FOLDER]`

Alternatively, if you don't have PATH set right, you can call it `python3 -m google_photos_takeout_helper`

@@ -45,9 +44,6 @@ If something goes wrong and it prints some red errors, try to add ` --user` flag
3. Prepare your Takeout:

If your Takeout was divided into multiple `.zip`s, you will need to extract them and move their contents into one folder.

Because I don't have a good solution for handling albums, you will need to cut out all ["Album folders"](#why-do-you-need-to-cut-out-albums) - those that are not named like "2016-06-26" or "2016-06-26 #2". Don't worry, all photos from albums are already in the corresponding "date folders" - they would just create duplicates.

Now, you should be able to just run it straight in cmd/terminal:

4. `google-photos-takeout-helper -i [INPUT TAKEOUT FOLDER] -o [OUTPUT FOLDER]`
@@ -64,13 +60,6 @@ If you have issues/questions, you can hit me up either by [Reddit](https://www.r
</p>
</details>

### Why do you need to cut out albums?
They mostly contain duplicates of the same photos that are in the corresponding "date folder". (Note: not ALL photos found in album folders will be duplicated in date folders. You should maintain a separate backup of the original Google Takeout folder/zip to ensure you don't lose any photos. See [Issue #22](https://github.com/TheLastGimbus/GooglePhotosTakeoutHelper/issues/22) for more details)
This script tries to get the "photo taken time" right. If it finds a json, it sets everything from that json (it contains the edited timestamp that you might have corrected in Google Photos). If it can't, it tries to get Exif data from the photo.
If it can't find anything like that, it sets the date from the folder name.

All of this is so that you can then safely store ALL of your photos in one folder, and they will all be in the right order.
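
In code terms, the lookup order works roughly like this (a simplified sketch; `resolve_photo_date` and the callables it takes are hypothetical stand-ins, not the script's actual API):

```python
from pathlib import Path

# Sketch of the date-resolution priority described above. The three lookups are
# passed in as callables (each returning a date string or None) so the example
# stays self-contained; in the real script this cascade lives in fix_metadata().
def resolve_photo_date(photo: Path, date_from_json, date_from_exif, date_from_folder):
    for source in (date_from_json, date_from_exif):
        date = source(photo)                    # 1. json sidecar, 2. Exif embedded in the photo
        if date is not None:
            return date
    return date_from_folder(photo.parent)       # 3. last resort: the "2016-06-16" folder name
```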

#### Unless you move them around your Android phone.
Beware that (99% of the time), if you move files around in Android, their creation and modification times are reset to the current time.

@@ -83,7 +72,7 @@ https://github.com/SimpleMobileTools/Simple-Gallery

- If you want something more centralized but also self-hosted, [Nextcloud](https://nextcloud.com) is a nice choice, but its approach to photos is still not perfect. (And you need to set up your own server)

- Guys at [Photoprims](https://photoprism.org/) are working on a full Google Photos alternative, with search and AI tagging etc., but it's still a work in progress. (I will edit this when they are done, but can't promise :P )
- Guys at [Photoprism](https://photoprism.org/) are working on a full Google Photos alternative, with search and AI tagging etc., but it's still a work in progress. (I will edit this when they are done, but can't promise :P )


#### Other Takeout projects
@@ -100,4 +89,4 @@ https://github.com/HardFork/KeepToText
### TODO (Pull Requests welcome):
- [ ] Videos' Exif data
- [x] Gps data: from JSON to Exif - Thank you @DalenW :sparkling_heart:
- [ ] Some way to handle albums - Kinda WIP in #10
- [x] Some way to handle albums - Done!
194 changes: 114 additions & 80 deletions google_photos_takeout_helper/__main__.py
@@ -20,9 +20,6 @@ def main():
"""This script takes all of your photos form Google Photos takeout,
fixes their exif DateTime data (when they were taken) and file creation date,
and then copies it all to one folder.
"Why do I need to delete album folders?"
-They mostly contain duplicates of same photos that are in corresponding "date folder" :/
You need to do this before running this. (Note: not ALL photos found in album folders will be duplicated in date folders. You should maintain a separate backup of the original Google Takeout folder/zip to ensure you don't lose any photos. See [Issue #22](https://github.com/TheLastGimbus/GooglePhotosTakeoutHelper/issues/22) for more details)
""",
)
parser.add_argument(
@@ -73,20 +70,7 @@ def main():
)
args = parser.parse_args()

print('DISCLAIMER!')
print("Before running this script, you need to cut out all folders that aren't dates")
print("That is, all album folders, and everything that isn't named")
print('2016-06-16 (or with "#", they are good)')
print('See README.md or --help on why')
print("(Note: not ALL photos found in album folders will be duplicated in date folders. You should maintain a separate backup of the original Google Takeout folder/zip to ensure you don't lose any photos. See [Issue #22](https://github.com/TheLastGimbus/GooglePhotosTakeoutHelper/issues/22) for more details)")
print()
print('Type "yes i did that" to confirm:')
response = input()
if response.lower() == 'yes i did that':
print('Heeeere we go!')
else:
print('Ok come back when you do this')
exit(-2)
print('Heeeere we go!')

PHOTOS_DIR = Path(args.input_folder)
FIXED_DIR = Path(args.output_folder)
@@ -104,6 +88,14 @@ def main():
# Add more "edited" flags in more languages if you want. They need to be lowercase.
]

# Album multimap: album folder name -> list of file names (dumped to albums.json at the end)
album_mmap = _defaultdict(list)

# Duplicates multimap: full-content hash -> all files whose contents share that hash
files_by_full_hash = _defaultdict(list)

# Cache of album metadata json files already found, keyed by directory
meta_file_memo = dict()

# Statistics:
s_removed_duplicates_count = 0
s_copied_files = 0
@@ -174,7 +166,37 @@ def get_hash(file: Path, first_chunk_only=False, hash_algo=_hashlib.sha1):
hashobj.update(chunk)
return hashobj.digest()

# PART 1: removing duplicates
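# populate_album_map: when a folder contains an album metadata json, every photo/video
# in it gets recorded under album_mmap[<album folder name>]. If the file name is not
# already present in the output folder, the full-content hash is used to look up the
# copy that survived duplicate removal, and that surviving file's name is recorded
# instead, so the album entry points at a file that actually exists in the output.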
def populate_album_map(path: Path, filter_fun=lambda f: (is_photo(f) or is_video(f))):
if not path.is_dir():
raise NotADirectoryError('populate_album_map only handles directories not files')
try:
meta_file_exists = find_album_meta_json_file(path)

if meta_file_exists:  # means we are processing an album, so process it
for file in path.rglob("*"):
if file.is_file() and filter_fun(file):
file_name = file.name
if not Path(str(FIXED_DIR) + "/" + file.name).is_file():
try:
full_hash = get_hash(file, first_chunk_only=False)
if full_hash in files_by_full_hash:
full_hash_files = files_by_full_hash[full_hash]
if len(full_hash_files) != 1:
print("full_hash_files list should only be one after duplication remvoal, bad state")
exit()
else:
full_hash_file = full_hash_files[0]
file_name = full_hash_file.name


except:
pass
album_mmap[file.parent.name].append(file_name)
except:
pass


# PART 3: removing duplicates

# THIS IS PARTLY COPIED FROM STACKOVERFLOW
# https://stackoverflow.com/questions/748675/finding-duplicate-files-and-removing-them
@@ -190,10 +212,6 @@ def get_hash(file: Path, first_chunk_only=False, hash_algo=_hashlib.sha1):
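# find_duplicates groups candidate files in passes of increasing cost: first by file
# size, then by a hash of only the first chunk of each file, and finally by a hash of
# the full contents. Files that still collide at the last stage are gathered together
# in files_by_full_hash, which remove_duplicates() later walks to delete redundant copies.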
def find_duplicates(path: Path, filter_fun=lambda file: True):
files_by_size = _defaultdict(list)
files_by_small_hash = _defaultdict(list)
files_by_full_hash = _defaultdict(list)

# Excluding original files (or first file if original not found)
duplicates = []

for file in path.rglob("*"):
if file.is_file() and filter_fun(file):
@@ -233,34 +251,27 @@ def find_duplicates(path: Path, filter_fun=lambda file: True):

files_by_full_hash[full_hash].append(file)

# Now we have the final multimap of absolute dups, We now can attempt to find the original file
# Removes all duplicates in folder
def remove_duplicates(dir: Path):
find_duplicates(dir, lambda f: (is_photo(f) or is_video(f)))
nonlocal s_removed_duplicates_count

# Now we have populated the final multimap of absolute dups; we can now attempt to find the original file
# and remove all the other duplicates
for files in files_by_full_hash.values():
if len(files) < 2:
continue # this file size is unique, no need to spend cpu cycles on it
original = None
for file in files:
if not _re.search(r'\(\d+\).', file.name):
original = file
if original is None:
original = files[0]

dups = files.copy()
dups.remove(original)
duplicates += dups

return duplicates

# Removes all duplicates in folder
def remove_duplicates(dir: Path):
duplicates = find_duplicates(dir, lambda f: (is_photo(f) or is_video(f)))
for file in duplicates:
file.unlink()
nonlocal s_removed_duplicates_count
s_removed_duplicates_count += len(duplicates)
s_removed_duplicates_count += len(files) - 1
for file in files:
#TODO reconsider now that we're searching globally
#check which duplicate has best exif?
if len(files) > 1:
file.unlink()
files.remove(file)
return True

# PART 2: Fixing metadata and date-related stuff
# PART 1: Fixing metadata and date-related stuff

# Returns json dict
def find_json_for_file(file: Path):
@@ -276,30 +287,42 @@ def find_json_for_file(file: Path):
raise FileNotFoundError(f"Couldn't find json for file: {file}")

# Returns date in 2019:01:01 23:59:59 format
def get_date_from_folder_name(dir: Path):
dir = dir.name
dir = dir[:10].replace('-', ':').replace(' ', ':') + ' 12:00:00'
def get_date_from_folder_meta(dir: Path):
try:
file = find_album_meta_json_file(dir)
if file:
try:
with open(str(file), 'r') as f:
dict = _json.load(f)
if "date" in dict["albumData"]:
if "timestamp" in dict["albumData"]["date"]:
return _datetime.fromtimestamp(int(dict["albumData"]["date"]["timestamp"])).strftime('%Y:%m:%d %H:%M:%S')
except:
pass
except:
pass

print("Couldn't pull datetime from album meta")
return None
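
# For reference, the minimal album metadata json shape this function relies on looks
# roughly like this (a sketch showing only the keys read above; real Takeout metadata
# may contain additional fields):
# {
#   "albumData": {
#     "date": { "timestamp": "1561311600" }
#   }
# }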

def find_album_meta_json_file(dir: Path):
if str(dir) in meta_file_memo:
return meta_file_memo[str(dir)]

for file in dir.rglob("*.json"):
try:
with open(str(file), 'r') as f:
dict = _json.load(f)
if "albumData" in dict:
meta_file_memo[str(dir)] = file
return file
except Exception as e:
print(e)
raise FileNotFoundError(f"find_album_meta_json_file - Couldn't find json for file: {file}")

return None

# Sometimes google exports folders without the -, like 2009 08 30...
# So the end result would be 2009 08 30 12:00:00, which does not match the format.
# Therefore, we also replace the spaces with ':'

# Reformat it to check if it matches, and quit if it doesn't - it's probably not a date folder
try:
return _datetime.strptime(dir, '%Y:%m:%d %H:%M:%S').strftime('%Y:%m:%d %H:%M:%S')
except ValueError as e:
print()
print(e)
print()
print('==========!!!==========')
print(f"Wrong folder name: {dir}")
print("You probably forgot to remove 'album folders' from your takeout folder")
print("Please do that - see README.md or --help for why")
print("https://github.com/TheLastGimbus/GooglePhotosTakeoutHelper#why-do-you-need-to-cut-out-albums")
print()
print('Once you do this, just run it again :)')
print('==========!!!==========')
exit(-1)

def set_creation_date_from_str(file: Path, str_datetime):
try:
@@ -462,7 +485,7 @@ def set_file_geo_data(file: Path, json):

# Fixes ALL metadata, takes just file and dir and figures it out
def fix_metadata(file: Path):
print(file)
#print(file)

has_nice_date = False
try:
@@ -483,22 +506,23 @@ def fix_metadata(file: Path):
has_nice_date = True
return
except FileNotFoundError:
print("Couldn't find json for file :/")
print("Couldn't find json for file ")

if has_nice_date:
return

print('Last chance, copying folder name as date...')
date = get_date_from_folder_name(file.parent)
set_file_exif_date(file, date)
set_creation_date_from_str(file, date)
print('Last chance, copying folder meta as date...')
date = get_date_from_folder_meta(file.parent)
if date:
set_file_exif_date(file, date)
set_creation_date_from_str(file, date)

nonlocal s_date_from_folder_files
s_date_from_folder_files.append(str(file.resolve()))

return True

# PART 3: Copy all photos and videos to target folder
# PART 2: Copy all photos and videos to target folder

# Makes a new name like 'photo(1).jpg'
def new_name_if_exists(file: Path, watch_for_duplicates=True):
@@ -537,14 +561,6 @@ def copy_to_target_and_divide(file: Path):
s_copied_files += 1
return True

if not args.keep_duplicates:
print('=====================')
print('Removing duplicates...')
print('=====================')
for_all_files_recursive(
dir=PHOTOS_DIR,
folder_function=remove_duplicates
)
if not args.dont_fix:
print('=====================')
print('Fixing files metadata and creation dates...')
@@ -574,6 +590,24 @@ def copy_to_target_and_divide(file: Path):
file_function=copy_to_target,
filter_fun=lambda f: (is_photo(f) or is_video(f))
)
if not args.keep_duplicates:
print('=====================')
print('Removing duplicates...')
print('=====================')
remove_duplicates(
dir=FIXED_DIR
)

print('=====================')
print('Populating albums...')
print('=====================')
for_all_files_recursive(
dir=PHOTOS_DIR,
folder_function=populate_album_map
)
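
# The albums.json written below is a plain mapping of album folder name to the file
# names that belong to it, e.g. (illustrative values):
#   { "Summer trip": ["IMG_1234.jpg", "IMG_1235.jpg"], "Pets": ["doggo.mp4"] }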

with open(str(FIXED_DIR) + '/albums.json', 'w+') as outfile:
_json.dump(album_mmap, outfile)

print()
print('DONE! FREEDOM!')
