
# Introduction

This script organizes and compresses image files from a source directory. It sorts images into year-based folders, re-saves large images at reduced quality to save space, skips duplicates, and cleans up the resulting folder structure. It relies on the Python standard library plus Pillow for image handling and `tqdm` for progress reporting.

## Main Features

- **Organize Images by Year**: Sorts images into year-based folders based on EXIF data or filenames.
- **Image Compression**: Compresses large images to save storage space while maintaining quality within defined limits.
- **Duplicate Detection**: Computes an MD5 hash of each file and records it in `processed_images.txt`, so identical or previously processed images are skipped.
- **EXIF Data Utilization**: Extracts information such as the capture date and image orientation from EXIF data.
- **Error Handling**: Logs errors to `error_log.log` and copies images that fail compression to a dedicated `CorruptedImages` folder.
- **Optimize Folder Structure**: Removes empty and unnecessary subfolders to create a flat and organized structure.
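
As a rough illustration of the duplicate-detection idea, the sketch below hashes file contents with MD5 and skips any file whose digest has already been seen. The helper names `file_md5` and `skip_duplicates` are hypothetical and chosen for this example only; the script's own implementation appears in the code cell further down.

```python
import hashlib

def file_md5(path, chunk_size=4096):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def skip_duplicates(paths, seen_hashes=None):
    """Yield only the paths whose content hash has not been seen before."""
    seen = set() if seen_hashes is None else seen_hashes
    for path in paths:
        file_hash = file_md5(path)
        if file_hash in seen:
            continue  # same bytes as an earlier file, so treat it as a duplicate
        seen.add(file_hash)
        yield path
```

Because the hash covers the raw bytes, two files count as duplicates only when they are byte-for-byte identical; a re-encoded copy of the same photo would not be detected this way.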

## Workflow

1. **Initialization**: Import necessary modules and define parameters like size limits and allowed file extensions.
2. **Directory Preparation**: Create required folders for corrupted images and images without date information.
3. **Image Processing**:
   - **Validation**: Check if the file is a valid image and exceeds the minimum size.
   - **Duplicate Checking**: Calculate an MD5 hash to identify already processed images.
   - **Extract Year**:
     - **From EXIF Data**: Attempt to extract the capture year from EXIF data.
     - **From Filename**: If EXIF data is unavailable, extract the year from the filename using regex patterns.
     - **Default Assignment**: Assign the image to the "NoDate" folder if no year can be determined.
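   - **Image Compression**: Re-save images larger than `small_size_limit` at progressively lower quality until they fit the target size or the quality floor is reached.
   - **Correct Orientation**: Rotate the image according to its EXIF orientation tag. In the code below this happens as part of compression, so images copied without compression keep their original orientation data.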
   - **Storage**: Copy the processed image into the appropriate year directory in the destination folder.
4. **Post-Processing**:
   - **Flatten Folder Structure**: Merge folders with only one subfolder to simplify the structure.
   - **Remove Empty Folders**: Delete empty folders that are no longer needed.
5. **Logging**: Record all errors and exceptions in a log file for easy tracking.
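
The year-resolution order in step 3 (EXIF first, then filename, then "NoDate") can be sketched as follows. This is a simplified illustration, not the script's own function: `year_from_filename` and `resolve_year` are hypothetical names, and the single regex used here is an assumption that differs from the pattern list in the code cell below. It scans every four-digit candidate that starts a digit run and accepts the first one in the 2000 to 2030 range.

```python
import re

# A four-digit group that is not preceded by another digit
YEAR_CANDIDATE = re.compile(r"(?<!\d)(\d{4})")

def year_from_filename(filename, min_year=2000, max_year=2030):
    """Return the first plausible year found in a filename, or None."""
    for match in YEAR_CANDIDATE.finditer(filename):
        if min_year <= int(match.group(1)) <= max_year:
            return match.group(1)
    return None

def resolve_year(exif_year, filename):
    """Prefer the EXIF year, then the filename, then the 'NoDate' bucket."""
    return exif_year or year_from_filename(filename) or "NoDate"

print(resolve_year(None, "IMG_20210315_101500.jpg"))
print(resolve_year("2018", "IMG_20210315_101500.jpg"))
print(resolve_year(None, "scan.png"))
```

Checking every candidate rather than only the first match means a name such as `scan_1234_2019.tiff` still resolves to 2019.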

## Parameters and Settings

- **Size Limits**:
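  - `small_size_limit`: 3 MB – Images above this size are compressed; smaller images are copied unchanged.
  - `medium_size_limit`: 10 MB – Intended as an additional compression level for very large images; it is defined but not referenced elsewhere in the code below.
  - `target_size_limit`: 6 MB – Target size for compressed images. `compress_image` currently takes this value from its own `target_size_mb=6` default rather than from this variable.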
  - `min_image_size`: 0.05 MB – Minimum size for an image to be processed.
- **Directories**:
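  - `corrupted_folder`: Folder (`CorruptedImages`, created in the current working directory) that receives copies of images whose compression failed.
  - `nodate_folder`: Folder name (`NoDate`) for images without a determined date. Such images are stored under `NoDate` inside the destination directory.
- **File Extensions**: Allowed extensions are `.jpg`, `.jpeg`, `.png`, `.tiff`, `.bmp`, `.gif`. The list is defined in `allowed_extensions`; validation itself relies on Pillow being able to open and verify the file.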
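
To make the effect of the size limits concrete, here is a minimal sketch of the decision they imply. The function `choose_action` and the upper-case constant names are hypothetical; the thresholds are the values listed above.

```python
MB = 1024 * 1024

# Values mirror the parameters listed above
MIN_IMAGE_SIZE = 0.05 * MB
SMALL_SIZE_LIMIT = 3 * MB

def choose_action(size_bytes):
    """Return 'skip', 'copy', or 'compress' for a file of the given size."""
    if size_bytes < MIN_IMAGE_SIZE:
        return "skip"       # too small to be treated as a valid image
    if size_bytes > SMALL_SIZE_LIMIT:
        return "compress"   # re-saved at reduced quality
    return "copy"           # copied unchanged

for size in (10_000, 2 * MB, 8 * MB):
    print(size, choose_action(size))
```

A file of exactly 3 MB falls into the "copy" branch, since only sizes strictly above the limit trigger compression.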

## Prerequisites

- **Python Modules**:
  - Standard libraries: `os`, `shutil`, `hashlib`, `re`, `time`, `logging`
  - External libraries: `PIL` (Pillow), `tqdm`
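- **EXIF Support**: Year detection from the capture date and orientation correction rely on EXIF metadata. Images without it fall back to filename-based year detection or the "NoDate" folder.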
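
Before a long run it can help to confirm that the external libraries are importable. The following is one possible check using only the standard library; `missing_modules` is a hypothetical helper, and `PIL` and `tqdm` are the import names of the two external libraries listed above.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

required = ["PIL", "tqdm"]  # import names of the external libraries
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All external prerequisites are importable.")
```

Note that Pillow is imported as `PIL` but installed under the package name `Pillow`.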

## Usage

1. **Set Directories**:
   - Define the `source_directory` as the path containing the images to be processed.
   - Define the `destination_directory` where the organized images will be stored.
2. **Run the Script**:
   - Ensure all prerequisites are installed.
   - Execute the script in a Python environment.
3. **Monitor Output**:
   - Processed images will be organized in the destination directory.
   - Logs will be recorded in `error_log.log`.
   - A summary of processing statistics will be displayed upon completion.
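
Since the script walks the entire source tree, a destination placed inside the source directory would itself be picked up as input. The sketch below shows one way to guard against that before starting; `check_directories` is a hypothetical helper that is not part of the script.

```python
import os

def check_directories(source_dir, destination_dir):
    """Validate the two paths and return them as absolute paths."""
    src = os.path.abspath(source_dir)
    dst = os.path.abspath(destination_dir)
    if not os.path.isdir(src):
        raise FileNotFoundError(f"Source directory not found: {src}")
    try:
        nested = os.path.commonpath([src, dst]) == src
    except ValueError:
        # commonpath() raises ValueError for paths on different drives,
        # which means the destination cannot be inside the source
        nested = False
    if nested:
        raise ValueError("Destination must not be the source or a folder inside it")
    return src, dst
```

The returned pair could then be passed on to `organize_images`. The comparison is purely textual, so symbolic links that point back into the source tree are not detected.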

## Notes

- **Error Handling**: The script logs errors to `error_log.log`. Check this file if issues arise.
- **Performance**: The script uses `tqdm` for progress indication, which may slightly affect performance.
- **Backup**: Always back up your images before processing to prevent data loss.
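
For runs that take many hours, timestamps in the error log make it easier to relate failures to progress. The sketch below is one possible setup, not the configuration the script uses: `make_error_logger` and the logger name `image_organizer` are hypothetical.

```python
import logging

def make_error_logger(log_path, name="image_organizer"):
    """Return a logger that writes timestamped ERROR records to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.ERROR)
    if not logger.handlers:  # avoid duplicate handlers when a notebook cell is re-run
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger

# Example (commented out so that importing this sketch creates no files):
# logger = make_error_logger("error_log.log")
# logger.error("Compression failed for %s", "example.jpg")
```

Using a named logger instead of `logging.basicConfig` keeps the configuration local, which matters in notebooks where `basicConfig` has no effect once the root logger already has handlers.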

In [1]:
import os
import shutil
import hashlib
import re
import time
from PIL import Image, ExifTags
from tqdm import tqdm
import logging

# Initialize logging
logging.basicConfig(filename='error_log.log', level=logging.ERROR)

# Define compression thresholds and size limits
small_size_limit = 3 * 1024 * 1024  # 3 MB
medium_size_limit = 10 * 1024 * 1024  # 10 MB
target_size_limit = 6 * 1024 * 1024  # 6 MB
min_image_size = 0.05 * 1024 * 1024  # 0.05 MB (50 KB) - Minimum valid image size
allowed_extensions = ['.jpg', '.jpeg', '.png', '.tiff', '.bmp', '.gif']

# Directory for corrupted files and "NoDate" folder
corrupted_folder = "CorruptedImages"
nodate_folder = "NoDate"

# Ensure directories for corrupted files and NoDate exist
os.makedirs(corrupted_folder, exist_ok=True)
os.makedirs(nodate_folder, exist_ok=True)

# Function to extract the year from EXIF data
def get_image_year(file_path):
    try:
        # Use a context manager so the file handle is closed after reading
        with Image.open(file_path) as img:
            # _getexif() is a private Pillow helper that only some formats (e.g. JPEG) provide
            exif_data = img._getexif() if hasattr(img, "_getexif") else None
        if exif_data:
            for tag, value in exif_data.items():
                decoded_tag = ExifTags.TAGS.get(tag, tag)
                if decoded_tag == 'DateTimeOriginal':
                    year = str(value).split(":")[0]
                    if validate_year(year):
                        return year
    except Exception as e:
        logging.error(f"EXIF extraction failed for {file_path}: {e}")
    return None

# Function to extract the year from a filename by matching common patterns
def extract_year_from_filename(filename):
    date_patterns = [
        r"IMG[_-](\d{4})(\d{2})(\d{2})",   # Matches IMG_YYYYMMDD or IMG-YYYYMMDD
        r"(\d{4})[-_](\d{2})[-_](\d{2})",  # Matches YYYY-MM-DD or YYYY_MM_DD
        r"(\d{4})(\d{2})(\d{2})",          # Matches YYYYMMDD; group 1 captures the year only
        r"(\d{4})"                         # Matches any standalone year (fallback)
    ]
    
    for pattern in date_patterns:
        match = re.search(pattern, filename)
        if match:
            year = match.group(1)
            if validate_year(year):
                return year
    return None  # Return None if no valid date is found

# Function to validate the extracted year (only accept years between 2000 and 2030)
def validate_year(year):
    try:
        year = int(year)
        if 2000 <= year <= 2030:
            return True
    except ValueError:
        return False
    return False

# Check if the file is a valid image and is larger than 0.05MB using PIL
def is_image(file_path):
    try:
        if os.path.getsize(file_path) < min_image_size:
            return False  # Ignore images smaller than 0.05MB
        # Context manager closes the file handle once verification is done
        with Image.open(file_path) as img:
            img.verify()  # This will raise an exception if the file is not an image
        return True
    except Exception:
        # verify() can raise different exception types depending on the format
        return False

# Calculate the hash of the image to avoid duplicates
def calculate_hash(file_path):
    hash_md5 = hashlib.md5()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

# Fix image orientation based on EXIF data using Pillow's `ImageOps` module
from PIL import ImageOps  # Imported here because the import block above does not include it

def fix_orientation(img):
    try:
        # exif_transpose() applies the EXIF orientation (including mirrored variants)
        # and returns the corrected image with the orientation tag cleared
        transposed = ImageOps.exif_transpose(img)
        if transposed is not None:
            img = transposed
    except Exception:
        pass  # If the EXIF orientation data is missing or unreadable, do nothing
    return img

# Compress the image based on its size, preserve EXIF, and store it in the destination folder
def compress_image(image_path, dest_dir, target_size_mb=6):
    # Create the compressed image path inside the destination directory
    output_path = os.path.join(dest_dir, os.path.basename(image_path))

    with Image.open(image_path) as img:
        # Remember the format first: rotated copies no longer carry it
        img_format = img.format

        # Fix the image orientation based on EXIF data
        img = fix_orientation(img)

        # Pass EXIF data to save() only when the image actually has some
        save_kwargs = {}
        if img.info.get('exif'):
            save_kwargs['exif'] = img.info['exif']

        quality = 95  # Start with high quality
        img.save(output_path, format=img_format, quality=quality, **save_kwargs)

        # Lower the quality step by step until the file fits the target size
        while os.path.getsize(output_path) > target_size_mb * 1024 * 1024:
            quality -= 10
            if quality < 10:
                break  # Stop if quality becomes too low
            img.save(output_path, format=img_format, quality=quality, **save_kwargs)

    return output_path  # Return the path of the compressed image

# Function to flatten folder structure if it contains only one subfolder
def flatten_single_subfolder_folders(folder_path):
    for root, dirs, files in os.walk(folder_path):
        if os.path.abspath(root) == os.path.abspath(folder_path):
            continue  # Never lift folders out of the top-level folder itself
        # If the folder has only one directory and no files, flatten it
        if len(dirs) == 1 and not files:
            single_dir = dirs[0]
            src = os.path.join(root, single_dir)
            parent = os.path.dirname(root)
            new_path = os.path.join(parent, single_dir)
            if not os.path.exists(new_path):
                shutil.move(src, new_path)  # Move the subfolder up one level
                os.rmdir(root)  # Remove the empty intermediate folder
                dirs.clear()  # The subfolder has moved, so do not descend into the old path

# Function to remove empty folders after processing
def remove_empty_folders(destination_dir):
    for root, dirs, files in os.walk(destination_dir, topdown=False):
        # Re-list the folder: `dirs` was collected before its subfolders were removed
        if not os.listdir(root):  # If the folder is empty
            os.rmdir(root)

# Function to organize images into year-based folders (copy instead of move)
def organize_images(source_dir, destination_dir):
    os.makedirs(destination_dir, exist_ok=True)

    processed_images = set()
    if os.path.exists('processed_images.txt'):
        with open('processed_images.txt', 'r') as f:
            processed_images.update(f.read().splitlines())

    # Count all images in all folders for a single progress bar
    total_images = sum([len(files) for r, d, files in os.walk(source_dir) if files])

    image_count = 0
    total_size_before = 0
    total_size_after = 0

    with open('processed_images.txt', 'a') as processed_file, tqdm(total=total_images, desc="Processing images") as pbar:
        for root, dirs, files in os.walk(source_dir):
            for file in files:
                file_path = os.path.join(root, file)
                if not is_image(file_path) or file_path in processed_images:
                    pbar.update(1)
                    continue

                image_count += 1
                total_size_before += os.path.getsize(file_path)

                # Calculate the hash to avoid duplicate processing
                file_hash = calculate_hash(file_path)
                if file_hash in processed_images:
                    pbar.update(1)
                    continue  # Skip duplicate images

                # Extract the year from EXIF data or filename, validate the year
                year = get_image_year(file_path) or extract_year_from_filename(file) or "NoDate"

                # Define destination directory with year and folder structure
                relative_path = os.path.relpath(root, source_dir)
                dest_dir = os.path.join(destination_dir, year, relative_path)
                os.makedirs(dest_dir, exist_ok=True)

                # Determine compression strategy based on size
                file_size = os.path.getsize(file_path)
                if file_size > small_size_limit:
                    try:
                        # Compress and save directly in the destination folder (no extra copy step)
                        compress_image(file_path, dest_dir)
                        total_size_after += os.path.getsize(os.path.join(dest_dir, file))
                    except PermissionError as e:
                        logging.error(f"Permission denied for {file_path}: {e}")
                        pbar.update(1)
                        continue
                    except Exception as e:
                        logging.error(f"Compression failed for {file_path}: {e}")
                        # Retry copying the file to the corrupted folder
                        retries = 3
                        for attempt in range(retries):
                            try:
                                shutil.copy2(file_path, os.path.join(corrupted_folder, file))
                                break
                            except PermissionError as e:
                                logging.error(f"Retry {attempt+1}/{retries} failed for {file_path}: {e}")
                                time.sleep(1)  # Sleep for a second and retry
                                if attempt == retries - 1:
                                    logging.error(f"Copying to CorruptedImages failed for {file_path}: {e}")
                                    continue
                else:
                    try:
                        shutil.copy2(file_path, os.path.join(dest_dir, file))  # Copy without compression
                    except PermissionError as e:
                        logging.error(f"Permission denied for {file_path}: {e}")
                        pbar.update(1)
                        continue

                processed_file.write(f"{file_hash}\n")
                processed_images.add(file_hash)
                pbar.update(1)

        # Flatten folder structure after all files are processed
        flatten_single_subfolder_folders(destination_dir)

    # Remove empty folders
    remove_empty_folders(destination_dir)

    # Final summary
    print(f"Processed {image_count} images.")
    print(f"Total size before: {total_size_before / (1024 * 1024)} MB")
    print(f"Total size after: {total_size_after / (1024 * 1024)} MB")
    
# Example usage
source_directory = r"E:\Pictures"  # A raw string literal cannot end with a single backslash
destination_directory = r"C:\Users\XYZ\OneDrive\Media"

organize_images(source_directory, destination_directory)



Processing images: 100%|██████████| 202318/202318 [13:44:58<00:00,  4.09it/s]   


Processed 172982 images.
Total size before: 444309.82852745056 MB
Total size after: 171154.7778558731 MB
