PhotoDupeCleaner

A local-only tool for finding and removing duplicate (or near-duplicate) photos from a directory tree, with both a command-line interface and a browser UI for visual review. It is designed around deduplicating photos (not videos) exported from an iPhone.

Built for cleaning up exported Apple Photos libraries (handles .heic/.heif natively), but works on any folder of .jpg/.jpeg/.png/.heic/.heif files.


Features

  • Three detection strategies, from strictest to most forgiving:
    • SHA-256 — byte-identical duplicates only (false-positive rate: zero)
    • MSE — mean-squared-error on downsampled tensors (catches re-saves and minor edits)
    • pHash — perceptual hashing + Hamming distance (catches re-saves, re-encodes, and mild edits; the fastest of the three)
  • HEIC/HEIF support via pillow-heif.
  • Multiprocess hashing so large libraries finish in minutes, not hours.
  • Web UI with an interactive threshold slider — re-cluster instantly without re-hashing.
  • Safe by construction: never deletes; only ever copies survivors to a new directory.

Install

Requires Python 3.13+. Uses uv for environment management.

git clone git@github.com:Aspho1/PhotoDupeCleaner.git PhotoDupeCleaner
cd PhotoDupeCleaner
uv sync

That's it — uv sync reads pyproject.toml and creates the venv with all dependencies.


CLI usage

uv run ./main.py -d <input-dir> [-o <output-dir>] [options]

Quick examples

Find exact duplicates and print the groups (no files written):

uv run ./main.py -d ~/Pictures/phone -m 0

Find near-duplicates by perceptual hash and write a deduplicated copy:

uv run ./main.py -d ~/Pictures/phone -o ~/Pictures/phone-clean -m 2

Use 8 processes, drop the directory tree on output, exclude Thumbnails:

uv run ./main.py -d ~/Pictures/phone -o ~/Pictures/phone-clean \
    -m 2 -p 8 -s false -e Thumbnails

All flags

| Flag | Default | Purpose |
|------|---------|---------|
| `-d, --directory` | required | Top directory to walk. |
| `-o, --output-directory` | none | If given, mirrors input → output, skipping every non-first member of each duplicate group. Created if missing. |
| `-e, --exclude` | `[]` | One or more substrings; any path containing one is skipped. |
| `-s, --preserve-structure` | `true` | When `false`, the output is flattened into a single directory. |
| `-p, --processes` | `1` | Worker process count. `-1` uses all cores. |
| `-m, --method` | `0` | `0`=SHA-256, `1`=MSE, `2`=pHash. See Methods. |
| `-x, --px-size` | `64` | Edge length for the downscaled tensor used by `method=1`. |
| `-t, --threshold` | method-specific | Distance cutoff. `method=1`: MSE float (default `15.0`). `method=2`: Hamming-distance bits (default `10`). |

Selection policy

When --output-directory is set, every detected duplicate group keeps exactly one member — the first one in the group's internal ordering — and discards the rest. If you want manual control over which member is kept, use the web UI.
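In sketch form, the keep-the-first policy looks like the following. This is an illustrative standalone version, not the actual `copy_with_discards` in `core.py`; the function name `copy_survivors` and its signature are assumptions for the example:

```python
import shutil
from pathlib import Path

def copy_survivors(src_root, dst_root, all_paths, groups, preserve_structure=True):
    """Copy every file except the non-first members of each duplicate group.

    all_paths: relative paths found by the walk.
    groups: duplicate groups, each an ordered list of relative paths (g[0] is kept).
    """
    discard = {p for g in groups for p in g[1:]}          # keep g[0], drop the rest
    src_root, dst_root = Path(src_root), Path(dst_root)
    copied = []
    for rel in all_paths:
        if rel in discard:
            continue
        dst = dst_root / rel if preserve_structure else dst_root / Path(rel).name
        dst.parent.mkdir(parents=True, exist_ok=True)     # create output dirs lazily
        shutil.copy2(src_root / rel, dst)                 # copy, never delete
        copied.append(rel)
    return copied
```

Note that the source tree is never modified: survivors are copied out, duplicates are simply not copied.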


Web UI

uv run python app.py

Then open http://127.0.0.1:5000.

The web UI is the right choice when you want to see the duplicates before committing. The pipeline:

  1. Browse / scan a directory → see per-extension counts.
  2. Hash all images (slow step, parallelized — uses pHash with default hash_size=8, so 64 bits per image).
  3. Cluster with a threshold slider. The slider is instant because all pair distances are pre-bucketed; re-clustering is just a union-find sweep over a prefix of those buckets.
  4. Review clusters, click thumbnails to deselect anything you want to keep.
  5. Copy the survivors into an output directory.

The server is local-only and single-session by design — it uses module globals for state rather than a session store, because there's nothing to multiplex.


Methods

-m 0 — SHA-256 (exact duplicates)

Two-pass: bucket by file size first (one stat per file), then SHA-256 only the buckets with collisions. Skips hashing for the vast majority of unique files. Finds byte-identical copies but nothing else — a re-saved JPEG is invisible to this method.
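A minimal sketch of the two-pass idea (illustrative only; the real `sha256_duplicates` in `core.py` may differ in details such as chunked reads for large files):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_duplicates(paths):
    """Bucket by file size first, then SHA-256 only the buckets with collisions."""
    by_size = defaultdict(list)
    for p in paths:                                  # pass 1: one stat per file
        by_size[Path(p).stat().st_size].append(p)

    groups = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) < 2:                       # unique size: cannot be a duplicate
            continue
        for p in same_size:                          # pass 2: hash collision buckets only
            digest = hashlib.sha256(Path(p).read_bytes()).hexdigest()
            groups[digest].append(p)
    return [g for g in groups.values() if len(g) > 1]
```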

-m 1 — MSE on downsampled tensors

Every image is resized to px_size × px_size RGB and compared pairwise via mean-squared-error. Catches small edits and minor re-encodings. Cost is O(n²) in image count, but each pair is just an elementwise diff on a tiny tensor so the constant is small. Tune --px-size and --threshold together — smaller px_size is faster but loses detail; higher threshold admits more pairs.
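The pairwise comparison can be sketched like this, assuming each tensor has already been produced by resizing its image to px_size × px_size RGB (the real code does the resize with Pillow; the names here are illustrative):

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error between two equally-shaped image tensors."""
    return float(np.mean((a.astype(np.float32) - b.astype(np.float32)) ** 2))

def mse_pairs(tensors: dict, threshold: float = 15.0):
    """All-pairs comparison; yields (a, b) for every pair at or under threshold."""
    names = sorted(tensors)
    return [(x, y)
            for i, x in enumerate(names)
            for y in names[i + 1:]                 # O(n^2) pairs, tiny per-pair cost
            if mse(tensors[x], tensors[y]) <= threshold]
```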

-m 2 — Perceptual hash + Hamming distance

Each image is reduced to a 64-bit perceptual hash (imagehash.phash); two images are duplicates when their hashes differ in <= threshold bits. pHash is robust to re-encodes, resizes, and mild edits (large crops and rotations can still defeat it), and comparing two hashes is one XOR plus a bit count, so the all-pairs step is orders of magnitude faster than MSE. The fastest and most forgiving method.

The web UI uses this method exclusively.


How it works (architecture)

Three files, one job each:

core.py     — all the actual work (walking, hashing, clustering, copying)
main.py     — argparse CLI on top of core
app.py      — Flask routes on top of core

core.py is the contract. Both entry points reduce to "parse input → call into core → format output". Adding a new comparison strategy means adding one function to core.py and one branch each to main._find_duplicate_groups and (optionally) an /api/* route.

Pipeline at a glance

              ┌──────────────┐
              │ walk_images  │  recursive walk, excludes, allowed exts
              └──────┬───────┘
                     │ list[rel paths]
       ┌─────────────┼──────────────┐
       │             │              │
       ▼             ▼              ▼
 sha256_duplicates  compute_phashes  compute_tensors → mse_pairs
       │             │              │
       │             ▼              │
       │     build_distance_buckets │
       │             │              │
       │             ▼              │
       │      cluster_at_threshold  │
       │             │              │
       └─────────────┴───────┬──────┘
                             │ list[list[rel path]]   (duplicate groups)
                             ▼
                     copy_with_discards

The pHash path is split deliberately: bucketing every pair by exact Hamming distance once (build_distance_buckets, O(n²)) lets the clustering step re-run at any threshold in roughly O(merges) — which is what makes the web UI's threshold slider feel instant.

The mst_max_edge helper returns the smallest threshold at which all images merge into one component — used by the web UI to set the slider's upper bound, so the user can't drag past the point where the output stops changing.
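One way to compute that bound is a Kruskal-style sweep over the same distance buckets: merge pairs in ascending distance order and report the distance of the last merge that was needed. This is an illustrative sketch; the real helper's exact signature is not shown in this README:

```python
def mst_max_edge(names, buckets):
    """Smallest threshold at which all names fall into one component."""
    parent = {n: n for n in names}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path halving
            n = parent[n]
        return n
    components, answer = len(names), 0
    for d in sorted(buckets):               # ascending Hamming distance
        for x, y in buckets[d]:
            rx, ry = find(x), find(y)
            if rx != ry:                    # this edge joins two components
                parent[rx] = ry
                components -= 1
                answer = d                  # distance of the most recent merge
        if components == 1:
            return answer                   # everything merged; stop early
    return answer
```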


Project layout

PhotoDupeCleaner/
├── core.py            shared library — all heavy lifting
├── main.py            CLI entry point
├── app.py             Flask entry point
├── static/            web UI assets (CSS/JS)
├── templates/         web UI Jinja templates
├── pyproject.toml     deps, metadata, tool config
└── uv.lock            resolved dependency graph

Development

uv sync --extra dev      # installs black, pytest, etc.
uv run black .           # format everything
uv run black --check .   # what CI runs (formatting gate)
uv run pytest            # run the test suite

Tests live under tests/ and cover the pure-logic parts of core.py plus end-to-end pipelines through the Flask test client and the CLI subprocess. The CI workflow at .github/workflows/ci.yml runs both black --check and pytest on every push to main and on every PR. If either fails locally, fix it before pushing — there are no auto-format / auto-commit shortcuts wired in.


License

MIT.
