A local-only tool for finding and removing duplicate (or near-duplicate) photos from a directory tree, with both a command-line interface and a browser UI for visual review. This package is designed around deduplicating photos (not videos) from an iPhone.
Built for cleaning up exported Apple Photos libraries (handles .heic/.heif natively), but works on any folder of .jpg/.jpeg/.png/.heic/.heif files.
- Three detection strategies, from strictest to most forgiving:
  - SHA-256 — byte-identical duplicates only (false-positive rate: zero)
  - MSE — mean-squared-error on downsampled tensors (catches re-saves and minor edits)
  - pHash — perceptual hashing + Hamming distance (catches crops, rotations, re-encodes; fastest of the three)
- HEIC/HEIF support via pillow-heif.
- Multiprocess hashing so large libraries finish in minutes, not hours.
- Web UI with an interactive threshold slider — re-cluster instantly without re-hashing.
- Safe by construction: never deletes; only ever copies survivors to a new directory.
Requires Python 3.13+. Uses uv for environment management.
```shell
git clone git@github.com:Aspho1/PhotoDupeCleaner.git PhotoDupeCleaner
cd PhotoDupeCleaner
uv sync
```

That's it: `uv sync` reads `pyproject.toml` and creates the venv with all dependencies.
```shell
uv run ./main.py -d <input-dir> [-o <output-dir>] [options]
```

Find exact duplicates and print the groups (no files written):

```shell
uv run ./main.py -d ~/Pictures/phone -m 0
```

Find near-duplicates by perceptual hash and write a deduplicated copy:

```shell
uv run ./main.py -d ~/Pictures/phone -o ~/Pictures/phone-clean -m 2
```

Use 8 processes, flatten the directory tree on output, exclude Thumbnails:

```shell
uv run ./main.py -d ~/Pictures/phone -o ~/Pictures/phone-clean \
    -m 2 -p 8 -s false -e Thumbnails
```

| Flag | Default | Purpose |
|---|---|---|
| `-d, --directory` | required | Top directory to walk. |
| `-o, --output-directory` | none | If given, mirrors input → output, skipping every non-first member of each duplicate group. Created if missing. |
| `-e, --exclude` | `[]` | One or more substrings; any path containing one is skipped. |
| `-s, --preserve-structure` | `true` | When `false`, the output is flattened into a single directory. |
| `-p, --processes` | `1` | Worker process count. `-1` uses all cores. |
| `-m, --method` | `0` | 0 = SHA-256, 1 = MSE, 2 = pHash. See Methods. |
| `-x, --px-size` | `64` | Edge length for the downscaled tensor used by method=1. |
| `-t, --threshold` | method-specific | Distance cutoff. method=1: MSE float (default 15.0). method=2: Hamming-distance bits (default 10). |
When --output-directory is set, every detected duplicate group keeps
exactly one member — the first one in the group's internal ordering —
and discards the rest. If you want manual control over which member is
kept, use the web UI.
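For illustration, the keep-first rule can be sketched in a few lines. This is a hypothetical helper, not the project's code; the real copying logic lives in `core.py`'s `copy_with_discards`:

```python
import shutil
from pathlib import Path

def copy_keeping_first(groups, src_root, dst_root, preserve_structure=True):
    """Copy one survivor per duplicate group: the first member.

    `groups` is a list of lists of paths relative to `src_root`.
    """
    dst_root = Path(dst_root)
    for group in groups:
        survivor = group[0]  # first member of the group's internal ordering
        src = Path(src_root) / survivor
        # Mirror the tree, or flatten everything into one directory.
        dst = dst_root / survivor if preserve_structure else dst_root / Path(survivor).name
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
```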
```shell
uv run python app.py
```

Then open http://127.0.0.1:5000.
The web UI is the right choice when you want to see the duplicates before committing. The pipeline:
- Browse / scan a directory → see per-extension counts.
- Hash all images (slow step, parallelized; uses pHash with default hash_size=8, so 64 bits per image).
- Cluster with a threshold slider. The slider is instant because all pair distances are pre-bucketed; re-clustering is just a union-find sweep over a prefix of those buckets.
- Review clusters, click thumbnails to deselect anything you want to keep.
- Copy the survivors into an output directory.
The server is local-only and single-session by design — it uses module globals for state rather than a session store, because there's nothing to multiplex.
SHA-256 (method 0) is two-pass: bucket by file size first (one stat per file), then SHA-256 only the buckets with collisions. This skips hashing for the vast majority of unique files. It finds byte-identical copies but nothing else; a re-saved JPEG is invisible to this method.
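A minimal sketch of the two-pass idea, with a hypothetical function name (the real implementation is `sha256_duplicates` in `core.py`):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_duplicate_groups(paths):
    """Two-pass exact-duplicate detection (illustrative sketch).

    Pass 1: bucket by file size (one stat each); a unique size can't collide.
    Pass 2: SHA-256 only files whose size bucket has 2+ members.
    """
    by_size = defaultdict(list)
    for p in map(Path, paths):
        by_size[p.stat().st_size].append(p)

    groups = defaultdict(list)
    for bucket in by_size.values():
        if len(bucket) < 2:
            continue  # unique size: skip hashing entirely
        for p in bucket:
            groups[hashlib.sha256(p.read_bytes()).hexdigest()].append(p)

    return [g for g in groups.values() if len(g) > 1]
```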
MSE (method 1): every image is resized to px_size × px_size RGB and compared pairwise via mean-squared error. Catches small edits and minor re-encodings. Cost is O(n²) in image count, but each pair is just an elementwise diff on a tiny tensor, so the constant is small. Tune --px-size and --threshold together: smaller px_size is faster but loses detail; higher threshold admits more pairs.
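The comparison step might look like this sketch, which assumes the downsampled tensors are already computed (the real pipeline is `compute_tensors` → `mse_pairs` in `core.py`):

```python
import numpy as np

def mse_pairs(tensors, threshold=15.0):
    """All-pairs mean-squared error on downsampled RGB tensors (method 1 sketch).

    `tensors` maps path -> (px_size, px_size, 3) uint8 array; pairs at or
    below `threshold` count as duplicates. O(n^2) pairs, but each comparison
    is one elementwise diff on a tiny tensor.
    """
    items = list(tensors.items())
    pairs = []
    for i, (path_a, ta) in enumerate(items):
        a = ta.astype(np.float64)
        for path_b, tb in items[i + 1:]:
            err = float(np.mean((a - tb.astype(np.float64)) ** 2))
            if err <= threshold:
                pairs.append((path_a, path_b, err))
    return pairs
```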
pHash (method 2): each image is reduced to a 64-bit perceptual hash (imagehash.phash); two images are duplicates when their hashes differ in <= threshold bits. pHash is robust to crops, rotations, and re-encodes, and comparing two hashes is one XOR plus a bit_count, so the all-pairs step is orders of magnitude faster than MSE. The fastest and most forgiving method.
The web UI uses this method exclusively.
Three files, one job each:
core.py — all the actual work (walking, hashing, clustering, copying)
main.py — argparse CLI on top of core
app.py — Flask routes on top of core
core.py is the contract. Both entry points reduce to "parse
input → call into core → format output". Adding a new comparison
strategy means adding one function to core.py and one branch each to
main._find_duplicate_groups and (optionally) an /api/* route.
┌──────────────┐
│ walk_images │ recursive walk, excludes, allowed exts
└──────┬───────┘
│ list[rel paths]
┌─────────────┼──────────────┐
│ │ │
▼ ▼ ▼
sha256_duplicates compute_phashes compute_tensors → mse_pairs
│ │ │
│ ▼ │
│ build_distance_buckets │
│ │ │
│ ▼ │
│ cluster_at_threshold │
│ │ │
└─────────────┴───────┬──────┘
│ list[list[rel path]] (duplicate groups)
▼
copy_with_discards
The pHash path is split deliberately: bucketing every pair by exact
Hamming distance once (build_distance_buckets, O(n²)) lets the
clustering step re-run at any threshold in roughly O(merges) — which is
what makes the web UI's threshold slider feel instant.
The mst_max_edge helper returns the smallest threshold at which all
images merge into one component — used by the web UI to set the
slider's upper bound, so the user can't drag past the point where the
output stops changing.
PhotoDupeCleaner/
├── core.py shared library — all heavy lifting
├── main.py CLI entry point
├── app.py Flask entry point
├── static/ web UI assets (CSS/JS)
├── templates/ web UI Jinja templates
├── pyproject.toml deps, metadata, tool config
└── uv.lock resolved dependency graph
```shell
uv sync --extra dev     # installs black, pytest, etc.
uv run black .          # format everything
uv run black --check .  # what CI runs (formatting gate)
uv run pytest           # run the test suite
```

Tests live under tests/ and cover the pure-logic parts of
core.py plus end-to-end pipelines through the Flask test client and
the CLI subprocess. The CI workflow at
.github/workflows/ci.yml runs both
black --check and pytest on every push to main and on every PR.
If either fails locally, fix it before pushing — there are no
auto-format / auto-commit shortcuts wired in.
MIT.