PhotoDupeCleaner

A local-only tool for finding and removing duplicate (or near-duplicate) photos from a directory tree, with both a command-line interface and a browser UI for visual review. It is designed around deduplicating photos (not videos) exported from an iPhone.

Built for cleaning up exported Apple Photos libraries (handles .heic/.heif natively), but works on any folder of .jpg/.jpeg/.png/.heic/.heif files.


Features

  • Three detection strategies, from strictest to most forgiving:
    • SHA-256 — byte-identical duplicates only (false-positive rate: zero)
    • MSE — mean-squared-error on downsampled tensors (catches re-saves and minor edits)
    • pHash — perceptual hashing + Hamming distance (catches re-saves, re-encodes, and mild edits; the fastest of the three)
  • HEIC/HEIF support via pillow-heif.
  • Multiprocess hashing so large libraries finish in minutes, not hours.
  • Web UI with an interactive threshold slider — re-cluster instantly without re-hashing.
  • Safe by construction: never deletes; only ever copies survivors to a new directory.

Install

Requires Python 3.13+. Uses uv for environment management.

git clone git@github.com:Aspho1/PhotoDupeCleaner.git PhotoDupeCleaner
cd PhotoDupeCleaner
uv sync

That's it — uv sync reads pyproject.toml and creates the venv with all dependencies.


CLI usage

uv run ./main.py -d <input-dir> [-o <output-dir>] [options]

Quick examples

Find exact duplicates and print the groups (no files written):

uv run ./main.py -d ~/Pictures/phone -m 0

Find near-duplicates by perceptual hash and write a deduplicated copy:

uv run ./main.py -d ~/Pictures/phone -o ~/Pictures/phone-clean -m 2

Use 8 processes, drop the directory tree on output, exclude Thumbnails:

uv run ./main.py -d ~/Pictures/phone -o ~/Pictures/phone-clean \
    -m 2 -p 8 -s false -e Thumbnails

All flags

| Flag | Default | Purpose |
|------|---------|---------|
| `-d, --directory` | required | Top directory to walk. |
| `-o, --output-directory` | none | If given, mirrors input → output, skipping every non-first member of each duplicate group. Created if missing. |
| `-e, --exclude` | `[]` | One or more substrings; any path containing one is skipped. |
| `-s, --preserve-structure` | `true` | When `false`, the output is flattened into a single directory. |
| `-p, --processes` | `1` | Worker process count. `-1` uses all cores. |
| `-m, --method` | `0` | `0`=SHA-256, `1`=MSE, `2`=pHash. See Methods. |
| `-x, --px-size` | `64` | Edge length for the downscaled tensor used by `method=1`. |
| `-t, --threshold` | method-specific | Distance cutoff. `method=1`: MSE float (default `15.0`). `method=2`: Hamming-distance bits (default `10`). |

Selection policy

When --output-directory is set, every detected duplicate group keeps exactly one member — the first one in the group's internal ordering — and discards the rest. If you want manual control over which member is kept, use the web UI.
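In sketch form, the keep-the-first policy looks like the following. This is an illustrative standalone version, not the actual `copy_with_discards` in `core.py`; the function name `copy_survivors` and its signature are assumptions for the example:

```python
import shutil
from pathlib import Path

def copy_survivors(src_root, dst_root, all_paths, groups, preserve_structure=True):
    """Copy every file except the non-first members of each duplicate group.

    all_paths: relative paths found by the walk.
    groups: duplicate groups, each an ordered list of relative paths (g[0] is kept).
    """
    discard = {p for g in groups for p in g[1:]}          # keep g[0], drop the rest
    src_root, dst_root = Path(src_root), Path(dst_root)
    copied = []
    for rel in all_paths:
        if rel in discard:
            continue
        dst = dst_root / rel if preserve_structure else dst_root / Path(rel).name
        dst.parent.mkdir(parents=True, exist_ok=True)     # create output dirs lazily
        shutil.copy2(src_root / rel, dst)                 # copy, never delete
        copied.append(rel)
    return copied
```

Note that the source tree is never modified: survivors are copied out, duplicates are simply not copied.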


Web UI

uv run python app.py

Then open http://127.0.0.1:5000.

The web UI is the right choice when you want to see the duplicates before committing. The pipeline:

  1. Browse / scan a directory → see per-extension counts.
  2. Hash all images (slow step, parallelized — uses pHash with default hash_size=8, so 64 bits per image).
  3. Cluster with a threshold slider. The slider is instant because all pair distances are pre-bucketed; re-clustering is just a union-find sweep over a prefix of those buckets.
  4. Review clusters, click thumbnails to deselect anything you want to keep.
  5. Copy the survivors into an output directory.

The server is local-only and single-session by design — it uses module globals for state rather than a session store, because there's nothing to multiplex.


Methods

-m 0 — SHA-256 (exact duplicates)

Two-pass: bucket by file size first (one stat per file), then SHA-256 only the buckets with collisions. Skips hashing for the vast majority of unique files. Finds byte-identical copies but nothing else — a re-saved JPEG is invisible to this method.
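A minimal sketch of the two-pass idea (illustrative only; the real `sha256_duplicates` in `core.py` may differ in details such as chunked reads for large files):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_duplicates(paths):
    """Bucket by file size first, then SHA-256 only the buckets with collisions."""
    by_size = defaultdict(list)
    for p in paths:                                  # pass 1: one stat per file
        by_size[Path(p).stat().st_size].append(p)

    groups = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) < 2:                       # unique size: cannot be a duplicate
            continue
        for p in same_size:                          # pass 2: hash collision buckets only
            digest = hashlib.sha256(Path(p).read_bytes()).hexdigest()
            groups[digest].append(p)
    return [g for g in groups.values() if len(g) > 1]
```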

-m 1 — MSE on downsampled tensors

Every image is resized to px_size × px_size RGB and compared pairwise via mean-squared-error. Catches small edits and minor re-encodings. Cost is O(n²) in image count, but each pair is just an elementwise diff on a tiny tensor so the constant is small. Tune --px-size and --threshold together — smaller px_size is faster but loses detail; higher threshold admits more pairs.
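The pairwise comparison can be sketched like this, assuming each tensor has already been produced by resizing its image to px_size × px_size RGB (the real code does the resize with Pillow; the names here are illustrative):

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error between two equally-shaped image tensors."""
    return float(np.mean((a.astype(np.float32) - b.astype(np.float32)) ** 2))

def mse_pairs(tensors: dict, threshold: float = 15.0):
    """All-pairs comparison; yields (a, b) for every pair at or under threshold."""
    names = sorted(tensors)
    return [(x, y)
            for i, x in enumerate(names)
            for y in names[i + 1:]                 # O(n^2) pairs, tiny per-pair cost
            if mse(tensors[x], tensors[y]) <= threshold]
```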

-m 2 — Perceptual hash + Hamming distance

Each image is reduced to a 64-bit perceptual hash (imagehash.phash); two images are duplicates when their hashes differ in <= threshold bits. pHash is robust to re-encodes, resizes, and mild edits (large crops and rotations can still defeat it), and comparing two hashes is one XOR plus a bit count, so the all-pairs step is orders of magnitude faster than MSE. The fastest and most forgiving method.

The web UI uses this method exclusively.


How it works (architecture)

Three files, one job each:

core.py     — all the actual work (walking, hashing, clustering, copying)
main.py     — argparse CLI on top of core
app.py      — Flask routes on top of core

core.py is the contract. Both entry points reduce to "parse input → call into core → format output". Adding a new comparison strategy means adding one function to core.py and one branch each to main._find_duplicate_groups and (optionally) an /api/* route.

Pipeline at a glance

              ┌──────────────┐
              │ walk_images  │  recursive walk, excludes, allowed exts
              └──────┬───────┘
                     │ list[rel paths]
       ┌─────────────┼──────────────┐
       │             │              │
       ▼             ▼              ▼
 sha256_duplicates  compute_phashes  compute_tensors → mse_pairs
       │             │              │
       │             ▼              │
       │     build_distance_buckets │
       │             │              │
       │             ▼              │
       │      cluster_at_threshold  │
       │             │              │
       └─────────────┴───────┬──────┘
                             │ list[list[rel path]]   (duplicate groups)
                             ▼
                     copy_with_discards

The pHash path is split deliberately: bucketing every pair by exact Hamming distance once (build_distance_buckets, O(n²)) lets the clustering step re-run at any threshold in roughly O(merges) — which is what makes the web UI's threshold slider feel instant.

The mst_max_edge helper returns the smallest threshold at which all images merge into one component — used by the web UI to set the slider's upper bound, so the user can't drag past the point where the output stops changing.
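One way to compute that bound is a Kruskal-style sweep over the same distance buckets: merge pairs in ascending distance order and report the distance of the last merge that was needed. This is an illustrative sketch; the real helper's exact signature is not shown in this README:

```python
def mst_max_edge(names, buckets):
    """Smallest threshold at which all names fall into one component."""
    parent = {n: n for n in names}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path halving
            n = parent[n]
        return n
    components, answer = len(names), 0
    for d in sorted(buckets):               # ascending Hamming distance
        for x, y in buckets[d]:
            rx, ry = find(x), find(y)
            if rx != ry:                    # this edge joins two components
                parent[rx] = ry
                components -= 1
                answer = d                  # distance of the most recent merge
        if components == 1:
            return answer                   # everything merged; stop early
    return answer
```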


Project layout

PhotoDupeCleaner/
├── core.py            shared library — all heavy lifting
├── main.py            CLI entry point
├── app.py             Flask entry point
├── static/            web UI assets (CSS/JS)
├── templates/         web UI Jinja templates
├── pyproject.toml     deps, metadata, tool config
└── uv.lock            resolved dependency graph

Development

uv sync --extra dev      # installs black, pytest, etc.
uv run black .           # format everything
uv run black --check .   # what CI runs (formatting gate)
uv run pytest            # run the test suite

Tests live under tests/ and cover the pure-logic parts of core.py plus end-to-end pipelines through the Flask test client and the CLI subprocess. The CI workflow at .github/workflows/ci.yml runs both black --check and pytest on every push to main and on every PR. If either fails locally, fix it before pushing — there are no auto-format / auto-commit shortcuts wired in.


License

MIT.
