28 changes: 28 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,28 @@
name: CI

on:
  push:
  pull_request:

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6

      - uses: actions/setup-python@v6
        with:
          python-version: "3.11"

      - uses: astral-sh/setup-uv@v8.0.0

      - run: uv sync --all-extras --dev

      - name: Lint
        run: uv run ruff check .

      - name: Type check
        run: uv run mypy multiclean/

      - name: Test
        run: uv run pytest tests/ -x -q
43 changes: 43 additions & 0 deletions .github/workflows/publish.yml
@@ -0,0 +1,43 @@
name: Publish to PyPI

on:
  push:
    tags:
      - "v*"

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
        with:
          fetch-depth: 0
          fetch-tags: true

      - uses: actions/setup-python@v6
        with:
          python-version: "3.11"

      - uses: astral-sh/setup-uv@v8.0.0

      - name: Build sdist and wheel
        run: uv build

      - uses: actions/upload-artifact@v6
        with:
          name: dist
          path: dist/

  publish:
    needs: build
    runs-on: ubuntu-latest
    environment: pypi
    permissions:
      id-token: write
    steps:
      - uses: actions/download-artifact@v8
        with:
          name: dist
          path: dist/

      - uses: pypa/gh-action-pypi-publish@release/v1
15 changes: 15 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,15 @@
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.15.8
    hooks:
      - id: ruff-check
      - id: ruff-format
  - repo: local
    hooks:
      - id: pytest
        name: pytest
        entry: uv run pytest tests/ -x -q
        language: system
        pass_filenames: false
        always_run: true
        stages: [pre-push]
63 changes: 63 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,63 @@
# Changelog

All notable changes to MultiClean are documented here.

## [Unreleased]

## [0.3.0] - 2026-05-02

### Changed
- **Performance.** `clean_array` is substantially faster on multi-class inputs.
  On a 15669×18633 / 147-class land-use raster, end-to-end runtime dropped
  from ~85 s to ~40 s. On the 8011×7901 / 4-class Landsat cloud-and-shadow
  example, runtime dropped from ~2.5 s to ~1.1 s. Wins came from:
  - Replacing the float32 smoothed-labels buffer with a `uint8`/`uint16`
    class-code array (selected automatically based on class count). The
    per-class equality scan is 2-4× cheaper in memory bandwidth.
  - Combining per-class small-island masks in flight instead of
    accumulating all K of them first.
  - Filling invalid pixels in place rather than allocating a copy.
  - Replacing `scipy.ndimage.distance_transform_edt` with
    `cv2.distanceTransformWithLabels` for the nearest-valid fill (~3.4×
    faster on the fill stage). Both compute the same minimum L2 distance
    to the nearest valid pixel; outputs can differ only in which
    equidistant source pixel wins a tie.
- **dtype preservation.** The output now strictly matches the input dtype.
Previously the pipeline routed everything through float32 internally,
which silently downcast `float64` inputs and rounded `int32` values
larger than 2²⁴ (and `int64` values larger than 2⁵³).
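
The class-code trick described above can be sketched in a few lines (an illustration only; the helper name `encode_classes` is made up here and is not part of the library):

```python
import numpy as np


def encode_classes(array: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Map class values to dense codes 0..K-1 in a narrow integer dtype."""
    values = np.unique(array)  # sorted class values
    # uint8 covers up to 256 classes; fall back to uint16 beyond that.
    dtype = np.uint8 if len(values) <= 256 else np.uint16
    codes = np.searchsorted(values, array).astype(dtype)
    return codes, values  # values[codes] decodes back to class values


arr = np.array([[10, 10, 30], [30, 20, 20]], dtype=np.int64)
codes, lut = encode_classes(arr)
restored = lut[codes]  # bit-exact round trip, no float intermediate
```

Per-class scans over `codes` touch one byte per pixel instead of four, which is where the memory-bandwidth saving comes from.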

### Fixed
- All-NaN float input with `fill_nan=True` now deterministically returns
an all-NaN array. The previous code relied on whatever value
`np.empty` happened to leave in the sentinel slot.
- Large integer class values (`int32` > 2²⁴, `int64` > 2⁵³) are now
preserved bit-exactly, instead of being silently rounded by the
internal float32 round-trip.

### Removed
- Dropped the `scipy` runtime dependency. `cv2` (already a runtime
dependency) now handles the distance-transform fill.

## [0.2.0] - 2025-09-03

### Added
- `fill_nan` option on `clean_array`: when `True`, NaN values in float
input arrays are filled from the nearest valid pixel rather than
preserved as nodata.

## [0.1.0] - 2025-09-02

### Added
- Initial public release.
- `clean_array` API for morphological cleaning of multi-class 2D arrays:
per-class edge smoothing (morphological opening), per-class small-island
removal (connected components), and gap filling using nearest-valid via
Euclidean distance transform.
- Documentation: README, two example notebooks (land use, cloud
shadow), and a Google Colab tutorial notebook.

[Unreleased]: https://github.com/DPIRD-DMA/MultiClean/compare/v0.3.0...HEAD
[0.3.0]: https://github.com/DPIRD-DMA/MultiClean/compare/v0.2.0...v0.3.0
[0.2.0]: https://github.com/DPIRD-DMA/MultiClean/compare/v0.1.0...v0.2.0
[0.1.0]: https://github.com/DPIRD-DMA/MultiClean/releases/tag/v0.1.0
12 changes: 9 additions & 3 deletions README.md
@@ -61,14 +61,14 @@ MultiClean is designed for cleaning segmentation outputs from:
- **Edge smoothing**: Morphological opening to reduce jagged boundaries
- **Island removal**: Remove small connected components per class
- **Gap filling**: Fill invalids via nearest valid class (distance transform)
- **Fast**: NumPy + OpenCV + SciPy with parallelism
- **Fast**: NumPy + OpenCV with parallelism


## How It Works

MultiClean uses morphological operations to clean classification arrays:

1. **Edge smoothing (per class)**: Morphological opening with an elliptical kernel.
1. **Edge smoothing (per class)**: Morphological opening with a circular kernel.
2. **Island removal (per class)**: Find connected components (OpenCV) and mark components with area `< min_island_size` as invalid.
3. **Gap filling**: Compute a distance transform to copy the nearest valid class into invalid pixels.
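
Step 2 can be sketched in pure Python (for illustration only; the library uses OpenCV's connected-components routine, and the function name here is made up):

```python
from collections import deque

import numpy as np


def drop_small_islands(mask: np.ndarray, min_island_size: int,
                       connectivity: int = 4) -> np.ndarray:
    """Mark connected components of True pixels smaller than
    min_island_size as False, via a simple BFS labeling."""
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if connectivity == 8:
        offsets += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    out = mask.copy()
    seen = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    for r in range(h):
        for c in range(w):
            if mask[r, c] and not seen[r, c]:
                # Flood one component, collecting its pixels.
                comp, queue = [(r, c)], deque([(r, c)])
                seen[r, c] = True
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in offsets:
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny, nx] and not seen[ny, nx]):
                            seen[ny, nx] = True
                            comp.append((ny, nx))
                            queue.append((ny, nx))
                if len(comp) < min_island_size:
                    for y, x in comp:
                        out[y, x] = False
    return out


mask = np.array([[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 0, 0]], dtype=bool)
cleaned = drop_small_islands(mask, min_island_size=2)  # lone pixel at (1, 3) dropped
```

Pixels dropped this way become "invalid" and are repaired by the gap-filling step.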

@@ -100,7 +100,7 @@ out = clean_array(
- `max_workers`: Parallelism for per-class operations (None lets the executor choose).
- `fill_nan`: If True, NaN values in the input array are filled with the nearest valid value.

Returns a numpy array matching the input shape. Integer inputs return integer outputs. Float arrays with `NaN` are supported and can be filled or remain as NAN.
Returns a numpy array matching the input shape and dtype. Float arrays with `NaN` are supported and can be filled or remain as `NaN`.

## Examples

@@ -156,10 +156,16 @@ See the notebooks folder for end-to-end examples:
[Colab_Button]: https://img.shields.io/badge/Try%20in%20Colab-grey?style=for-the-badge&logo=google-colab


## Changelog

Release notes and the full version history are kept in [CHANGELOG.md](https://github.com/DPIRD-DMA/MultiClean/blob/main/CHANGELOG.md).

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Maintainers: see [RELEASING.md](https://github.com/DPIRD-DMA/MultiClean/blob/main/RELEASING.md) for how to cut a release.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
49 changes: 49 additions & 0 deletions RELEASING.md
@@ -0,0 +1,49 @@
# Releasing MultiClean

How to cut a new release to PyPI. The whole flow is driven from a single
`v*` git tag — `setuptools-scm` reads the version from the tag and
`pypa/gh-action-pypi-publish` ships the wheel.

## Cutting a release

1. **Update [`CHANGELOG.md`](CHANGELOG.md).**
Promote the `[Unreleased]` section to `[X.Y.Z] - YYYY-MM-DD` and add
a fresh empty `[Unreleased]` heading on top. Update the comparison
links at the bottom of the file to add the new version.

2. **Merge to `main`.**
Make sure the merge commit is the one you intend to release —
the tag will be cut from it.

3. **Tag and push.**
   ```bash
   git checkout main
   git pull
   git tag vX.Y.Z   # e.g. v0.3.0
   git push origin vX.Y.Z
   ```

4. **Approve the deployment.**
The push triggers the [`Publish to PyPI`](.github/workflows/publish.yml)
workflow. Open *Actions* on GitHub and click *Review pending
deployments → Approve and deploy* on the `pypi` environment when
prompted.

5. **Verify on PyPI.**
Within a minute or two `pip install multiclean==X.Y.Z` should
work and the project page on
<https://pypi.org/project/multiclean/> should show the new version.

## Notes

- **Versions come from tags.** `multiclean.__version__` and the wheel
filename both come from `setuptools-scm` reading the latest `v*` tag.
Don't hand-edit a version anywhere — bumping a tag is the entire
bump.
- **Pre-releases** (e.g. `v0.4.0rc1`) work the same way; PEP 440 markers
in the tag carry through.
- **Yanking a bad release** is done from the PyPI web UI, not from this
repo. The tag and CHANGELOG entry stay.
- **Hotfix on an older line** (e.g. `v0.3.1` while `main` is on `0.4.x`):
branch from the older tag, fix, tag `v0.3.1` on that branch, push the
tag. The same workflow handles it.
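
Tag-driven versioning assumes `setuptools-scm` is wired into the build backend. A typical configuration looks like the fragment below (an assumption for illustration; the project's actual `pyproject.toml` is not shown in this diff):

```toml
[build-system]
requires = ["setuptools>=64", "setuptools-scm>=8"]
build-backend = "setuptools.build_meta"

# No version key anywhere in the project table: setuptools-scm derives
# it from the latest v* tag, with a dev suffix on untagged commits.
[tool.setuptools_scm]
```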
4 changes: 3 additions & 1 deletion assets/Make README graphic.ipynb
@@ -2,6 +2,7 @@
"cells": [
{
"cell_type": "markdown",
"id": "7fb27b941602401d91542211134fc71a",
"metadata": {},
"source": [
"# Land Use README Graphic\n",
@@ -25,6 +26,7 @@
{
"cell_type": "code",
"execution_count": 2,
"id": "acae54e37e7d407bbb7b55eff062a284",
"metadata": {},
"outputs": [
{
@@ -82,7 +84,7 @@
"out_path = assets_dir / \"land_use_before_after.png\"\n",
"fig.savefig(out_path, dpi=220, bbox_inches=\"tight\")\n",
"print(\"Saved figure to\", out_path)\n",
"plt.show()\n"
"plt.show()"
]
}
],
5 changes: 4 additions & 1 deletion multiclean/__init__.py
@@ -1,4 +1,7 @@
from .__version__ import __version__
from importlib.metadata import version

from .multiclean import clean_array

__version__ = version("multiclean")

__all__ = ["clean_array", "__version__"]
1 change: 0 additions & 1 deletion multiclean/__version__.py

This file was deleted.

77 changes: 35 additions & 42 deletions multiclean/multiclean.py
@@ -2,12 +2,7 @@

import numpy as np

from .utils import (
build_invalid_mask,
fill_invalids,
find_small_islands,
smooth_edges,
)
from .utils import build_invalid_mask, fill_invalids, smooth_edges_to_codes


def clean_array(
@@ -58,68 +53,66 @@ def clean_array(
if array.ndim != 2:
raise ValueError("Input array must be 2D")

is_float = np.issubdtype(array.dtype, np.floating)

all_class_values = np.unique(array).tolist()
# Remove NaN from class values if present
if np.issubdtype(array.dtype, np.floating):
if is_float:
all_class_values = [v for v in all_class_values if not np.isnan(v)]

if class_values is None:
target_class_values = all_class_values
target_class_values = list(all_class_values)
elif isinstance(class_values, int):
target_class_values = [class_values]
else:
if isinstance(class_values, int):
target_class_values = [class_values]
else:
target_class_values = list(class_values)
target_class_values = list(class_values)

background_class_values = list(set(all_class_values) - set(target_class_values))

if np.issubdtype(array.dtype, np.floating) and not fill_nan:
if is_float and not fill_nan:
nan_mask = np.isnan(array)
if nan_mask.any():
background_class_values.append(np.nan)
if not nan_mask.any():
nan_mask = None
else:
nan_mask = None

smoothed_labels = smooth_edges(
codes, code_to_value = smooth_edges_to_codes(
array=array,
smooth_edge_size=smooth_edge_size,
target_class_values=target_class_values,
background_class_values=background_class_values,
all_class_values=all_class_values,
max_workers=max_workers,
)

small_islands_by_class = find_small_islands(
smoothed_labels=smoothed_labels,
target_class_values=target_class_values,
# Find target codes (1..K) for the requested target classes.
classes_sorted = sorted(all_class_values)
value_to_code = {v: i + 1 for i, v in enumerate(classes_sorted)}
target_codes = [value_to_code[v] for v in target_class_values if v in value_to_code]

invalid_mask = build_invalid_mask(
codes=codes,
target_codes=target_codes,
min_island_size=min_island_size,
connectivity=connectivity,
max_workers=max_workers,
)

invalid_mask = build_invalid_mask(
smoothed_labels=smoothed_labels,
small_islands_by_class=small_islands_by_class,
)
codes = fill_invalids(codes, invalid_mask)

if not invalid_mask.any():
# Apply original NaN mask if present
if nan_mask is not None and nan_mask.any():
smoothed_labels[nan_mask] = np.nan
if np.issubdtype(array.dtype, np.integer):
return smoothed_labels.astype(array.dtype, copy=False)
return smoothed_labels

output = fill_invalids(
smoothed_labels=smoothed_labels,
invalid_mask=invalid_mask,
all_class_values=all_class_values,
)
# Decode codes back to class values via vectorised lookup. ``np.take``
# uses ``out`` so we can write directly into a typed buffer.
output = code_to_value[codes]

if is_float and nan_mask is not None:
# Restore original NaN positions when fill_nan=False (they were
# included in invalid_mask only to keep them off the fill-source set).
if not np.issubdtype(output.dtype, np.floating):
output = output.astype(np.float64)
output[nan_mask] = np.nan

# Convert back to original dtype if integer
if np.issubdtype(array.dtype, np.integer):
return output.astype(array.dtype, copy=False)
output = output.astype(array.dtype, copy=False)
elif is_float and output.dtype != array.dtype:
output = output.astype(array.dtype, copy=False)

# Apply original NaN mask if present
if nan_mask is not None and nan_mask.any():
output[nan_mask] = np.nan
return output
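
The decode-and-restore step at the end of `clean_array` can be illustrated in isolation (a simplified sketch with made-up values; `codes` stands in for the pipeline's integer output, and the lookup table here is illustrative):

```python
import numpy as np

# fill_nan=False path: remember where the NaNs were, let the pipeline
# work on integer codes, then decode and stamp NaN back so the original
# NaN positions survive the integer round trip.
arr = np.array([[10.0, np.nan], [20.0, 20.0]], dtype=np.float32)
nan_mask = np.isnan(arr)

codes = np.array([[1, 2], [2, 2]], dtype=np.uint8)        # pretend pipeline output
code_to_value = np.array([np.nan, 10.0, 20.0], dtype=np.float32)

output = code_to_value[codes]          # decode: class value at each code
output[nan_mask] = np.nan              # restore original NaN positions
output = output.astype(arr.dtype, copy=False)
```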
Empty file added multiclean/py.typed
Empty file.