roughclust

Rapidly sort a Pandas DataFrame of numeric values to minimize the spatial distance between adjacent elements.

When working with large dense datasets, with large numbers of both observations and features, it can be useful to perform linkage clustering to visualy inspect what groupings of correlated observations / features exist in the data.

The computational method most commonly used for this purpose, linkage clustering, scales exponentially with dataset size and quickly becomes impractical with 10,000's of items. While there is an accelerated version of linkage clustering provided by the fastcluster package, that method still performs the full linkage clustering process.

Another useful approach for sorting large dense tables is to use t-SNE or UMAP to project each item onto a single dimension. While this process can better approximate the desired outcome, it similarly incurs the all-by-all cost that scales exponentially with dataset size.

This method performs a rough approximation which is intended to provide a useful workaround for very large datasets. It will:

Select a random subset of the dataset (param: init_n)
Split the dataset into the initialization subset and the remainder.
Perform exhaustive linkage clustering on the initialization subset to sort it
Compute the distance of the items in the remainder to the items in the sorted table
Select a subset of those items with the lowest computed distances (param: iter_n), and move those items to the sorted table
Repeat steps 4-5 until the dataset is complete. Note that the distances need only be calculated for those items which were moved to the sorted table in the last iteration.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.vscode		.vscode
roughclust		roughclust
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
test_roughclust.py		test_roughclust.py
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

roughclust

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

roughclust

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages