Common clustering API (i.e., why aren't KShiftsClustering.jl, QuickShiftClustering.jl, SpectralClustering.jl here...?) #256

Open
Datseris opened this issue Jul 11, 2023 · 7 comments

@Datseris

We are developing front-end software that uses clustering algorithms as a small part of its infrastructure. Ideally, we would like users to be able to choose among many different clustering algorithms, especially because our application scenarios range from a medium number of data points (where DBSCAN is OK) to a very large number (where DBSCAN is too expensive).

At the moment, our users only have access to Clustering.jl as that's our only dependency. But I just checked and there are many packages with clustering. So I have to ask, why aren't these algorithms included here?

Or, better yet, why not define a common Clustering interface that all of these packages can satisfy and the user could just load the specific "clustering backend" and thus use any kind of clustering they want? All clustering has the same API, yet all packages use different names for the clustering functions like dbscan(data; kwargs...).

I think it would be great to define a function cluster(alg::ClusteringAlgorithm, data) and the instance of alg will have all keywords relevant to the specific algorithm.

cc @KalelR @rened @lucianolorenti

@alyst
Member

alyst commented Jul 11, 2023

But I just checked and there are many packages with clustering. So I have to ask, why aren't these algorithms included here?
Or, better yet, why not define a common Clustering interface that all of these packages can satisfy and the user could just load the specific "clustering backend" and thus use any kind of clustering they want?

I agree that the 2nd alternative (common interface) is preferable -- we don't want to assemble all possible clustering algorithms in Clustering.jl -- that would potentially bloat its dependencies, compilation times, and make maintenance harder. Plus, it requires that the package authors agree to transfer their code here.

The common interface has not been implemented so far because nobody has contributed code for it.
I think it is potentially a good idea to have such a unification layer, so a PR is welcome.
The only concern/constraint is how much it would complicate the existing code, as we would like to keep it simple.

I think it would be great to define a function cluster(alg::ClusteringAlgorithm, data) and the instance of alg will have all keywords relevant to the specific algorithm.

I am not 100% sure that all keywords have to go to the ClusteringAlgorithm.
My initial thought is that ClusteringAlgorithm should contain all algorithm-specific parameters, while parameters like the number of clusters and maybe the distance metric could be kept as function parameters.
The idea is that in most use cases a parameter-less constructor like KMeansAlgorithm() should be enough to initialize the algorithm.
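
A minimal sketch of this variant, to make the split concrete. Everything here is an assumption for illustration, not code from Clustering.jl: the KMeansAlgorithm fields, the nclusters keyword, and the toy 1-D Lloyd iteration are all hypothetical. The algorithm type holds only algorithm-specific settings, while the number of clusters is passed to cluster().

```julia
abstract type ClusteringAlgorithm end

# Hypothetical algorithm type: only algorithm-specific tuning lives here,
# so the parameter-less constructor KMeansAlgorithm() is usually enough.
Base.@kwdef struct KMeansAlgorithm <: ClusteringAlgorithm
    maxiter::Int = 100
end

# Toy 1-D Lloyd iteration. The number of clusters is a function keyword
# (an assumed name), not a field of the algorithm type.
function cluster(alg::KMeansAlgorithm, data::AbstractVector{<:Real}; nclusters::Integer)
    n = length(data)
    sorted = sort(data)
    # Deterministic start: one center per equal-sized slice of the sorted data.
    centers = [float(sorted[cld((2k - 1) * n, 2 * nclusters)]) for k in 1:nclusters]
    labels = zeros(Int, n)
    for _ in 1:alg.maxiter
        # Assign every point to its nearest center.
        new_labels = [argmin(abs.(centers .- x)) for x in data]
        new_labels == labels && break
        labels = new_labels
        # Move each non-empty cluster's center to the mean of its members.
        for k in 1:nclusters
            members = data[labels .== k]
            isempty(members) || (centers[k] = sum(members) / length(members))
        end
    end
    return labels
end

labels = cluster(KMeansAlgorithm(), [1.0, 1.1, 0.9, 5.0, 5.2, 4.9]; nclusters = 2)
# labels == [1, 1, 1, 2, 2, 2]
```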

@Datseris
Author

Datseris commented Jul 11, 2023

while parameters like the number of clusters and maybe the distance metric could be kept as function parameters

To the best of my knowledge, some clustering algorithms, such as DBSCAN, decide for themselves how many clusters there should be, which only reinforces the argument that this should be a parameter of the algorithm type.

I think it is potentially a good idea to have such a unification layer, so a PR is welcome.

Sure, but a PR where? Clustering.jl has too many dependencies to be used as the interface holder. We need a new package, but where should this package live?

@alyst
Member

alyst commented Jul 13, 2023

To the best of my knowledge, some clustering algorithms, such as DBSCAN, decide for themselves how many clusters there should be, which only reinforces the argument that this should be a parameter of the algorithm type.

Yes, the essential part of this effort is to have an interface that covers all/most different flavors of clustering problem specifications and still provides a convenient user API.
One approach, similar to the Metric type hierarchy, is to have abstract subtypes, e.g. FuzzyClusteringAlgorithm <: ClusteringAlgorithm, AffinityPropagationClusteringAlgorithm <: ClusteringAlgorithm etc.
Then it would be possible to define cluster() methods that require different keywords.
Such abstract subtypes would also be useful because different clustering algorithms produce different types of output.
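
A rough sketch of what such a hierarchy could look like. Only ClusteringAlgorithm and FuzzyClusteringAlgorithm are names from this thread; HardClusteringAlgorithm, the two toy algorithms, and their keywords are hypothetical. The point is that each family gets its own cluster() method with its own required keywords and its own output type.

```julia
abstract type ClusteringAlgorithm end
abstract type HardClusteringAlgorithm <: ClusteringAlgorithm end   # hypothetical name
abstract type FuzzyClusteringAlgorithm <: ClusteringAlgorithm end

# Toy algorithms, for illustration only.
struct ThresholdSplit <: HardClusteringAlgorithm end
struct SoftThresholdSplit <: FuzzyClusteringAlgorithm end

# Hard clustering: one integer label per point, one required keyword.
cluster(::ThresholdSplit, data::AbstractVector{<:Real}; threshold::Real) =
    [x <= threshold ? 1 : 2 for x in data]

# Fuzzy clustering: a points-by-clusters matrix of membership weights,
# with an additional required keyword.
function cluster(::SoftThresholdSplit, data::AbstractVector{<:Real};
                 threshold::Real, fuzziness::Real)
    w = @. 1 / (1 + exp((data - threshold) / fuzziness))  # weight of cluster 1
    return hcat(w, 1 .- w)
end

hard = cluster(ThresholdSplit(), [1.0, 4.0]; threshold = 2.0)
soft = cluster(SoftThresholdSplit(), [1.0, 4.0]; threshold = 2.0, fuzziness = 0.5)
```

Here hard is a vector of labels while soft is a matrix whose rows sum to one, which is the kind of output difference the abstract subtypes would document.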

Sure, but a PR where? Clustering.jl has too many dependencies to be used as the interface holder. We need a new package, but where should this package live?

Except for Distances.jl and NearestNeighbors.jl, all the dependencies are standard Julia packages, so essentially you have them already.

The new package, e.g. ClusteringAlgorithms.jl, is possible.
There are two ways it could be done:

  • ClusteringAlgorithms.jl is a hard dependency for all clustering packages, the dependent packages have to implement cluster() methods
  • ClusteringAlgorithms.jl is a weak dependency, cluster() methods are wrappers that call the original methods (e.g. kmeans()).
    Then Clustering.jl can have the ClusteringAlgorithmsExt extension that defines ClusteringAlgorithm-derived types for the clustering methods provided in Clustering.jl.

The downside of the latter approach is that the implementation is less straightforward.
The advantage is that you don't have to wait until a given clustering package implements the cluster() interface; you can define and use the wrapper on your side.
In terms of user API both approaches should be equivalent, and it should be possible to switch from the 2nd to the 1st in the future without breaking user API.

The new package doesn't have to live in a JuliaXXX organization, especially if it is a weak dependency.

@Datseris
Author

One approach, similar to the Metric type hierarchy, is to have abstract subtypes, e.g. FuzzyClusteringAlgorithm <: ClusteringAlgorithm, AffinityPropagationClusteringAlgorithm <: ClusteringAlgorithm etc.
Then it would be possible to define cluster() methods that require different keywords.

Not sure I agree with this. The point of an interface is that it doesn't change based on who participates in it. For an interface to be simple and intuitive, it has to be the same no matter the types. So again I would argue it is much better if all parameters go into the algorithm type and there is a single function

cluster(alg::ClusteringAlgorithm, data::AbstractArray{<:AbstractArray})

The function returns labels::AbstractArray{Int} of the same shape as data.
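
A minimal sketch of this single-function interface. The RadiusLinkage algorithm and its radius field are hypothetical and not taken from any existing package; it chains together points closer than radius, so, as with DBSCAN, the number of clusters comes out of the algorithm rather than being an input.

```julia
abstract type ClusteringAlgorithm end

# Hypothetical toy algorithm: every parameter is a field of the algorithm type.
Base.@kwdef struct RadiusLinkage <: ClusteringAlgorithm
    radius::Float64 = 1.0
end

# The single interface function: data is an array of points,
# the result is an array of integer labels with the same shape.
function cluster(alg::RadiusLinkage, data::AbstractArray{<:AbstractArray})
    labels = zeros(Int, size(data))   # 0 means "not yet assigned"
    current = 0
    for i in eachindex(data)
        labels[i] != 0 && continue
        current += 1                  # start a new cluster at point i
        labels[i] = current
        stack = [i]
        while !isempty(stack)         # flood-fill over points within `radius`
            j = pop!(stack)
            for k in eachindex(data)
                if labels[k] == 0 && sqrt(sum(abs2, data[j] .- data[k])) <= alg.radius
                    labels[k] = current
                    push!(stack, k)
                end
            end
        end
    end
    return labels
end

points = [[0.0, 0.0], [0.5, 0.0], [10.0, 10.0], [1.0, 0.0]]
labels = cluster(RadiusLinkage(radius = 0.6), points)
# labels == [1, 1, 2, 1]
```

Swapping in another algorithm would change only the first argument; the call and the shape of the returned labels stay the same.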


There are two ways it could be done:

I believe that there is a third way that is superior, as it utilizes the new Julia package extensions infrastructure. ClusteringAlgorithms.jl defines the interface and exports the cluster function. The same package also has the implementations as extensions in its ext folder, which get loaded conditionally depending on which backend is loaded. So, when doing

using ClusteringAlgorithms

you get the cluster function but nothing else. When you also do using Clustering, an algorithm type such as DBSCANClustering (for dbscan) comes into scope, which one can initialize and use with cluster.

In this way, all existing packages remain completely unaffected and ClusteringAlgorithms becomes nobody's dependency. One just has to open PRs against the common package ClusteringAlgorithms.jl to add implementations. Naturally, the downstream packages are encouraged to advertise the common interface of ClusteringAlgorithms.jl in their documentation.
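
The wrapper layout can be imitated in a single file with plain modules, as a sketch under stated assumptions. ToyBackend and its toy_split function are stand-ins for an existing package, ToySplit is a hypothetical algorithm type, and the third module only plays the role of a file in the ext folder. In a real package the backend would be listed under [weakdeps] and [extensions] in Project.toml, and the extension would be loaded automatically once both packages are loaded.

```julia
# Stand-in for an existing clustering package with its own function name
# and result type (hypothetical; real packages differ).
module ToyBackend
    struct ToyResult
        assignments::Vector{Int}
    end
    # Split 1-D data into two groups at `cut`.
    toy_split(data, cut) = ToyResult([x <= cut ? 1 : 2 for x in data])
end

# Hypothetical interface package: only the abstract type and the generic function.
module ClusteringAlgorithms
    export ClusteringAlgorithm, cluster
    abstract type ClusteringAlgorithm end
    function cluster end
end

# Plays the role of an extension module: a thin wrapper that translates
# the common call into the backend's own API.
module ClusteringAlgorithmsToyBackendExt
    using ..ClusteringAlgorithms
    import ..ToyBackend

    struct ToySplit <: ClusteringAlgorithm
        cut::Float64
    end

    ClusteringAlgorithms.cluster(alg::ToySplit, data) =
        ToyBackend.toy_split(data, alg.cut).assignments
end

using .ClusteringAlgorithms
const ToySplit = ClusteringAlgorithmsToyBackendExt.ToySplit
labels = cluster(ToySplit(3.0), [1.0, 2.0, 5.0])
# labels == [1, 1, 2]
```

One caveat with real package extensions: names defined inside an extension module are not automatically visible to users, which is why the sketch has to reach into the module for ToySplit. A possible workaround is to define the algorithm types in the interface package itself and keep only the cluster methods in the extensions.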

@alyst
Member

alyst commented Jul 14, 2023

The point of an interface is that it doesn't change based on who participates in it. For an interface to be simple and intuitive, it has to be the same no matter the types. So again I would argue it is much better if all parameters go into the algorithm type and there is a single function

I think we both agree that there are common parameters that are required for specific classes of clustering problems, and these parameters are distinct between these problem classes.
So one of the clustering interface goals is to make specification of the class-specific parameters simple and intuitive.
I am not against putting them into ClusteringAlgorithm subtypes. Most likely, all the benefits and downsides will become clearer once we have a working draft of the interface and an example that uses it (your clustering UI).

I believe that there is a third way that is superior as it utilizes the new Julia package extensions infrastructure.

But this is exactly the 2nd alternative I have described. :) It's great that we have the same vision!

@juliohm
Contributor

juliohm commented Jul 29, 2023

I fully support @Datseris's ideas and have had similar gripes with Clustering.jl multiple times. The MLJ.jl wrappers don't improve the situation either. Currently the end-user experience is extremely bad.

If you decide to move forward with this initiative, please let me know how I can help. I would also like to comment that the idea of having a centralized repository where the implementations are maintained together is good. Clustering.jl has this role today, but the lack of maintainers and "stuckness" of this package is really compromising progress here.

I would simply start a fresh repository in JuliaML (or any organization where one of us is an admin) and start writing the most common clustering algorithms in modern, idiomatic Julia. I am positive that the Julia ML community would jump in to help. This would be a good GSoC project, by the way.

@Datseris
Author

I've started a solution here: https://discourse.julialang.org/t/rfc-clusteringapi-jl/112258
