Scikit-learn's HDBSCAN Implementation #2031

MaartenGr · 2024-06-03T10:05:20Z

In a recent version of scikit-learn (v1.3, I believe), HDBSCAN was implemented with base functionality. Since scikit-learn is already a requirement of BERTopic, it stands to reason to use that implementation instead of the original one, as scikit-learn has more contributors. Moreover, this might alleviate common installation issues related to HDBSCAN.

There are a couple of issues worth mentioning:

- Calculation of probabilities (soft clustering) is, if I'm not mistaken, not implemented in scikit-learn's HDBSCAN
  - A solution would be to use cosine similarities as the default method of calculating probabilities
- The feature set is smaller than that of the original implementation
- Speed needs to be tested to determine whether the switch is worth it
- Accuracy, whatever that means in this context, might also need some exploration
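
To make the cosine-similarity idea from the first point concrete, here is a minimal sketch. It is not how BERTopic does it today: `cosine_topic_probabilities` is a hypothetical helper, and using mean embeddings as topic centroids and rescaling each row to sum to 1 are both assumptions on my part.

```python
import numpy as np


def cosine_topic_probabilities(embeddings, labels):
    """Approximate document-topic probabilities from cosine similarities.

    Hypothetical helper for illustration only; not part of BERTopic
    or scikit-learn.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)

    # Skip the noise label (-1) that HDBSCAN assigns to outliers.
    topics = np.array(sorted(t for t in np.unique(labels) if t != -1))
    if topics.size == 0:
        return topics, np.zeros((len(embeddings), 0))

    # Assumption: represent each topic by the mean embedding of its documents.
    centroids = np.vstack([embeddings[labels == t].mean(axis=0) for t in topics])

    # Cosine similarity is the dot product of L2-normalised vectors.
    def normalise(x):
        norms = np.linalg.norm(x, axis=1, keepdims=True)
        return x / np.clip(norms, 1e-12, None)

    sims = normalise(embeddings) @ normalise(centroids).T

    # Assumption: clip negative similarities and rescale each row to sum to 1
    # so the values can be read as probability-like scores.
    sims = np.clip(sims, 0.0, None)
    row_sums = sims.sum(axis=1, keepdims=True)
    probs = np.divide(sims, row_sums, out=np.zeros_like(sims), where=row_sums > 0)
    return topics, probs
```

Because the helper only needs embeddings and labels, it would work the same way regardless of which HDBSCAN implementation produced the labels, and outlier documents still receive a score for every topic.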

For those reading this, I'm interested to hear what you all think about this suggested change!

StarlightScribe · 2024-07-17T01:18:29Z

Anything that simplifies the dependency structure is progress. Would it affect the use of the cuML implementation? If not, you could leave that as the faster alternative.

MaartenGr · 2024-07-19T08:40:49Z

It doesn't affect the cuML implementation, since the scikit-learn version should be nothing more than a drop-in replacement for the original implementation. One major disadvantage is that soft clustering is currently not implemented in the scikit-learn version, which makes it difficult to return probabilities.

MaartenGr added the enhancement and question labels Jun 8, 2024

MaartenGr pinned this issue Jun 22, 2024
