
Flexibility of Cluster (-1) - Outliers Cluster #889

Closed
miguelfrutos opened this issue Dec 19, 2022 · 2 comments
miguelfrutos commented Dec 19, 2022

Hello everybody!

I've been experimenting with BERTopic recently, and once the model is trained and I visualize the number of docs in each cluster, the group with by far the most docs is -1 (the outlier cluster). If this model goes into production, many of the docs will therefore be treated as outliers.

  1. Is there any way to remove the clustering into the outlier (-1) category? Perhaps by assigning each outlier to its most similar cluster, even when there is not enough confidence.
  2. If not, how can I shrink the -1 cluster as much as possible? Perhaps with parameters such as min_cluster_size (HDBSCAN) or n_neighbors (UMAP).
  3. In the following repo, is cluster -1 counted in the evaluation with OCTIS?

Many thanks in advance!! 👍

Here is my model architecture:

from bertopic import BERTopic
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer

# Embedding model: See [1] for more details
embedding_model = SentenceTransformer("distiluse-base-multilingual-cased-v1")

# Clustering model: See [2] for more details
cluster_model = HDBSCAN(min_cluster_size=15,
                        metric='euclidean',
                        cluster_selection_method='eom',
                        prediction_data=True)

# BERTopic model
topic_model = BERTopic(embedding_model=embedding_model,
                       hdbscan_model=cluster_model,
                       language="multilingual")

# Fit the model on a corpus
topics, probs = topic_model.fit_transform(text)

# Topic reduction
topic_model.reduce_topics(text, nr_topics=30)

MaartenGr (Owner) commented

Thank you for the extensive description! There are a number of ways that you can deal with outliers. You can find the current methods here but let me go through them a bit more with respect to your use case/code.

For the topic representation itself, it is generally okay to have outliers; forcing large outlier groups into topics might actually hurt the topic representation step. For topic classification purposes, however, outliers can definitely be troublesome. A nice way to handle this is to set calculate_probabilities=True in BERTopic, which has HDBSCAN compute soft-cluster probabilities. Outliers are still generated, but you can assign them to clusters based on those probabilities: even though a document is considered an outlier, its probabilities across all clusters show which non-outlier topic suits it best. As mentioned in the FAQ, you can use it as follows:

import numpy as np
probability_threshold = 0.01
new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]

Moreover, it is possible to play around with both the min_samples and min_cluster_size parameters in HDBSCAN, as they control, to some degree, the number of outliers that are created.

Finally, and this might be helpful to you in a few weeks, the upcoming v0.13 release has a number of strategies that allow you to reduce the number of outliers after having trained your model. The benefit of this is that it allows you to quickly iterate over the possible strategies, their parameters, and possible combinations of strategies without having to re-train the model. You can find the upcoming release here, and you can already install it with:

pip install git+https://github.com/MaartenGr/BERTopic.git@refs/pull/840/head

With respect to the outlier reduction function, there are four strategies that you can use, namely reducing through topic probabilities, topic distributions, c-TF-IDF similarity, and topic embedding similarity. The general method for performing this is as follows:

from bertopic import BERTopic

# Train your BERTopic model
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Reduce outliers
new_topics = topic_model.reduce_outliers(docs, topics)

If you already want to play around with this feature, you can find some preliminary documentation here. I hope to release the v0.13 version in the first week of the new year.
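To make the "topic embedding similarity" strategy concrete, here is a toy, pure-NumPy sketch of the underlying idea (the embeddings, threshold value, and shapes are all made up for illustration; BERTopic's actual implementation differs): assign each -1 document to its most similar topic embedding, but only when the cosine similarity clears a threshold.

```python
import numpy as np

# Made-up 2-D "embeddings" for four documents and two topics
doc_embeddings = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [0.9, 0.1],
                           [0.1, 0.1]])
topics = np.array([0, 1, -1, -1])            # two docs start as outliers
topic_embeddings = np.array([[1.0, 0.0],     # topic 0
                             [0.0, 1.0]])    # topic 1

def cosine(a, b):
    # Pairwise cosine similarity between rows of a and rows of b
    return a @ b.T / (np.linalg.norm(a, axis=1, keepdims=True)
                      * np.linalg.norm(b, axis=1))

sims = cosine(doc_embeddings, topic_embeddings)
threshold = 0.8  # illustrative confidence cut-off
new_topics = topics.copy()
for i, t in enumerate(topics):
    if t == -1 and sims[i].max() >= threshold:
        new_topics[i] = sims[i].argmax()

print(new_topics)  # → [ 0  1  0 -1]
```

The third document is close enough to topic 0 to be reassigned, while the fourth stays an outlier because no topic clears the threshold; that is the trade-off the strategy parameters control.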


drob-xx commented Dec 19, 2022

Maarten's response is, as always, very useful. I'll add my two cents, as this is an area I've spent a considerable amount of time on.

As you probably already know, the number of -1 classifications you are seeing is a function of min_cluster_size, nr_topics, and of course the underlying structure of your corpus as encoded. It may not be obvious, but it is also affected by the setting of min_samples, as well as by whatever happens when UMAP runs: UMAP is stochastic, and run over run it will produce a different dimensionality reduction of the embeddings. (NOTE: I am assuming here that, besides the code you've provided, you are running BERTopic in a 'default' configuration.) min_samples is by default set to whatever min_cluster_size is; you can read more about that here if you haven't already come up to speed on HDBSCAN tuning. I've written previously about UMAP instability in #831.

In your case there are two different mechanisms contributing to the number of -1's: on one side, everything HDBSCAN-related (the reduced-dimension vectors and the HDBSCAN settings), and on the other, the BERTopic topic-reduction functionality that is triggered by setting nr_topics.

As Maarten has pointed out, one way to address this is to play with min_cluster_size and min_samples. The issue is that this is non-trivial, since there can easily be hundreds of different combinations, and the relationship between the settings is not "linear" in the sense that you cannot easily narrow down a combination that works. If a setting like 30,5 is better than 40,8, you cannot assume that 30,6 will be worse or that 30,4 will be better. You essentially have to try a great number of combinations to find optimized values.

I have written a package that allows the user to do just that: TopicTuner, which is meant to make this process straightforward, comprehensive, and easy to implement. It is fully documented, and there is a notebook to get you going. Let me know if my design criteria have not been achieved and I will address those issues.

TopicTuner effectively bypasses the need to set nr_topics. The reduce-topics functionality of BERTopic is quite handy and simple to use (if you think you really know the optimal number of topics in your corpus), but it comes at the cost of cutting through the HDBSCAN clusters with a different approach that tries to guess at the most likely "super clusters" (my shorthand for the underlying topic centroid/cosine-based strategy it employs; I'm sure Maarten has a better descriptor, and I may be misrepresenting the algorithm). Unfortunately, for those of us wanting to preserve as many of the original documents as possible in the topic-model representation, it has a tendency to label a really large number of documents as -1. This is because of the smart, conservative approach Maarten took: preferring to leave a document unlabeled (-1) rather than risk assigning it to a cluster it doesn't actually belong to.

Yet the issue is that in practice, not only do we get lots of -1's in the output, but those -1's (in many cases I've seen) are not evenly balanced over the corpus. This can, and often does, result in entire sets of documents that would comprise a logical cluster being knocked out, as well as smaller clusters becoming more prominent, and so on. TopicTuner avoids this because it lets you quickly find HDBSCAN parameters that yield a rational number of topics without having to resort to the carnage that BERTopic topic reduction can create.

You can read more about the UMAP stochastic/instability issue at #831 and TopicTuner in #788. I hope this is useful.
