Flexibility of Cluster (-1) - Outliers Cluster #889
Comments
Thank you for the extensive description! There are a number of ways that you can deal with outliers. You can find the current methods here, but let me go through them a bit more with respect to your use case/code.

For the topic representation itself, it is generally okay if there are outliers, as including large outliers might actually hurt the topic representation step. However, for topic classification purposes this can definitely be troublesome. A nice way to handle this is by setting a probability threshold and only assigning a document to a topic when its highest topic probability clears that threshold:

```python
import numpy as np

probability_threshold = 0.01
new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]
```

Moreover, it is possible to play around with the parameters of the underlying clustering model.

Finally, and this might be helpful to you in a few weeks, the upcoming v0.13 release has a number of strategies that allow you to reduce the number of outliers after having trained your model. The benefit of this is that it allows you to quickly iterate over the possible strategies, their parameters, and possible combinations of strategies without having to re-train the model. You can find the upcoming release here, and you can already install it with:

```bash
pip install git+https://github.com/MaartenGr/BERTopic.git@refs/pull/840/head
```

With respect to the outlier reduction function, there are four strategies that you can use, namely reducing outliers through topic probabilities, topic distributions, c-TF-IDF similarity, and topic embedding similarity. The general method for performing this is as follows:

```python
from bertopic import BERTopic

# Train your BERTopic model
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Reduce outliers
new_topics = topic_model.reduce_outliers(docs, topics)
```

If you already want to play around with this feature, you can find some preliminary documentation here. I hope to release v0.13 in the first week of the new year.
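To make the embedding-similarity strategy concrete, here is a toy, self-contained sketch of the underlying idea: re-assign each outlier document to its most similar topic embedding, but only when the cosine similarity clears a threshold. This is a plain-NumPy illustration on synthetic data, not BERTopic's actual implementation; the embeddings, topic labels, and threshold below are all made up.

```python
import numpy as np

# Synthetic stand-ins for document embeddings, topic embeddings, and
# topic assignments produced by a trained model (-1 marks outliers).
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(6, 4))    # 6 documents, 4-dim embeddings
topic_embeddings = rng.normal(size=(3, 4))  # 3 topics
topics = [0, -1, 1, -1, 2, -1]
threshold = 0.1                             # minimum cosine similarity to re-assign

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

new_topics = []
for topic, emb in zip(topics, doc_embeddings):
    if topic != -1:
        new_topics.append(topic)            # keep existing assignments untouched
        continue
    sims = np.array([cosine(emb, t) for t in topic_embeddings])
    best = int(np.argmax(sims))
    # Re-assign the outlier only when its best match is similar enough.
    new_topics.append(best if sims[best] >= threshold else -1)
```

The thresholded re-assignment is what keeps this from simply forcing every outlier into a topic: documents that are not close to any topic embedding stay at -1.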
Maarten's response is, as always, very useful. I'll add my two cents, as this is an area I've spent a considerable amount of time on.

As you probably already know, the number of -1 classifications you are seeing is a function of the clustering configuration. In your case there are two different mechanisms contributing to the number of -1's: all of the HDBSCAN-related settings/context (the reduced-dimension vectors and the HDBSCAN settings) on one side, and BERTopic's topic reduction functionality on the other.

As Maarten has pointed out, one way to address this would be to play with the HDBSCAN parameters. I have written a package that allows the user to do just that: TopicTuner, which is meant to make this process very straightforward, comprehensive, and easy to implement. It is fully documented, and there is a notebook to get you going. Let me know if my design criteria have not been achieved and I will address those issues.

TopicTuner effectively bypasses the need to use BERTopic's topic reduction. The issue is that in practice not only do we get lots of -1's in the output, but those -1's (in many cases I've seen) are not evenly balanced over the corpus. This can (and often does) result in entire sets of documents that would comprise a logical cluster being knocked out, smaller clusters becoming more prominent, and so on. TopicTuner avoids this because it allows you to quickly come up with HDBSCAN parameters that yield a rational number of topics without having to resort to the carnage that BERTopic topic reduction can create.

You can read more about the UMAP stochastic/instability issue at #831 and TopicTuner in #788. I hope this is useful.
Hello everybody!
I've been experimenting with BERTopic recently, and once the model is trained and I visualize the number of docs that each cluster contains, the group with the most docs by far is -1 (the outlier cluster). Therefore, if this model goes into production, many of the docs will be classified as outliers.
Many thanks in advance!! 👍
Here is my model architecture: