
topic -1 #485 (Closed)

roikremer opened this issue Mar 26, 2022 · 17 comments
@roikremer

I use BERTopic to analyze social media text, dividing the documents into topics and looking for common and important words in each division. In some cases my model input is 150,000 docs, and after the transform the frequency of topic -1 is very high (35%-40%).

So I want to know what exactly topic -1 is and what causes it.

Thank you all

@MaartenGr (Owner)

A feature of HDBSCAN is that it does not force data points into clusters but recognizes that some could be outliers. Often, outliers are found when using reduced sentence embeddings together with HDBSCAN. In practice, this should not be an issue, as it helps to accurately extract the topic representations: since there is little noise in the clusters, it becomes easier to extract relevant words.

There are two ways to deal with many outliers. First, use calculate_probabilities=True to generate a document-topic probability matrix when using transform. That way, we can assign a topic to documents that are initially predicted to be in topic -1. You can use np.argmax combined with probs to force certain documents out of the outlier topic.

In other words, you could use the resulting probs to assign each document in topic -1 to a non-outlier topic by simply taking the argmax for each document. It can be used like this:

import numpy as np

# Assign each document its most probable topic, but only if that probability
# clears the threshold; otherwise keep it in the outlier topic (-1)
probability_threshold = 0.01
new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1
              for prob in probs]
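For completeness, a minimal sketch of how probs is obtained in the first place (assuming docs is your list of documents):

from bertopic import BERTopic

# calculate_probabilities=True makes fit_transform/transform also return a
# document-topic probability matrix alongside the hard topic assignments
topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)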

Second, in the FAQ you can find instructions for reducing the number of outliers. Do note that lowering the number of outliers comes with a risk: if you remove all outliers in the training step, it is likely that clusters will contain much more noise, which could hurt the topic representation.
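As a rough illustration of that trade-off, one common way to reduce the number of outliers is to pass a less conservative HDBSCAN model to BERTopic (a sketch only, assuming a BERTopic version that accepts a custom hdbscan_model; the parameter values are placeholders that need tuning):

from hdbscan import HDBSCAN
from bertopic import BERTopic

# A lower min_samples (relative to min_cluster_size) makes HDBSCAN less conservative,
# so fewer documents are labeled as noise and end up in topic -1
hdbscan_model = HDBSCAN(min_cluster_size=30, min_samples=5,
                        metric="euclidean", prediction_data=True)
topic_model = BERTopic(hdbscan_model=hdbscan_model, calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)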

@shimonShouei

Hi, I tried the solution of assigning topics by the probabilities, and I was surprised to find that BERTopic can assign a text to one topic while the maximum probability for that text points to another topic. What can I conclude from such a situation?

@MaartenGr (Owner)

Yes, that might happen. I believe there is some instability between fitting the HDBSCAN model (hard clustering) and extracting the probabilities from it (soft clustering). From what I could gather reading through the HDBSCAN repo, although this might happen, the clusters a document is assigned to should still be similar.

You can find a bit more information about that here.
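To make the hard- versus soft-clustering distinction concrete, here is a small sketch using the hdbscan library directly (reduced_embeddings is a placeholder for the dimensionality-reduced document embeddings; all_points_membership_vectors requires prediction_data=True):

import numpy as np
import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True).fit(reduced_embeddings)

hard_labels = clusterer.labels_                                # hard clustering (-1 = outlier)
soft_probs = hdbscan.all_points_membership_vectors(clusterer)  # soft clustering

# For non-outlier points, the most probable cluster does not always match the hard label
mask = hard_labels != -1
mismatches = np.sum(np.argmax(soft_probs[mask], axis=1) != hard_labels[mask])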

@aileen-reichelt

Hello, I would also like to use your proposed solution of forcing documents out of the outlier topic by using np.argmax combined with probs. Is there any way to then pass these new_topics back to the model, so that functions like the visualization functions or get_topic_info take into account the new topic assignments? Thank you in advance.

@MaartenGr (Owner)

@aileen-reichelt That is possible by using the .update_topics() function:

def update_topics(self,
                  docs: List[str],
                  topics: List[int],
                  n_gram_range: Tuple[int, int] = None,
                  vectorizer_model: CountVectorizer = None):

Here, you can pass in your updated topics, docs, and any changes to the vectorizer if you want. Note that the difficulty with doing so is that when we force outliers into topics it is very likely this will decrease the quality of these topic representations.
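For example, a minimal sketch of how the re-assigned topics from the earlier snippet could be passed back (assuming topic_model, docs, and new_topics are already defined):

from sklearn.feature_extraction.text import CountVectorizer

# Recompute the topic representations with the re-assigned topics;
# a different vectorizer can optionally be passed in at the same time
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")
topic_model.update_topics(docs, new_topics, vectorizer_model=vectorizer_model)

Note that, as discussed further down in this thread, this updates the topic representations but not the topic frequencies.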

@roik2500

Hi,
First of all, thank you for your response. We have tried to implement your solution to the topic -1 problem, as mentioned in the first comment, and the result was the same: before your solution we got a huge number of posts in topic 0, and after applying your solution with argmax we got a large number of posts in topic -1 (pictures attached).

My questions are: is there another solution to this problem? Do you have any recommendations for model optimization? Our data is from Reddit. To optimize the model we vary n_neighbors and min_topic_size and select the best parameters by coherence.

Thank you,
Roi

Before: [image]

After: [image]

@MaartenGr (Owner)

@roik2500 For your particular use case, it might be interesting to use k-Means instead of HDBSCAN as that will result in fewer outliers being created. You can find the corresponding tutorial here.
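For reference, a minimal sketch of what that could look like (assuming BERTopic v0.10 or newer, where other clustering models can be passed through the hdbscan_model parameter; the number of clusters is only a placeholder):

from sklearn.cluster import KMeans
from bertopic import BERTopic

# k-Means assigns every document to a cluster, so no -1 (outlier) topic is produced
cluster_model = KMeans(n_clusters=50)
topic_model = BERTopic(hdbscan_model=cluster_model)
topics, _ = topic_model.fit_transform(docs)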

@roik2500

Hi @MaartenGr,
I have implemented your suggestion of using k-Means instead of HDBSCAN, and I got the error below. Can you help me with it? Maybe I did something wrong. My code is attached.
Thank you

The code: [image]

The error: [image]

@MaartenGr (Owner)

@roik2500 Based on your output, my guess would be that you are not using BERTopic v0.10. In that version, the feature of using different clustering models was introduced. Upgrading the version with pip install --upgrade bertopic should solve your issue.

@Spekkboom

Spekkboom commented May 24, 2022

Hi @MaartenGr,

I have assigned the outliers generated by my model (topic -1) to clusters based on a threshold, as you suggested above. However, when I pass docs and topics_adjusted to the update_topics() function, the topic counts are not affected at all, and neither are the topic representations. The original topics list has around 30,000 outliers, while the topics_adjusted list has only around 4,000, so the topic counts should definitely be affected by this. The code I used is below:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Re-assign outliers whose best topic probability clears the threshold
prob_threshold = 0.01
new_topics = [np.argmax(prob) if max(prob) >= prob_threshold else -1 for prob in probs]

# Copy the original topics and replace only the outliers (-1) with the new assignments
topics_adjusted = list(topics)
for idx, topic in enumerate(topics_adjusted):
    if topic == -1:
        topics_adjusted[idx] = new_topics[idx]

vect_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")
topic_model.update_topics(docs, topics_adjusted, vectorizer_model=vect_model)
topic_model.get_topic_info()

@MaartenGr (Owner)

@Spekkboom To also update the counts aside from the topic representation, you will have to run the following:

import pandas as pd
import numpy as np

# Extract new topics
probability_threshold = 0.01
new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]

# Update the internal topic representation of the topics
# NOTE: You can skip this step if you do not want to change the topic representations
topic_model.update_topics(docs, new_topics)

# Update topic frequencies
documents = pd.DataFrame({"Document": docs, "Topic": new_topics})
topic_model._update_topic_size(documents)

._update_topic_size() has been kept separate from .update_topics() on purpose, as updating topics was initially meant to update the topic representation only and not the corresponding frequencies. Updating the frequencies automatically might make it less transparent what the model initially produced as output.

@MaartenGr (Owner)

Since this issue has not seen activity in a while, I am going ahead and closing it. If, however, you want to re-open this issue, please let me know!

@Syarotto

Syarotto commented Jul 28, 2022

Hi!

I wonder if it is possible to remove the -1 topic entirely by setting the threshold to 0 and getting the full list of new topics? It looks like the model does not allow using .update_topics() when the number of topics changes, and I could only use .get_topic_info() to get the new distribution of labels.

And just to confirm, does the probability of topics for each document change through the process?

Thank you!

@MaartenGr (Owner)

@Syarotto Yes, by setting the threshold to 0, all documents will be assigned to a non-outlier topic. The problem, however, with assigning all documents to a non-outlier topic in the model itself is that it will likely hurt the topic descriptions, as outliers are then found within the clusters, which typically adds noise. For that reason, .update_topics() does not update the internal topic representations based on the updated topics. Instead, I would advise skipping HDBSCAN and using something like k-Means, which does not assume outliers in the data.
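Concretely, with a threshold of 0 the earlier snippet reduces to the following (a sketch, assuming probs is available):

import numpy as np

# Every document is assigned its most probable topic, so none remain in topic -1
new_topics = [int(np.argmax(prob)) for prob in probs]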

> And just to confirm, does the probability of topics for each document change through the process?

That depends on the process you are referring to. Having said that, the probability should remain stable after training.

@Syarotto

Thank you for confirming! It looks like k-Means does not have probabilities associated with it, if I understood correctly. Does that mean we have to choose between the outlier topic and the probabilities?

@MaartenGr (Owner)

> Thank you for confirming! It looks like k-Means does not have probabilities associated with it, if I understood correctly.

Although it does not generate probabilities, you could calculate the distance between the point and the cluster's center as a proxy for that. How you would like to use that depends on your use case and whether you want some sort of normalization procedure for calculating the distances.
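As a rough sketch of that idea with scikit-learn's k-Means (embeddings is a placeholder for the document vectors used for clustering, and the normalization shown is only one possible choice):

import numpy as np
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=50).fit(embeddings)

# .transform() returns the distance from each document to every cluster center
distances = kmeans.transform(embeddings)

# Turn distances into pseudo-probabilities: closer centers get higher scores.
# Shifting by the row minimum keeps the exponentials numerically well-behaved.
scores = np.exp(-(distances - distances.min(axis=1, keepdims=True)))
pseudo_probs = scores / scores.sum(axis=1, keepdims=True)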

> Does that mean we have to choose between the outlier topic and the probabilities?

I am not sure, but I believe that scikit-learn currently has no clustering models that both return probabilities and model no outliers. However, you can use most clustering models in BERTopic, so if you find any that support both, you could use those.

@Syarotto

That's great to know, thank you so much for your help :)
