
topic -1 #485 (Closed)

roikremer opened this issue Mar 26, 2022 · 17 comments
@roikremer

I use BERTopic to analyze social media text, dividing the documents into topics and looking for common and important words in each division. In some cases my model input is 150,000 docs, and after the transform the frequency of topic -1 is very high (35%-40%).

So I want to know what exactly topic -1 is and what causes it.

Thank you all

@MaartenGr (Owner)

A feature of HDBSCAN is that it does not force data points into clusters but recognizes that some could be outliers. Often, outliers are found when using reduced sentence embeddings together with HDBSCAN. In practice, this should not be an issue, as it helps to accurately extract the topic representations: since there is little noise in the clusters, it becomes easier to extract relevant words.

There are two ways to deal with many outliers. First, use calculate_probabilities=True to generate a document-topic probability matrix when using transform. That way, we can assign a topic to documents that are initially predicted to be in topic -1. You can use np.argmax combined with probs to force certain documents out of the outlier topic.

In other words, you could use the resulting probs to assign each document in topic -1 to a non-outlier topic by simply taking the argmax for each document. It can be used like this:

import numpy as np

# Assign each document its most probable topic, but only if that probability
# clears the threshold; otherwise keep it in the outlier topic (-1)
probability_threshold = 0.01
new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1
              for prob in probs]
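For completeness, a minimal sketch of how probs is obtained in the first place (assuming docs is your list of documents):

from bertopic import BERTopic

# calculate_probabilities=True makes fit_transform/transform also return a
# document-topic probability matrix alongside the hard topic assignments
topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)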

Second, in the FAQ you can find instructions for reducing the number of outliers. Do note that lowering the number of outliers comes with a risk: if you remove all outliers in the training step, it is likely that clusters will contain much more noise, which could hurt the topic representation.
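As a rough illustration of that trade-off, one common way to reduce the number of outliers is to pass a less conservative HDBSCAN model to BERTopic (a sketch only, assuming a BERTopic version that accepts a custom hdbscan_model; the parameter values are placeholders that need tuning):

from hdbscan import HDBSCAN
from bertopic import BERTopic

# A lower min_samples (relative to min_cluster_size) makes HDBSCAN less conservative,
# so fewer documents are labeled as noise and end up in topic -1
hdbscan_model = HDBSCAN(min_cluster_size=30, min_samples=5,
                        metric="euclidean", prediction_data=True)
topic_model = BERTopic(hdbscan_model=hdbscan_model, calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)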

@shimonShouei

Hi, I tried the solution of assigning topics by the probabilities, and I was surprised to find that BERTopic can assign a text to one topic while the maximum probability for that text points to another topic. What can I conclude from such a situation?

@MaartenGr (Owner)

Yes, that might happen. I believe there is some instability between fitting the HDBSCAN model (hard clustering) and extracting the probabilities from it (soft clustering). From what I could gather reading through the HDBSCAN repo, although this might happen, the clusters a document is assigned to should still be similar.

You can find a bit more information about that here.
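To make the hard- versus soft-clustering distinction concrete, here is a small sketch using the hdbscan library directly (reduced_embeddings is a placeholder for the dimensionality-reduced document embeddings; all_points_membership_vectors requires prediction_data=True):

import numpy as np
import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True).fit(reduced_embeddings)

hard_labels = clusterer.labels_                                # hard clustering (-1 = outlier)
soft_probs = hdbscan.all_points_membership_vectors(clusterer)  # soft clustering

# For non-outlier points, the most probable cluster does not always match the hard label
mask = hard_labels != -1
mismatches = np.sum(np.argmax(soft_probs[mask], axis=1) != hard_labels[mask])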

@aileen-reichelt

Hello, I would also like to use your proposed solution of forcing documents out of the outlier topic by using np.argmax combined with probs. Is there any way to then pass these new_topics back to the model, so that functions like the visualization functions or get_topic_info take into account the new topic assignments? Thank you in advance.

@MaartenGr (Owner)

@aileen-reichelt That is possible by using the .update_topics() function:

def update_topics(self,
                  docs: List[str],
                  topics: List[int],
                  n_gram_range: Tuple[int, int] = None,
                  vectorizer_model: CountVectorizer = None):

Here, you can pass in your updated topics, docs, and any changes to the vectorizer if you want. Note that the difficulty with doing so is that when we force outliers into topics it is very likely this will decrease the quality of these topic representations.
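For example, a minimal sketch of how the re-assigned topics from the earlier snippet could be passed back (assuming topic_model, docs, and new_topics are already defined):

from sklearn.feature_extraction.text import CountVectorizer

# Recompute the topic representations with the re-assigned topics;
# a different vectorizer can optionally be passed in at the same time
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")
topic_model.update_topics(docs, new_topics, vectorizer_model=vectorizer_model)

Note that, as discussed further down in this thread, this updates the topic representations but not the topic frequencies.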

@roik2500

Hi,
First of all, thank you for your response. We have tried to implement your solution to the topic -1 problem, as mentioned in the first comment, and the result was the same: before your solution we got a huge number of posts in topic 0, and after applying your solution with argmax we got a large number of posts in topic -1 (pictures attached).

My questions are: is there another solution to this problem? Do you have any recommendations for model optimization? Our data is from Reddit. To optimize the model we vary n_neighbors and min_topic_size and select the best parameters by coherence.

Thank you,
Roi

Before: [image]

After: [image]

@MaartenGr (Owner)

@roik2500 For your particular use case, it might be interesting to use k-Means instead of HDBSCAN as that will result in fewer outliers being created. You can find the corresponding tutorial here.
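For reference, a minimal sketch of what that could look like (assuming BERTopic v0.10 or newer, where other clustering models can be passed through the hdbscan_model parameter; the number of clusters is only a placeholder):

from sklearn.cluster import KMeans
from bertopic import BERTopic

# k-Means assigns every document to a cluster, so no -1 (outlier) topic is produced
cluster_model = KMeans(n_clusters=50)
topic_model = BERTopic(hdbscan_model=cluster_model)
topics, _ = topic_model.fit_transform(docs)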

@roik2500

Hi @MaartenGr,
I have implemented your suggestion of using k-Means instead of HDBSCAN, and I got the error below. Can you help me with it? Maybe I did something wrong. My code is attached.
Thank you

The code: [image]

The error: [image]

@MaartenGr (Owner)

@roik2500 Based on your output, my guess would be that you are not using BERTopic v0.10. In that version, the feature of using different clustering models was introduced. Upgrading the version with pip install --upgrade bertopic should solve your issue.

@Spekkboom

Spekkboom commented May 24, 2022

Hi @MaartenGr,

I have assigned the outliers generated by my model (topic -1) to clusters based on a threshold, as you suggested above. However, when I pass docs and topics_adjusted to the update_topics() function, the topic counts are not affected at all, and neither are the topic representations. The original topics list has around 30,000 outliers, while the topics_adjusted list has only around 4,000, so the topic counts should definitely be affected by this. The code I used is below:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Re-assign outliers whose best topic probability clears the threshold
prob_threshold = 0.01
new_topics = [np.argmax(prob) if max(prob) >= prob_threshold else -1 for prob in probs]

# Copy the original topics and replace only the outliers (-1) with the new assignments
topics_adjusted = list(topics)
for idx, topic in enumerate(topics_adjusted):
    if topic == -1:
        topics_adjusted[idx] = new_topics[idx]

vect_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")
topic_model.update_topics(docs, topics_adjusted, vectorizer_model=vect_model)
topic_model.get_topic_info()

@MaartenGr (Owner)

@Spekkboom To also update the counts aside from the topic representation, you will have to run the following:

import pandas as pd
import numpy as np

# Extract new topics
probability_threshold = 0.01
new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]

# Update the internal topic representation of the topics
# NOTE: You can skip this step if you do not want to change the topic representations
topic_model.update_topics(docs, new_topics)

# Update topic frequencies
documents = pd.DataFrame({"Document": docs, "Topic": new_topics})
topic_model._update_topic_size(documents)

._update_topic_size() has been kept separate from .update_topics() on purpose, as updating topics was initially meant to update the topic representation only and not the corresponding frequencies. Updating the frequencies automatically might make it less transparent what the model initially produced as output.

@MaartenGr (Owner)

Since this issue has not seen activity in a while, I am going ahead and closing it. If, however, you want to re-open this issue, please let me know!

@Syarotto

Syarotto commented Jul 28, 2022

Hi!

I wonder if it is possible to remove the -1 topic entirely by setting the threshold to 0 and getting the full list of new topics? It looks like the model does not allow using .update_topics() when the number of topics changes, and I could only use .get_topic_info() to get the new distribution of labels.

And just to confirm, does the probability of topics for each document change through the process?

Thank you!

@MaartenGr (Owner)

@Syarotto Yes, by setting the threshold to 0, all documents will be assigned to a non-outlier topic. The problem, however, with assigning all documents to a non-outlier topic in the model itself is that it will likely hurt the topic descriptions, as outliers are then found within the clusters, which typically adds noise. For that reason, .update_topics() does not update the internal topic representations based on the updated topics. Instead, I would advise skipping HDBSCAN and using something like k-Means, which does not assume outliers in the data.
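Concretely, with a threshold of 0 the earlier snippet reduces to the following (a sketch, assuming probs is available):

import numpy as np

# Every document is assigned its most probable topic, so none remain in topic -1
new_topics = [int(np.argmax(prob)) for prob in probs]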

> And just to confirm, does the probability of topics for each document change through the process?

That depends on the process you are referring to. Having said that, the probability should remain stable after training.

@Syarotto

Thank you for confirming! It looks like k-Means does not have probabilities associated with it, if I understood correctly. Does that mean we have to choose between the outlier topic and the probabilities?

@MaartenGr (Owner)

> Thank you for confirming! It looks like k-Means does not have probabilities associated with it, if I understood correctly.

Although it does not generate probabilities, you could calculate the distance between the point and the cluster's center as a proxy for that. How you would like to use that depends on your use case and whether you want some sort of normalization procedure for calculating the distances.
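As a rough sketch of that idea with scikit-learn's k-Means (embeddings is a placeholder for the document vectors used for clustering, and the normalization shown is only one possible choice):

import numpy as np
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=50).fit(embeddings)

# .transform() returns the distance from each document to every cluster center
distances = kmeans.transform(embeddings)

# Turn distances into pseudo-probabilities: closer centers get higher scores.
# Shifting by the row minimum keeps the exponentials numerically well-behaved.
scores = np.exp(-(distances - distances.min(axis=1, keepdims=True)))
pseudo_probs = scores / scores.sum(axis=1, keepdims=True)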

> Does that mean we have to choose between the outlier topic and the probabilities?

I am not sure, but I believe that scikit-learn currently has no clustering models that both return probabilities and model no outliers. However, you can use most clustering models in BERTopic, so if you find any that support both, you could use those.

@Syarotto

That's great to know, thank you so much for your help :)
