bertopic version 0.16.0 - probs are empty when executing with zero_shot #1962

Open · amitca71 opened this issue May 2, 2024 · 8 comments

@amitca71 commented May 2, 2024

When executing zero-shot topic modeling, `probs` is empty in the following call:

```python
topics, probs = topic_model.fit_transform(docs, embeddings)
```

This is the configuration:

```python
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=200, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
vectorizer_model = CountVectorizer(stop_words="english")
zero_shot_topics_list = ["vitiligo"]

embedding_model_name = "thenlper/gte-base"
embedding_model = SentenceTransformer(embedding_model_name)
embeddings = embedding_model.encode(docs, show_progress_bar=True)

topic_model = BERTopic(
    # Pipeline models
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    zeroshot_topic_list=zero_shot_topics_list,
    zeroshot_min_similarity=.8,
    calculate_probabilities=True,
    representation_model=representation_model,  # defined elsewhere in the original snippet
    # Hyperparameters
    top_n_words=10,
    verbose=True,
)

topics, probs = topic_model.fit_transform(docs, embeddings)
```

@MaartenGr (Owner)

That's correct. The probabilities are not calculated with zero-shot topic modeling. Instead, you will have to run `.transform` to get the probabilities, which should be fast considering you have already precalculated the embeddings.
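A minimal sketch of that follow-up call, assuming the same `docs` and `embeddings` objects used during fitting:

```python
# Reusing the precomputed embeddings means .transform skips re-encoding the
# documents, so only the (fast) assignment step runs.
topics, probs = topic_model.transform(docs, embeddings)
```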

@amitca71 (Author) commented May 8, 2024

While the predictions are about 80% similar, the probabilities are on a different scale when executing `transform` (and much higher). I wanted to compare against the initial `fit_transform` output to investigate it. Any idea?
I also see that the probabilities do not match the selected topics. I have the following code:

```python
topics, probs = topic_model.transform(documents=docs)
```

The topic that is shown is not the one with the highest probability. For example:

```
topics[0] => 0
topics => array([ 0,  0,  0,  1, 16,  2,  9,  2,  0,  2,  4,  2, 32, 24, 11, 29,  1,
        5, 17,  7,  0,  5,  5, 11,  0, 16, 40,  0,  0,  1,  1,  0,  1,  3,
        1,  1,  0,  0,  1,  1,  9, 28, 23,  1, 17,  0,  0,  0,  0,  0,  1,
       14, 14,  2, 26,  0,  1, 34,  6,  1, 19,  1,  1,  0, 15,  1, 29, 16,
        0, 16,  7,  0, 26, 10,  0,  2,  3,  0,  2, 37,  1, 42,  0,  1,  2,
        0, 14,  1, 17, 11,  0,  0,  0,  7,  2,  7, 42,  0,  2,  1])
np.argmax(probs[0]) => 1
probs[0] => array([0.8559377 , 0.86592937, 0.82342434, 0.8237496 , 0.8335248 ,
       0.83235073, 0.8062134 , 0.8262592 , 0.8173883 , 0.81415665,
       0.80990183, 0.80163646, 0.80752075, 0.80812716, 0.8293375 ,
       0.8245102 , 0.78974223, 0.84975386, 0.8073633 , 0.83375347,
       0.8317704 , 0.8277918 , 0.81950617, 0.8009686 , 0.8113255 ,
       0.8158666 , 0.81716   , 0.78191996, 0.8411818 , 0.8127432 ,
       0.8328281 , 0.8182113 , 0.8217473 , 0.81575316, 0.80379105,
       0.81936455, 0.83533347, 0.83695006, 0.8331021 , 0.81047076,
       0.8139418 , 0.8273035 , 0.8156991 , 0.8255366 , 0.81797945],
      dtype=float32)
```

When looking at the output of `fit_transform`, it looks good.

@MaartenGr (Owner)

> While the predictions are about 80% similar, the probabilities are on a different scale when executing `transform` (and much higher). I wanted to compare against the initial `fit_transform` output to investigate it. Any idea?

If you load and save the model, then you should use pickle to make sure the same underlying models are being used. Do note that .transform will still be an approximation since HDBSCAN does not create probabilities as part of its training process. It is merely a step after assignment to create the approximation. Hence, as long as you are using HDBSCAN, they might not fit the predictions perfectly.
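For reference, a minimal save-and-reload sketch using pickle serialization (the file name here is just an example):

```python
from bertopic import BERTopic

# Pickle serialization stores the exact fitted sub-models (UMAP, HDBSCAN, ...),
# which is what matters for getting consistent .transform output after loading.
topic_model.save("topic_model.pkl", serialization="pickle")
loaded_model = BERTopic.load("topic_model.pkl")
topics, probs = loaded_model.transform(docs, embeddings)
```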

@Bougeant

@MaartenGr,

As of bertopic 0.16.2, it seems that the probabilities returned in the zero-shot case are actually cosine similarities, not real probabilities. This can be seen in @amitca71's example above, where all the "probability" values are close to 0.8 and the sum of the "probabilities" is 36.9.

The way this works is that:

  1. When fitting the model, we call the `_combine_zeroshot_topics` method.
  2. The `_combine_zeroshot_topics` method sets `self.hdbscan_model` to be a `BaseCluster`.
  3. When we call `transform`, we test if `self.hdbscan_model` is a `BaseCluster`, and if that's the case, we set the probabilities to be the cosine similarity matrix (see the sketch below).
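A simplified sketch of that branch (illustrative only, not the actual BERTopic source; the function name is made up):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def transform_with_base_cluster(doc_embeddings, topic_embeddings):
    """Stand-in for the BaseCluster path of .transform described above."""
    sim_matrix = cosine_similarity(doc_embeddings, topic_embeddings)
    predictions = np.argmax(sim_matrix, axis=1)
    # The returned "probabilities" are the raw cosine similarities, which is
    # why they hover around 0.8 and do not sum to 1.
    return predictions, sim_matrix
```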

Can you confirm that this is indeed the intended behavior? Or am I doing something wrong?

@MaartenGr (Owner)

@Bougeant

> As of bertopic 0.16.2, it seems that the probabilities returned in the zero-shot case are actually cosine similarities, not real probabilities.

I'm not sure, but this behavior is not new in 0.16.2; it was always there. Checking the source code of 0.16.0, this was already the case. Or do you see different behavior between versions?

> This can be seen in @amitca71's example above, where all the "probability" values are close to 0.8 and the sum of the "probabilities" is 36.9.

That example is not from 0.16.2, I believe, since that version was released after the post.

> Can you confirm that this is indeed the intended behavior? Or am I doing something wrong?

That is indeed the intended behavior. Since there are two different processes for generating the underlying topics, the probabilities cannot be merged (they differ in distribution, in what they represent, etc.). Instead, using cosine similarities as a proxy for the probabilities generalizes well across these two different processes, as it makes no assumptions about the underlying training process, whereas a cluster model does.

> real probabilities.

If you are using HDBSCAN, then the probabilities it generates are approximations and not part of the underlying training process. So whether you use HDBSCAN or cosine similarities, they are both approximations/proxies.

@Bougeant

Thanks for your reply @MaartenGr. This behaviour is indeed not specific to 0.16.2. I guess I was hoping I was doing something wrong and could get probabilities out of a model trained with a zero-shot topic list.

I understand that HDBSCAN predictions and predicted probabilities are approximations, as is the case with all machine learning models.

However, I disagree that cosine similarities can be seen as probabilities, since they lack the basic properties for this: they do not sum to 1 (which I guess can easily be fixed by dividing all values by their sum), and they are not properly calibrated (i.e. their values do not represent the relative frequency with which one would expect a given sample to belong to a topic). This is very obvious in @amitca71's example, where the normalised cosine similarities for all 45 topics would all fall between roughly 2.1% and 2.3%.
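A quick sketch of that normalisation, applied to the `probs` array from the example above, shows how flat the result is:

```python
import numpy as np

sims = probs[0]                 # cosine similarities for one document, 45 topics
normalised = sims / sims.sum()  # rescale so the values sum to 1
print(sims.sum())               # ~36.9
print(normalised.min(), normalised.max())  # ~0.021 and ~0.023: nearly uniform
```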

I wonder if we can first use a zero-shot topic list to generate labels and then use a supervised topic model (trained on those labels) to compute topic probabilities. Any thoughts on this?

@MaartenGr (Owner)

@Bougeant

> Thanks for your reply @MaartenGr. This behaviour is indeed not specific to 0.16.2. I guess I was hoping I was doing something wrong and could get probabilities out of a model trained with a zero-shot topic list.

Thanks for sharing! Then I know that the behavior is fortunately still as intended.

> I understand that HDBSCAN predictions and predicted probabilities are approximations, as is the case with all machine learning models.

My apologies, that's not what I meant. What typically happens with machine learning models is that the underlying probabilities are part of the fitting and assignment process, which means the resulting probabilities are directly related to the training procedure. This is not the case with HDBSCAN, which generates the probabilities after having created the clusters and the respective assignments. As such, these probabilities are a proxy of the underlying training procedure, the same sort of post-hoc process as using cosine similarities. Both try to imitate probabilities/similarities/assignments after the fact. So in both cases the output probabilities are not part of the training process and are even more of an approximation than probabilities already are by themselves.

> However, I disagree that cosine similarities can be seen as probabilities, since they lack the basic properties for this: they do not sum to 1 (which I guess can easily be fixed by dividing all values by their sum), and they are not properly calibrated (i.e. their values do not represent the relative frequency with which one would expect a given sample to belong to a topic). This is very obvious in @amitca71's example, where the normalised cosine similarities for all 45 topics would all fall between roughly 2.1% and 2.3%.

As you mention, the cosine similarities can be turned into something closer to probabilities with additional processing on the user's side: for instance, applying a softmax, the normalisation you mentioned, or even the soft-clustering approach in HDBSCAN.
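Two of those options sketched on the user's side (assuming `sim_matrix` holds the cosine similarities as in the earlier sketch, and that a fitted HDBSCAN clusterer such as the `hdbscan_model` from the first comment is available; neither step is built into the zero-shot pipeline):

```python
import hdbscan
from scipy.special import softmax

# Option 1: softmax over similarities; each row then sums to 1, although
# similarities that all sit near 0.8 will still give a flat distribution.
probabilities = softmax(sim_matrix, axis=1)

# Option 2: HDBSCAN soft clustering; requires the clusterer to have been
# fitted with prediction_data=True.
soft_probs = hdbscan.all_points_membership_vectors(hdbscan_model)
```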

> I wonder if we can first use a zero-shot topic list to generate labels and then use a supervised topic model (trained on those labels) to compute topic probabilities. Any thoughts on this?

That's definitely possible, and I believe it is indeed mostly implemented as supervised BERTopic. There is also the `.approximate_distribution` approach if you are only interested in the distributions.
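A minimal usage sketch of that approach on an already fitted model:

```python
# approximate_distribution computes per-document topic distributions without
# refitting; the second return value (token-level distributions) is only
# populated when calculate_tokens=True.
topic_distr, _ = topic_model.approximate_distribution(docs)
print(topic_distr.shape)  # (n_documents, n_topics)
```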

@Bougeant

Thanks for the clarification @MaartenGr. This is very helpful.
