bertopic version 0.16.0 - probs are empty when executing with zero_shot #1962

Open · amitca71 opened this issue May 2, 2024 · 8 comments

@amitca71 commented May 2, 2024

When executing zero-shot topic modeling, `probs` is empty in the following call:

```python
topics, probs = topic_model.fit_transform(docs, embeddings)
```

This is the configuration:

```python
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=200, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
vectorizer_model = CountVectorizer(stop_words="english")
zero_shot_topics_list = ["vitiligo"]

embedding_model_name = "thenlper/gte-base"
embedding_model = SentenceTransformer(embedding_model_name)
embeddings = embedding_model.encode(docs, show_progress_bar=True)

topic_model = BERTopic(
    # Pipeline models
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    zeroshot_topic_list=zero_shot_topics_list,
    zeroshot_min_similarity=.8,
    calculate_probabilities=True,
    representation_model=representation_model,  # defined elsewhere in the original snippet
    # Hyperparameters
    top_n_words=10,
    verbose=True,
)

topics, probs = topic_model.fit_transform(docs, embeddings)
```

@MaartenGr (Owner)

That's correct. The probabilities are not calculated with zero-shot topic modeling. Instead, you will have to run `.transform` to get the probabilities, which should be fast considering you have already precalculated the embeddings.
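A minimal sketch of that follow-up call, assuming the same `docs` and `embeddings` objects used during fitting:

```python
# Reusing the precomputed embeddings means .transform skips re-encoding the
# documents, so only the (fast) assignment step runs.
topics, probs = topic_model.transform(docs, embeddings)
```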

@amitca71 (Author) commented May 8, 2024

While the predictions are about 80% similar, the probabilities are on a different scale when executing `transform` (and much higher). I wanted to compare against the initial `fit_transform` output to investigate it. Any idea?
I also see that the probabilities do not match the selected topics. I have the following code:

```python
topics, probs = topic_model.transform(documents=docs)
```

The topic that is shown is not the one with the highest probability. For example:

```
topics[0] => 0
topics => array([ 0,  0,  0,  1, 16,  2,  9,  2,  0,  2,  4,  2, 32, 24, 11, 29,  1,
        5, 17,  7,  0,  5,  5, 11,  0, 16, 40,  0,  0,  1,  1,  0,  1,  3,
        1,  1,  0,  0,  1,  1,  9, 28, 23,  1, 17,  0,  0,  0,  0,  0,  1,
       14, 14,  2, 26,  0,  1, 34,  6,  1, 19,  1,  1,  0, 15,  1, 29, 16,
        0, 16,  7,  0, 26, 10,  0,  2,  3,  0,  2, 37,  1, 42,  0,  1,  2,
        0, 14,  1, 17, 11,  0,  0,  0,  7,  2,  7, 42,  0,  2,  1])
np.argmax(probs[0]) => 1
probs[0] => array([0.8559377 , 0.86592937, 0.82342434, 0.8237496 , 0.8335248 ,
       0.83235073, 0.8062134 , 0.8262592 , 0.8173883 , 0.81415665,
       0.80990183, 0.80163646, 0.80752075, 0.80812716, 0.8293375 ,
       0.8245102 , 0.78974223, 0.84975386, 0.8073633 , 0.83375347,
       0.8317704 , 0.8277918 , 0.81950617, 0.8009686 , 0.8113255 ,
       0.8158666 , 0.81716   , 0.78191996, 0.8411818 , 0.8127432 ,
       0.8328281 , 0.8182113 , 0.8217473 , 0.81575316, 0.80379105,
       0.81936455, 0.83533347, 0.83695006, 0.8331021 , 0.81047076,
       0.8139418 , 0.8273035 , 0.8156991 , 0.8255366 , 0.81797945],
      dtype=float32)
```

When looking at the output of `fit_transform`, it looks good.

@MaartenGr (Owner)

> While the predictions are about 80% similar, the probabilities are on a different scale when executing `transform` (and much higher). I wanted to compare against the initial `fit_transform` output to investigate it. Any idea?

If you load and save the model, then you should use pickle to make sure the same underlying models are being used. Do note that .transform will still be an approximation since HDBSCAN does not create probabilities as part of its training process. It is merely a step after assignment to create the approximation. Hence, as long as you are using HDBSCAN, they might not fit the predictions perfectly.
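For reference, a minimal save-and-reload sketch using pickle serialization (the file name here is just an example):

```python
from bertopic import BERTopic

# Pickle serialization stores the exact fitted sub-models (UMAP, HDBSCAN, ...),
# which is what matters for getting consistent .transform output after loading.
topic_model.save("topic_model.pkl", serialization="pickle")
loaded_model = BERTopic.load("topic_model.pkl")
topics, probs = loaded_model.transform(docs, embeddings)
```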

@Bougeant

@MaartenGr,

As of bertopic 0.16.2, it seems that the probabilities returned in the zero-shot case are actually cosine similarities, not real probabilities. This can be seen in @amitca71's example above, where all the "probability" values are close to 0.8 and the sum of the "probabilities" is 36.9.

The way this works is that:

  1. When fitting the model, we call the `_combine_zeroshot_topics` method.
  2. The `_combine_zeroshot_topics` method sets `self.hdbscan_model` to be a `BaseCluster`.
  3. When we call `transform`, we test if `self.hdbscan_model` is a `BaseCluster`, and if that's the case, we set the probabilities to be the cosine similarity matrix (see the sketch below).
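A simplified sketch of that branch (illustrative only, not the actual BERTopic source; the function name is made up):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def transform_with_base_cluster(doc_embeddings, topic_embeddings):
    """Stand-in for the BaseCluster path of .transform described above."""
    sim_matrix = cosine_similarity(doc_embeddings, topic_embeddings)
    predictions = np.argmax(sim_matrix, axis=1)
    # The returned "probabilities" are the raw cosine similarities, which is
    # why they hover around 0.8 and do not sum to 1.
    return predictions, sim_matrix
```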

Can you confirm that this is indeed the intended behavior? Or am I doing something wrong?

@MaartenGr (Owner)

@Bougeant

> As of bertopic 0.16.2, it seems that the probabilities returned in the zero-shot case are actually cosine similarities, not real probabilities.

I'm not sure, but this behavior is not new in 0.16.2; it was always there. Checking the source code of 0.16.0, this was already the case. Or do you see different behavior between versions?

> This can be seen in @amitca71's example above, where all the "probability" values are close to 0.8 and the sum of the "probabilities" is 36.9.

That example is not from 0.16.2, I believe, since that version was released after the post.

> Can you confirm that this is indeed the intended behavior? Or am I doing something wrong?

That is indeed the intended behavior. Since there are two different processes for generating the underlying topics, the probabilities cannot be merged (they differ in distribution, in what they represent, etc.). Instead, using cosine similarities as a proxy for the probabilities generalizes well across these two different processes, as it makes no assumptions about the underlying training process, whereas a cluster model does.

> real probabilities.

If you are using HDBSCAN, then the probabilities it generates are approximations and not part of the underlying training process. So whether you use HDBSCAN or cosine similarities, they are both approximations/proxies.

@Bougeant

Thanks for your reply @MaartenGr. This behaviour is indeed not specific to 0.16.2. I guess I was hoping I was doing something wrong and could get probabilities out of a model trained with a zero-shot topic list.

I understand that HDBSCAN predictions and predicted probabilities are approximations, as is the case with all machine learning models.

However, I disagree that cosine similarities can be seen as probabilities, since they lack the basic properties for this: they do not sum to 1 (which I guess can easily be fixed by dividing all values by their sum), and they are not properly calibrated (i.e. their values do not represent the relative frequency with which one would expect a given sample to belong to a topic). This is very obvious in @amitca71's example, where the normalised cosine similarities for all 45 topics would all fall between roughly 2.1% and 2.3%.
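A quick sketch of that normalisation, applied to the `probs` array from the example above, shows how flat the result is:

```python
import numpy as np

sims = probs[0]                 # cosine similarities for one document, 45 topics
normalised = sims / sims.sum()  # rescale so the values sum to 1
print(sims.sum())               # ~36.9
print(normalised.min(), normalised.max())  # ~0.021 and ~0.023: nearly uniform
```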

I wonder if we can first use a zero-shot topic list to generate labels and then use a supervised topic model (trained on those labels) to compute topic probabilities. Any thoughts on this?

@MaartenGr (Owner)

@Bougeant

> Thanks for your reply @MaartenGr. This behaviour is indeed not specific to 0.16.2. I guess I was hoping I was doing something wrong and could get probabilities out of a model trained with a zero-shot topic list.

Thanks for sharing! Then I know that the behavior is fortunately still as intended.

> I understand that HDBSCAN predictions and predicted probabilities are approximations, as is the case with all machine learning models.

My apologies, that's not what I meant. What typically happens with machine learning models is that the underlying probabilities are part of the fitting and assignment process, which means the resulting probabilities are directly related to the training procedure. This is not the case with HDBSCAN, which generates the probabilities after having created the clusters and the respective assignments. As such, these probabilities are a proxy of the underlying training procedure, the same sort of post-hoc process as using cosine similarities. Both try to imitate probabilities/similarities/assignments after the fact. So in both cases the output probabilities are not part of the training process and are even more of an approximation than probabilities already are by themselves.

> However, I disagree that cosine similarities can be seen as probabilities, since they lack the basic properties for this: they do not sum to 1 (which I guess can easily be fixed by dividing all values by their sum), and they are not properly calibrated (i.e. their values do not represent the relative frequency with which one would expect a given sample to belong to a topic). This is very obvious in @amitca71's example, where the normalised cosine similarities for all 45 topics would all fall between roughly 2.1% and 2.3%.

As you mention, the cosine similarities can be turned into something closer to probabilities with additional processing on the user's side: for instance, applying a softmax, the normalisation you mentioned, or even the soft-clustering approach in HDBSCAN.
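Two of those options sketched on the user's side (assuming `sim_matrix` holds the cosine similarities as in the earlier sketch, and that a fitted HDBSCAN clusterer such as the `hdbscan_model` from the first comment is available; neither step is built into the zero-shot pipeline):

```python
import hdbscan
from scipy.special import softmax

# Option 1: softmax over similarities; each row then sums to 1, although
# similarities that all sit near 0.8 will still give a flat distribution.
probabilities = softmax(sim_matrix, axis=1)

# Option 2: HDBSCAN soft clustering; requires the clusterer to have been
# fitted with prediction_data=True.
soft_probs = hdbscan.all_points_membership_vectors(hdbscan_model)
```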

> I wonder if we can first use a zero-shot topic list to generate labels and then use a supervised topic model (trained on those labels) to compute topic probabilities. Any thoughts on this?

That's definitely possible, and I believe it is indeed mostly implemented as supervised BERTopic. There is also the `.approximate_distribution` approach if you are only interested in the distributions.
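A minimal usage sketch of that approach on an already fitted model:

```python
# approximate_distribution computes per-document topic distributions without
# refitting; the second return value (token-level distributions) is only
# populated when calculate_tokens=True.
topic_distr, _ = topic_model.approximate_distribution(docs)
print(topic_distr.shape)  # (n_documents, n_topics)
```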

@Bougeant

Thanks for the clarification @MaartenGr. This is very helpful.
