bertopic version 0.16.0 - probs are empty when executing with zero_shot #1962
Comments
That's correct. The probabilities are not calculated with zero-shot topic modeling. Instead, you will have to run `.transform` to get them.
When looking at the output of `fit_transform`, it looks good. But when executing `topics, probs = topic_model.transform(documents=docs)`, the probabilities are on a different scale (and much higher), while the predictions are about 80% similar. I wanted the initial probabilities so I could investigate it. Any idea?
If you save and load the model, then you should use …
As of bertopic, the way this works is that: …
Can you confirm that this is indeed the intended behavior? Or am I doing something wrong?
I'm not sure, but this behavior is not new as of 0.16.2; it was always there. Checking the source code of 0.16.0, this was always the case. Or do you see different behavior between versions?
I believe that example is not from 0.16.2, since that version was released after the post.
That is indeed the intended behavior. Since there are two different processes generating the underlying topics, the probabilities cannot be merged (they differ in distributions, in what they represent, etc.). Instead, using cosine similarities to proxy the probabilities generalizes well to these two different processes, as it makes no assumptions about the underlying training process, whereas a cluster model does.
If you are using HDBSCAN, then the probabilities it generates are approximations and not part of the underlying training process. So whether you use HDBSCAN or cosine similarities, they are both approximations/proxies.
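To make the proxy idea concrete, here is a minimal NumPy sketch (not BERTopic internals; all names and the toy embeddings are illustrative) of how cosine similarities between document embeddings and topic embeddings can stand in for assignment scores:

```python
import numpy as np

def cosine_similarity_matrix(doc_embeddings, topic_embeddings):
    """Cosine similarity between each document and each topic embedding."""
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    topics = topic_embeddings / np.linalg.norm(topic_embeddings, axis=1, keepdims=True)
    return docs @ topics.T

# Toy example: 3 documents, 2 zero-shot topics, 4-dim embeddings
doc_embeddings = np.array([[1.0, 0.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0, 0.0],
                           [0.7, 0.7, 0.0, 0.0]])
topic_embeddings = np.array([[1.0, 0.0, 0.0, 0.0],
                             [0.0, 1.0, 0.0, 0.0]])

sims = cosine_similarity_matrix(doc_embeddings, topic_embeddings)
assignments = sims.argmax(axis=1)  # each doc goes to its most similar topic
print(sims)
print(assignments)
```

Note that the similarity scores carry no distributional assumptions: they work the same whether the topic embedding came from a zero-shot label or from a cluster centroid, which is the point being made above.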
Thanks for your reply @MaartenGr. This behaviour is indeed not specific to 0.16.2. I guess I was hoping I was doing something wrong and could get probabilities out of a model trained with a zero-shot topic list. I understand that HDBSCAN predictions and predicted probabilities are approximations, as is the case with all machine learning models. However, I disagree that cosine similarities can be seen as probabilities, since they lack the basic properties for this: they do not sum to 1 (which I guess can easily be fixed by dividing all values by their sum) and they are not properly calibrated (i.e. their values do not represent the relative frequency with which one would expect a given sample to belong to a topic). This is very obvious in @amitca71's example, where the normalised cosine similarities for all 45 topics would fall between 2.1% and 2.3%. I wonder if we can first use a zero-shot topic list to generate labels and then use a supervised topic model (trained on those labels) to compute topic probabilities. Any thoughts on this?
Thanks for sharing! Then I know that the behavior is fortunately still as intended.
My apologies, that's not what I meant. What typically happens with machine learning models is that the underlying probabilities are part of the fitting and assignment process. This means that the resulting probabilities are directly related to the training procedure. This is not the case with HDBSCAN, which generates the probabilities after having created the clusters and the respective assignments. As such, these probabilities are a proxy of the underlying training procedure and therefore the same sort of process as using cosine similarities. Both are post-hoc processes that try to imitate probabilities/similarities/assignments. So the output probabilities are in both cases not part of the training process, and are more approximations than true probabilities to begin with.
As you mention, the cosine similarities by themselves can be turned into probabilities with additional processing done by the user: for instance, applying softmax, the method you mentioned, or even the soft clustering approach in HDBSCAN.
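A minimal sketch of the post-processing mentioned above (the toy similarity values are made up; neither option is what BERTopic does internally):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Cosine similarities for one document against 3 topics
sims = np.array([0.82, 0.79, 0.35])

# Option 1: divide by the sum, as suggested in the thread
normalized = sims / sims.sum()

# Option 2: softmax, which accentuates gaps between scores
probs = softmax(sims)

print(normalized, probs)  # both rows sum to 1
```

Either way, when the raw similarities are tightly clustered (as in the 2.1%–2.3% example above), the resulting values remain nearly uniform, which illustrates the calibration concern rather than resolving it.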
That's definitely possible, and I believe it is indeed mostly implemented in supervised BERTopic. There is also the …
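The two-step idea raised above (generate labels with a zero-shot pass, then fit a supervised model to get probabilities that sum to 1) can be sketched with scikit-learn alone. The synthetic embeddings and labels below are stand-ins for real document embeddings and zero-shot assignments, and this is a sketch of the approach, not BERTopic's supervised mode itself:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Stand-ins: 300 "document embeddings" and the labels a zero-shot pass produced
n_docs, dim, n_topics = 300, 16, 3
centers = rng.normal(size=(n_topics, dim))
labels = rng.integers(0, n_topics, size=n_docs)  # pretend zero-shot assignments
embeddings = centers[labels] + 0.5 * rng.normal(size=(n_docs, dim))

# Step 2: fit a supervised classifier on those labels ...
clf = LogisticRegression(max_iter=1000).fit(embeddings, labels)

# ... and read off per-document topic probabilities that sum to 1
probs = clf.predict_proba(embeddings)
print(probs.shape, probs.sum(axis=1)[:3])
```

Note that `predict_proba` from a plain logistic regression is still not guaranteed to be well calibrated; scikit-learn's `CalibratedClassifierCV` could be layered on top if calibration matters.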
Thanks for the clarification @MaartenGr. This is very helpful.
When executing zero-shot, the following `probs` is empty:
topics, probs = topic_model.fit_transform(docs, embeddings)
With the following configuration:
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=200, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
vectorizer_model = CountVectorizer(stop_words="english")
zero_shot_topics_list = ["vitiligo"]
embedding_model_name = "thenlper/gte-base"
embedding_model = SentenceTransformer(embedding_model_name)
embeddings = embedding_model.encode(docs, show_progress_bar=True)
topic_model = BERTopic(
    # Pipeline models
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    zeroshot_topic_list=zero_shot_topics_list,
    zeroshot_min_similarity=.8,
    calculate_probabilities=True,
    representation_model=representation_model,  # defined elsewhere
    # Hyperparameters
    top_n_words=10,
    verbose=True
)
topics, probs = topic_model.fit_transform(docs, embeddings)