Probabilities loaded from a saved model not the same as probabilities after fit_transform() #769
The probabilities that are generated are not the ones on which HDBSCAN bases its assignments. They are approximated after the assignments have already been created. As a result, the two may not agree, since they are produced by different processes. You can find more about that here.
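To make the distinction concrete, here is a minimal sketch (my illustration, not from this thread) using the hdbscan library directly: the soft membership vectors are approximated after clustering, so their argmax need not match the hard labels HDBSCAN assigned.

```python
import numpy as np
import hdbscan
from sklearn.datasets import make_blobs

# Toy data; any embedding matrix would do here.
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)

# prediction_data=True is required for the soft-clustering utilities.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True).fit(X)

# Soft membership vectors are approximated *after* the hard assignment.
soft = hdbscan.all_points_membership_vectors(clusterer)
argmax_labels = np.argmax(soft, axis=1)

# The hard labels and the argmax of the soft vectors can disagree;
# outliers are excluded since label -1 has no column in the membership matrix.
mask = clusterer.labels_ != -1
print("agreement:", np.mean(argmax_labels[mask] == clusterer.labels_[mask]))
```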
Just popping in to say that I ran into this same phenomenon today and was pretty puzzled (topic assignments are not necessarily equal to the topics for which a doc has the highest probability). I followed the advice above from @MaartenGr to better understand how the topic-assignment process is distinct from the probability-generating process. To link two resources that I think delineate it quite well:

Still stewing and percolating on my understanding of it, so I'll hold off on trying to give a summary. For now, I'd like to run some systematic manual checks to see which method of topic assignment I find more compelling when it comes to the actual text data, and which topic the documents fit better with (the one assigned by HDBSCAN, or the one indicated by the highest probability). A preliminary look points towards highest probability working better with my intuition of what each topic means, but obviously your mileage may vary. Would love to know if @MaartenGr has experience on this, especially if there are any performance metrics he's tested along these lines.
@brhkim It is indeed tricky that the probabilities do not directly match the cluster-creation process. In practice, which method works best really depends on your use case. In general, I would advise focusing on the direct assignments rather than the probabilities: the direct assignments are at the core of the model and best represent the internal processes of HDBSCAN. With respect to performance metrics, this also depends highly on the use case. Are you looking to minimize outliers, calculate topic coherence, topic quality, or something else? These questions change the performance metric entirely, so it would be misleading to give a one-dimensional view of what works better.
@MaartenGr Yes, that all totally reads -- would be nice to have a straightforward path, but rarely ever the case in these instances! Helpful to hear your intuition about it, in any case. Thanks for taking the time to respond, and thanks as always for this incredible tool.
Oh, and maybe more directly to this issue (and related ones that continue to pop up), I do wonder if (a) adding some documentation on this dynamic could be helpful, and (b) exposing a top-probability assignment as an explicit output, in addition to the HDBSCAN cluster assignment (perhaps only when the probabilities output argument is on), might head off some of this confusion even in circumstances where people don't reference the documentation?
@brhkim Adding something to the documentation would definitely be helpful. Some careful thought needs to go into where it is best put. It might be worthwhile to put it in the FAQ, as it might clutter some of the other guides.
I am not sure if I understand you correctly, but the flat probabilities are calculated out of the box. They get replaced, however, with topic-document probabilities if you use `calculate_probabilities=True`.
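As a sketch of that difference (my illustration; it assumes the BERTopic version discussed in this thread, where flat probabilities are returned by default):

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all",
                          remove=("headers", "footers", "quotes")).data[:1000]

# Default: the flat HDBSCAN probabilities, one value per document
# (the strength of membership in the assigned cluster).
topics, flat_probs = BERTopic().fit_transform(docs)
print(flat_probs.shape)  # (n_docs,)

# With calculate_probabilities=True: a full document-topic matrix.
topics, full_probs = BERTopic(calculate_probabilities=True).fit_transform(docs)
print(full_probs.shape)  # (n_docs, n_topics)
```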
@MaartenGr Right right, but my understanding is that the flat probabilities provided out of the box are generated separately from the topic-document probabilities. I guess my thinking is that this mismatch between topic-document probabilities and the assigned HDBSCAN cluster only pops up to the user in the event that they've elected to use `calculate_probabilities=True`.

To my suggestion (a) above for documentation, I agree that adding a section in the FAQ makes sense, but maybe referencing and linking that section any time you mention the option for `calculate_probabilities` would also be helpful.

To my suggestion (b) above, exposing a top-probability assignment alongside the cluster assignment in the output would make the distinction visible even to users who skip the documentation.
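A quick way to check the mismatch described here is to compare the assigned topics against the argmax of the probability matrix. A minimal sketch (mine, not from the thread), assuming `calculate_probabilities=True`:

```python
import numpy as np
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all",
                          remove=("headers", "footers", "quotes")).data[:1000]

topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)

# Topic implied by the highest probability per document...
argmax_topics = np.argmax(probs, axis=1)

# ...versus the topic HDBSCAN actually assigned. Outliers (topic -1)
# have no column in probs, so compare only the assigned documents.
assigned = np.array(topics) != -1
agreement = np.mean(argmax_topics[assigned] == np.array(topics)[assigned])
print(f"agreement on assigned documents: {agreement:.2%}")
```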
I am not sure about doing too much cross-referencing, as that might pull the user away in a guide that is not directly about those probabilities but merely mentions them. However, making sure there are sufficient places where these things are mentioned would definitely help!
All in all, I agree. It would be helpful if the effect of using `calculate_probabilities` were spelled out more clearly in the documentation.
* Add representation models
  * bertopic.representation.KeyBERTInspired
  * bertopic.representation.PartOfSpeech
  * bertopic.representation.MaximalMarginalRelevance
  * bertopic.representation.Cohere
  * bertopic.representation.OpenAI
  * bertopic.representation.TextGeneration
  * bertopic.representation.LangChain
  * bertopic.representation.ZeroShotClassification
* Fix topic selection when extracting repr docs
* Improve documentation, #769, #954, #912
* Add wordcloud example to documentation
* Add title param for each graph, #800
* Improved nr_topics procedure
* Fix #952, #903, #911, #965. Add #976
Due to inactivity, I'll be closing this issue. Also updated the documentation. Let me know if you want me to re-open the issue!
I wanted to load a saved model and use the topic probabilities (instead of training again, because it takes a while). I generated `topic_assigned` by identifying the topic that has the highest probability for each document (the `probs` output has probabilities for all topics for each document). But the topic assigned by this method does not match the topics from the model: when I get the topic assignment from the model itself, the result of the topic assigned to each document from `probs_df` is different from `topic_assignment`. Did I not retrieve the probabilities correctly?
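For reference, a minimal sketch of the scenario described above (my reconstruction; the save path and variable names are assumptions, and it presumes `calculate_probabilities=True` was set when training):

```python
import numpy as np
import pandas as pd
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all",
                          remove=("headers", "footers", "quotes")).data[:1000]

# Train once with full probabilities, then save to avoid retraining later.
topic_model = BERTopic(calculate_probabilities=True)
topic_model.fit(docs)
topic_model.save("my_model")  # hypothetical path

# Later: load the saved model and reuse it instead of training again.
loaded = BERTopic.load("my_model")
topic_assignment, probs = loaded.transform(docs)

# Derive a topic per document from the highest probability.
probs_df = pd.DataFrame(probs)
topic_assigned = probs_df.idxmax(axis=1).to_numpy()

# The argmax-derived topics and the model's assignments can disagree,
# which is exactly the discrepancy reported in this issue.
mask = np.array(topic_assignment) != -1
print(np.mean(topic_assigned[mask] == np.array(topic_assignment)[mask]))
```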