Probabilities loaded from a saved model not the same as probabilities after fit_transform() #769

Closed
PeggyFan opened this issue Oct 7, 2022 · 9 comments

Comments

@PeggyFan

PeggyFan commented Oct 7, 2022

I wanted to load a saved model and use its topic probabilities instead of training again, since training takes a while.
I generated topic_assigned by identifying, for each document, the topic with the highest probability.
(The probs output contains probabilities for all topics for each document.)

But the topics assigned this way do not match the topics from the model.

import pandas as pd
from bertopic import BERTopic

# Load the saved model and retrieve its stored probabilities and HDBSCAN assignments
topic_model = BERTopic.load("my_model")
probs = topic_model.probabilities_
topics = topic_model._map_predictions(topic_model.hdbscan_model.labels_)

# For each document, take the highest probability and the topic it belongs to
probs_df = pd.DataFrame(probs)
probs_df['main percentage'] = probs_df.max(axis=1)
probs_df['topic_assigned'] = probs_df.drop(columns=['main percentage']).idxmax(axis=1)

Then I get the model's own topic assignment (docs_train being the documents used for training) with

topic_assignment = pd.DataFrame({"Doc": docs_train, "Topic": topics})

The topic assigned to each document in probs_df differs from the one in topic_assignment.
Did I retrieve the probabilities incorrectly?

@MaartenGr
Owner

The probabilities that are being generated are not the ones on which HDBSCAN bases its assignments. They are approximated after the assignments have already been created, so the two can disagree since they are produced differently. You can find more about that here.
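
For intuition, here is a minimal sketch using hdbscan directly on toy data (not BERTopic's internals; min_cluster_size and the blob data are arbitrary choices) showing how the hard labels and the argmax of the soft-clustering membership vectors can disagree:

import hdbscan
import numpy as np
from sklearn.datasets import make_blobs

# Toy data; in BERTopic this would roughly correspond to the reduced document embeddings.
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)

# prediction_data=True is needed to compute the soft-clustering membership vectors.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, prediction_data=True).fit(X)

# Hard assignments made during clustering (-1 marks outliers).
hard_labels = clusterer.labels_

# Soft probabilities approximated *after* clustering, one column per cluster.
soft_probs = hdbscan.all_points_membership_vectors(clusterer)
soft_labels = np.argmax(soft_probs, axis=1)

# Outliers (and occasionally border points) end up with a different "best" cluster here.
print((hard_labels != soft_labels).sum(), "points differ between the two methods")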

@brhkim

brhkim commented Oct 15, 2022

Just popping in to say that I ran into this same phenomenon today and was pretty puzzled (a document's assigned topic is not necessarily the topic for which it has the highest probability).

Followed the advice above from @MaartenGr to better understand how the topic-assignment process is distinct from the probability-generating process. To link two resources that I think delineate it quite well:
https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html
^ Describes how the clusters are ultimately created, and thus how membership in the clusters is determined
https://hdbscan.readthedocs.io/en/latest/soft_clustering_explanation.html
^ Same as what Maarten linked; covers the intuition of how the probabilities are calculated.

Still stewing and percolating on my understanding of it, so I'll hold off on trying to give a summary. For now, I'd like to run some systematic manual checks to see which method of topic assignment I find more compelling against the actual text data, i.e., which topic the documents fit better (the one assigned by HDBSCAN, or the one with the highest probability). A preliminary look points towards the highest probability matching my intuition of what each topic means better, but obviously your mileage may vary. Would love to know if @MaartenGr has experience on this, especially if there are any performance metrics he's tested along these lines.
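
To make that comparison concrete, a small sketch of the kind of check described above. It assumes topic_model is a BERTopic model fitted with calculate_probabilities=True, so that probabilities_ is a document-topic matrix and topics_ holds the HDBSCAN assignments:

import numpy as np

# topic_model: a fitted BERTopic model with calculate_probabilities=True (assumption).
# HDBSCAN's assignments versus the topic with the highest soft-clustering probability.
hdbscan_topics = np.array(topic_model.topics_)
argmax_topics = np.argmax(topic_model.probabilities_, axis=1)

agreement = (hdbscan_topics == argmax_topics).mean()
print(f"Overall agreement: {agreement:.1%}")

# Outliers (-1) can never match an argmax, so report them separately.
mask = hdbscan_topics != -1
print(f"Agreement on non-outlier documents: {(hdbscan_topics[mask] == argmax_topics[mask]).mean():.1%}")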

@MaartenGr
Owner

@brhkim It is indeed tricky that the probabilities do not directly match the cluster-creation process. In practice, which method works best really depends on your use case. In general, I would advise focusing on the direct assignments rather than on the probabilities. The direct assignments are at the core of the model and best represent HDBSCAN's internal process. With respect to performance metrics, this also highly depends on the use case, and the appropriate metric changes with it. Are you looking to minimize outliers, to optimize topic coherence or topic quality, or something else? These questions change the performance metric entirely, so it would be misleading to give a one-dimensional view of what works better.

@brhkim

brhkim commented Oct 20, 2022

@MaartenGr Yes, that all totally reads -- would be nice to have a straightforward path, but rarely ever the case in these instances! Helpful to hear your intuition about it, in any case. Thanks for taking the time to respond, and thanks as always for this incredible tool.

@brhkim

brhkim commented Oct 20, 2022

Oh, and maybe more directly to this issue (and related ones that continue to pop up), I do wonder if (a) adding some documentation on this dynamic could be helpful, and (b) exposing a top-probability assignment as an explicit output in addition to the HDBSCAN cluster assignment (perhaps only when the probabilities output argument is on) might head off some of this confusion even in circumstances where people don't reference the documentation?

@MaartenGr
Owner

@brhkim Adding something to the documentation would definitely be helpful. Some careful thought needs to be given as to where it best fits. It might be worthwhile to put it in the FAQ, as it might clutter some of the other guides.

> (b) exposing a top-probability assignment as an explicit output in addition to the HDBSCAN cluster assignment (perhaps only when the probabilities output argument is on) might head off some of this confusion even in circumstances where people don't reference the documentation?

I am not sure if I understand you correctly, but the flat probabilities are calculated out of the box. However, they get replaced with topic-document probabilities if you use calculate_probabilities=True.
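
To illustrate the difference described above, a rough sketch of the two output shapes (a toy setup; exact return values can vary across BERTopic versions, so treat this as an assumption to verify locally):

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"][:1000]

# Default: probs holds the flat HDBSCAN probability of each document's assigned cluster.
topics, probs = BERTopic().fit_transform(docs)
print(probs.shape)   # (n_docs,)

# calculate_probabilities=True: probs becomes a document-topic matrix approximated via
# HDBSCAN's soft clustering, which is what can disagree with the assignments in `topics`.
topics, probs = BERTopic(calculate_probabilities=True).fit_transform(docs)
print(probs.shape)   # (n_docs, n_topics)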

@brhkim

brhkim commented Oct 24, 2022

@MaartenGr Right right, but my understanding is that the flat probabilities provided when calculate_probabilities=False come directly from hdbscan's probabilities_, right? In which case, they should always match the cluster assignments, and no confusion is created.

I guess my thinking is that this mismatch between the topic-document probabilities and the assigned HDBSCAN cluster only pops up when a user has set calculate_probabilities=True and observes that the topic-document probabilities derived from the soft-clustering calculations do not match the cluster assignments. Thus, it seems most important for documentation and user experience to address what happens when calculate_probabilities=True.

To my suggestion (a) above for documentation, I agree that adding a section in the FAQ makes sense, but maybe referencing and linking that section any time you mention the option for calculate_probabilities would also be helpful.

To my suggestion (b) above for the output provided by calculate_probabilities=True, I think my suggestion is to somehow make the output less intuitive to basically force the user to investigate what's going on if they try to use said output without further processing. One could imagine doing that by exposing both the flat probabilities AND the soft-clustering probabilities (so each row no longer adds up to 1 as typically expected and the column numbers no longer align perfectly to number of topics, thus requiring a closer look). You could instead automatically include the cluster assignment if one were to take the max of the topic-document probabilities for each document alongside the topic-document probabilities themselves (mismatch is then made apparent very early on, and again the structure of the output requires closer inspection to use). Perhaps in a roundabout way, my suggestion is to make the resulting output less intuitive to make it necessary for the user to acknowledge what's going on before they can proceed. Again, these wrinkles, or something like them, should only ever pop up if the user elects calculate_probabilities=True. Alternatively, you could imagine just throwing a warning instead -- maybe that's easiest? I don't do any python development, so not sure what the norms are around this sort of design question.

@MaartenGr
Owner

> To my suggestion (a) above for documentation, I agree that adding a section in the FAQ makes sense, but maybe referencing and linking that section any time you mention the option for calculate_probabilities would also be helpful.

I am not sure about doing too much cross-referencing, as it might pull the user away from a guide that is not directly about those probabilities but merely mentions them. However, making sure there are enough places where these things are explained would definitely help!

> To my suggestion (b) above for the output provided by calculate_probabilities=True, I think my suggestion is to somehow make the output less intuitive to basically force the user to investigate what's going on if they try to use said output without further processing. One could imagine doing that by exposing both the flat probabilities AND the soft-clustering probabilities (so each row no longer adds up to 1 as typically expected and the column numbers no longer align perfectly to number of topics, thus requiring a closer look). You could instead automatically include the cluster assignment if one were to take the max of the topic-document probabilities for each document alongside the topic-document probabilities themselves (mismatch is then made apparent very early on, and again the structure of the output requires closer inspection to use). Perhaps in a roundabout way, my suggestion is to make the resulting output less intuitive to make it necessary for the user to acknowledge what's going on before they can proceed. Again, these wrinkles, or something like them, should only ever pop up if the user elects calculate_probabilities=True. Alternatively, you could imagine just throwing a warning instead -- maybe that's easiest? I don't do any python development, so not sure what the norms are around this sort of design question.

All in all, I agree. It would be helpful if the effect of using calculate_probabilities=True were made more explicit, so that users have a bit more intuition about what exactly is happening there. I am not sure that changing the API is the way to go. There is the option of using warnings/infos, especially when verbose=True. Moreover, the most logical place for this might just be the docstrings in the API reference, since those do not depend on access to the online documentation and are closer to a user's development process.
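
As a purely hypothetical illustration of the warning idea mentioned above (not part of BERTopic's actual API or behavior), it could look something like this:

import warnings

calculate_probabilities = True  # hypothetical flag standing in for the BERTopic parameter

if calculate_probabilities:
    warnings.warn(
        "calculate_probabilities=True returns soft-clustering probabilities that are "
        "approximated after HDBSCAN's assignments; their argmax may not match the "
        "assigned topics.",
        UserWarning,
    )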

MaartenGr added a commit that referenced this issue Feb 4, 2023
MaartenGr mentioned this issue Feb 8, 2023
MaartenGr added a commit that referenced this issue Feb 14, 2023
* Add representation models
  * bertopic.representation.KeyBERTInspired
  * bertopic.representation.PartOfSpeech
  * bertopic.representation.MaximalMarginalRelevance
  * bertopic.representation.Cohere
  * bertopic.representation.OpenAI
  * bertopic.representation.TextGeneration
  * bertopic.representation.LangChain
  * bertopic.representation.ZeroShotClassification
* Fix topic selection when extracting repr docs
* Improve documentation, #769, #954, #912
* Add wordcloud example to documentation
* Add title param for each graph, #800
* Improved nr_topics procedure
* Fix #952, #903, #911, #965. Add #976
@MaartenGr
Owner

Due to inactivity, I'll be closing this issue. I have also updated the documentation. Let me know if you want me to re-open the issue!
