Probabilities FAQ (and topic outcomes) #146
Hmmm, I think you may have stumbled upon a bug! I just checked, and it seems that the probabilities are not correctly mapped to the topics. After generating the topics with HDBSCAN, I updated their IDs to put them in descending order. This update (or mapping) was not applied to the probabilities, which is why they do not seem correct. To fix this, for now, you can simply use the mapping dictionary to reorder the probabilities yourself. Let me know if this works out!
Okay cool, thanks for the response
I ran into this same problem today. The code graciously posted by @firmai above did not remap the probabilities correctly; I had to invert the mapped_topics dictionary to get the correct remapping.
Inverting the dictionary is feasible in this case because the mapping appears to be 1-to-1.
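The inversion described above can be sketched roughly as follows. This is an illustration under assumptions, not the library's code: `remap_probabilities` is a hypothetical helper, `mapped_topics` is assumed to be a plain dict from old topic ID to new topic ID, the new IDs are assumed to run contiguously from 0, and -1 is assumed to be the outlier label with no probability column.

```python
def remap_probabilities(probs, mapped_topics):
    """Reorder probability columns from old topic IDs to new topic IDs.

    probs: one row per document, one column per OLD topic ID (0..n-1).
    mapped_topics: dict {old_id: new_id}, assumed one-to-one (hypothetical input).
    """
    # The outlier topic -1 has no probability column, so leave it out.
    mapping = {old: new for old, new in mapped_topics.items()
               if old != -1 and new != -1}

    # Inverting is only safe when no two old IDs share a new ID.
    inverse = {new: old for old, new in mapping.items()}
    if len(inverse) != len(mapping):
        raise ValueError("mapped_topics is not one-to-one; cannot invert")

    # Column j of the result is the old column that was renamed to topic j.
    order = [inverse[new] for new in sorted(inverse)]
    return [[row[old] for old in order] for row in probs]


# Toy data: old topic 1 became new topic 0, old 2 became 1, old 0 became 2.
probs = [[0.1, 0.7, 0.2],
         [0.6, 0.3, 0.1]]
mapped_topics = {-1: -1, 0: 2, 1: 0, 2: 1}
probs_fix = remap_probabilities(probs, mapped_topics)
```

If `probs` is a NumPy array, indexing with `probs[:, order]` performs the same column reordering.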
Hi, have you checked whether your results are all correct? I tried @apaulonis's method, and a small proportion of documents still show an inconsistency between argmax(probs_fix) and the predictions returned by fit_transform.
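One way to quantify that inconsistency is to compare the argmax of each remapped row with the predicted topic. The sketch below uses toy data and a hypothetical helper, `mismatch_rate`; it assumes column j of `probs_fix` corresponds to topic j and that documents predicted as -1 (outliers) should be skipped.

```python
def mismatch_rate(probs_fix, predictions):
    """Fraction of non-outlier documents whose argmax column differs from the prediction."""
    compared = 0
    mismatches = 0
    for row, predicted in zip(probs_fix, predictions):
        if predicted == -1:  # outliers have no probability column to compare against
            continue
        compared += 1
        best = max(range(len(row)), key=lambda j: row[j])  # argmax of the row
        if best != predicted:
            mismatches += 1
    return mismatches / compared if compared else 0.0


# Toy data: the third document disagrees, the fourth is an outlier.
probs_fix = [[0.7, 0.2, 0.1],
             [0.3, 0.1, 0.6],
             [0.2, 0.5, 0.3],
             [0.4, 0.4, 0.2]]
predictions = [0, 2, 2, -1]
rate = mismatch_rate(probs_fix, predictions)
```

A rate well above a few percent would suggest the columns are still misaligned rather than a soft-clustering discrepancy.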
I haven't checked the solution posted above, but I am currently working on a fix, which indeed still leaves a small proportion inconsistent. This is most likely due to some issue with the soft-clustering procedure in HDBSCAN. You can find some issues mentioning this problem here. I might be mistaken, but there currently does not seem to be a fix for this.
@MaartenGr can you please let us know in this channel when you map the probabilities correctly to the updated topic numbers? I would like to get the probability of a document being assigned to a topic. When I set nr_topics='auto', the number of topics is 135, while the probabilities matrix has 217 columns. So it does seem that the probabilities matrix is not updated with the newly assigned topics.
@PamPijnenborg21 You can already find a fix for this issue in this pull request. It should resolve the issue somewhat, but there might still be a small percentage of discrepancies due to the issue mentioned above.
I should specify my purpose here: unlike Top2Vec, your creative solution allows me to search for any keyword, which returns a topic. How great would it be if I could use that topic outcome to bring back a list of documents? It would be good if some makeshift piece of code could be developed for that purpose. In this process, I have run into some rudimentary problems.
I am having a hard time mapping the probabilities to the topics. First, am I right in saying that probs drops the -1 topic, and is otherwise in order from topic 0 to n?
Second, none of the probabilities match the topic outcomes. For example, in the below I am trying to find the covid documents, but fail to do so. If you have no quick advice, I can produce a sample. If this is normal behaviour, what would be your thought process to achieve the first mentioned? Thanks once more :)
Originally posted by @firmai in #144 (comment)
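A makeshift version of the document lookup asked for here might look like the sketch below. It is only an illustration on toy data: `documents_for_topic` and `top_documents_by_probability` are hypothetical helpers, `topics` is assumed to be the per-document list of topic IDs returned by fit_transform, and the ranking variant assumes the probability columns are already aligned with the topic IDs (column j for topic j, no column for -1).

```python
def documents_for_topic(docs, topics, topic_id):
    """Return the documents whose assigned topic equals topic_id."""
    return [doc for doc, topic in zip(docs, topics) if topic == topic_id]


def top_documents_by_probability(docs, probs, topic_id, k=5):
    """Return up to k documents ranked by their probability for topic_id.

    Assumes column j of probs holds the probability of topic j.
    """
    ranked = sorted(range(len(docs)), key=lambda i: probs[i][topic_id], reverse=True)
    return [docs[i] for i in ranked[:k]]


# Toy data: topic 0 plays the role of the topic found by a keyword search.
docs = ["covid cases rise", "new phone released",
        "vaccine rollout begins", "stock market dips"]
topics = [0, 1, 0, -1]
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.3]]

assigned = documents_for_topic(docs, topics, 0)
ranked = top_documents_by_probability(docs, probs, 0, k=2)
```

The first helper relies only on the hard topic assignments, so it is unaffected by any misalignment in the probability columns; the second is only meaningful once the columns have been remapped.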