Probabilities FAQ (and topic outcomes) #146

Closed
firmai opened this issue Jun 17, 2021 · 7 comments

@firmai

firmai commented Jun 17, 2021

I should specify my purpose here: unlike Top2Vec, your creative solution allows me to search for any keyword, which returns a topic. How great would it be if I could use that topic outcome to bring back a list of documents? It would be good if some makeshift piece of code could be developed for that purpose. In this process, I have run into some rudimentary problems.

I am having a hard time mapping the probabilities to the topics. First, am I right in saying that probs drops the -1 topic and is otherwise in order from topic 0 to n?

Second, none of the probabilities match the topic outcomes. For example, in the image below I am trying to find the covid documents but fail to do so. If you have no quick advice, I can produce a sample. If this is normal behaviour, what would be your thought process for achieving the first-mentioned goal? Thanks once more :)

image

Originally posted by @firmai in #144 (comment)
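
A minimal sketch of the document lookup described above, using toy data rather than a fitted model. It assumes `topics` and `probs` have the shapes returned by `fit_transform` (one topic ID per document, one probability column per non-outlier topic) and that the probability columns are already aligned with the topic IDs, which is exactly what this issue calls into question. The helper names `docs_for_topic` and `docs_above_threshold` are hypothetical, not part of the library.

```python
import numpy as np

# Toy stand-ins for the documents and for the outputs of fit_transform.
docs = [
    "covid cases rise again",
    "stock markets rally",
    "new covid vaccine approved",
    "an unrelated outlier",
    "bond yields fall",
]
topics = [0, 1, 0, -1, 1]          # one topic ID per document, -1 = outlier
probs = np.array([                 # one column per non-outlier topic
    [0.90, 0.05],
    [0.10, 0.80],
    [0.70, 0.20],
    [0.30, 0.30],
    [0.05, 0.85],
])


def docs_for_topic(docs, topics, topic_id):
    """Return the documents whose hard assignment equals topic_id."""
    return [doc for doc, t in zip(docs, topics) if t == topic_id]


def docs_above_threshold(docs, probs, column, threshold):
    """Return the documents whose probability in `column` is >= threshold.

    Assumes `column` already refers to the intended topic.
    """
    idx = np.flatnonzero(probs[:, column] >= threshold)
    return [docs[i] for i in idx]


covid_topic = 0  # hypothetical topic ID returned by a keyword search
hard_matches = docs_for_topic(docs, topics, covid_topic)
soft_matches = docs_above_threshold(docs, probs, covid_topic, 0.25)
```

The hard-assignment lookup only needs `topics`, so it is unaffected by any misalignment of the probability columns; the threshold lookup is the part that depends on the mapping being correct.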

@MaartenGr
Owner

Hmmm, I think you may have stumbled upon a bug! Just checked it and it seems that the probabilities are not correctly mapped to the topics.

After generating the topics with HDBSCAN, I updated their IDs so that they are in descending order. This update (or mapping) was not applied to the probabilities, which is why they do not seem correct.

To fix this, for now, you can simply use the dictionary topic_model.mapped_topics to map the indices of the probabilities to the correct topics. For example, if you want to know which topic belongs to index 0 of the probabilities, use topic_model.mapped_topics[0] to extract its corresponding topic.

Let me know if this works out!

@firmai
Author

firmai commented Jun 17, 2021

Okay cool, thanks for the response

mapper = topic_model.mapped_topics
fixed = np.array([probs[:, mapper[r]] for r in range(probs.shape[1])]).T

@apaulonis

I ran across this same problem today. The code graciously posted by @firmai above did not work to remap the probabilities correctly. I had to invert the mapped_topics dictionary to get the correct remapping.

rev_map = topic_model_full.mapped_topics
mapper = {new: old for old, new in rev_map.items()}
probs_fix = np.array([probs[:, mapper[r]] for r in range(probs.shape[1])]).T

Inverting the dictionary is feasible in this case because it appears that the mapping is 1-to-1.
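
A self-contained sketch of the inverted remapping on toy data, to make the direction of the mapping explicit. It assumes, following the description earlier in this thread, that `mapped_topics` maps a probability column index (the original topic ID) to the updated topic ID, and that it may contain a `-1: -1` entry for outliers. The function name `remap_probabilities` and the toy dictionary are hypothetical; the explicit check replaces the "appears to be 1-to-1" assumption with an error if it does not hold.

```python
import numpy as np


def remap_probabilities(probs, mapped_topics):
    """Reorder columns so that column j holds the probability of updated topic j.

    mapped_topics is assumed to be {original_id: updated_id}.
    """
    n_topics = probs.shape[1]
    # Invert the mapping, ignoring the outlier topic (-1) if present.
    new_to_old = {
        new: old
        for old, new in mapped_topics.items()
        if old != -1 and new != -1
    }
    # The inversion is only valid for a 1-to-1 mapping over 0..n_topics-1.
    if sorted(new_to_old) != list(range(n_topics)):
        raise ValueError("mapped_topics is not a 1-to-1 mapping over the columns")
    order = [new_to_old[new] for new in range(n_topics)]
    return probs[:, order]


# Toy example: original topic 0 became 2, 1 became 0, and 2 became 1.
mapped_topics = {-1: -1, 0: 2, 1: 0, 2: 1}
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
])
fixed = remap_probabilities(probs, mapped_topics)
```

In the toy data the value 0.7 sits in original column 0, which was renamed to topic 2, so it ends up in column 2 of `fixed`. Indexing with `mapped_topics` directly, without inverting it, would place it in column 1 instead.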

@YuanyuanLi96

Hi, have you checked whether your results are all correct? I tried @apaulonis's method, and there is still a small proportion of inconsistencies between argmax(probs_fix) and the predictions returned by fit_transform.
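
One way to quantify that inconsistency, sketched on toy data. It assumes `probs_fix` has one column per topic ID starting at 0 and that outlier documents are labelled -1 in the predictions; outliers are excluded because they have no column to compare against. The function name `agreement_rate` is hypothetical.

```python
import numpy as np


def agreement_rate(probs_fix, predictions):
    """Fraction of non-outlier documents whose argmax column equals the prediction."""
    predictions = np.asarray(predictions)
    assigned = predictions != -1          # skip outliers
    if not assigned.any():
        return float("nan")
    best = probs_fix[assigned].argmax(axis=1)
    return float((best == predictions[assigned]).mean())


probs_fix = np.array([
    [0.2, 0.1, 0.7],
    [0.1, 0.8, 0.1],
    [0.5, 0.3, 0.2],
    [0.3, 0.3, 0.4],
])
predictions = np.array([2, 1, 1, -1])

rate = agreement_rate(probs_fix, predictions)

# Indices of the documents where the two disagree, for manual inspection.
assigned = predictions != -1
mismatched = np.flatnonzero(assigned & (probs_fix.argmax(axis=1) != predictions))
```

Here two of the three non-outlier documents agree, and the third document (index 2) is the one to inspect.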

@MaartenGr
Owner

I haven't checked the solution posted above, but I am currently working on a fix, which indeed still leaves a small proportion of inconsistent results. This is most likely due to an issue with the soft-clustering procedure in HDBSCAN. You can find some issues mentioning this problem here. I might be mistaken, but there currently does not seem to be a fix for this.

@PamPijnenborg21

@MaartenGr can you please let us know in this channel when you map the probabilities correctly to the updated topic numbers? I would like to get the probability of a document being assigned to a topic. When I set nr_topics='auto', the number of topics is 135, while the probabilities matrix has 217 columns. So indeed, it seems that the probabilities matrix is not updated with the newly assigned topics.

@MaartenGr
Owner

@PamPijnenborg21 You can already find a fix for this issue in this pull request. It should largely resolve the issue, but a small percentage of discrepancies may remain due to the HDBSCAN soft-clustering issue mentioned above.
