Probabilities FAQ (and topic outcomes) #146

firmai · 2021-06-17T05:32:25Z

I should specify my purpose here: unlike Top2Vec, your creative solution lets me search for any keyword and get back a topic. It would be great if I could then use that topic to bring back a list of documents, and it would be good if some makeshift piece of code could be developed for that purpose. In the process, I have run into some rudimentary problems.

I am having a hard time mapping the probabilities to the topics. First, am I right in saying that probs drops the -1 topic and is otherwise in order from topic 0 to n?

Second, none of the probabilities match the topic outcomes. For example, in the below I am trying to find the covid documents but fail to do so. If you have no quick advice, I can produce a sample. If this is normal behaviour, what would be your thought process for achieving the goal mentioned first? Thanks once more :)

Originally posted by @firmai in #144 (comment)
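
For readers with the same goal, here is a minimal sketch of the "topic to documents" step. It assumes that `topics` is the per-document list of topic assignments (the predictions returned by fit_transform) and that `topic_id` is whatever topic the keyword search returned. The helper name `documents_for_topic` and the toy data are hypothetical, not part of the library.

```python
def documents_for_topic(docs, topics, topic_id):
    """Return the documents whose assigned topic equals `topic_id`."""
    return [doc for doc, topic in zip(docs, topics) if topic == topic_id]


# Toy data: four documents and their assigned topics (-1 marks an outlier).
docs = [
    "covid cases are rising",
    "the match ended in a draw",
    "new covid vaccine approved",
    "random unrelated note",
]
topics = [0, 1, 0, -1]

# Suppose the keyword search for "covid" returned topic 0.
covid_docs = documents_for_topic(docs, topics, topic_id=0)
print(covid_docs)
```

This relies only on the hard topic assignments, so it is unaffected by how the columns of the probability matrix are ordered.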

MaartenGr · 2021-06-17T06:00:57Z

Hmmm, I think you may have stumbled upon a bug! Just checked it and it seems that the probabilities are not correctly mapped to the topics.

After generating the topics with HDBSCAN, I updated their IDs so that they are in descending order. This update (or mapping) was not applied to the probabilities, which is why they do not seem correct.

To fix this, for now, you can simply use the dictionary topic_model.mapped_topics to map the indices of the probabilities to the correct topic. For example, if you want to know which topic belongs to index 0 of the probabilities, use topic_model.mapped_topics[0] to extract its corresponding topic.
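
As a rough illustration of that lookup, the sketch below uses a toy dictionary in place of topic_model.mapped_topics. The values are made up, and the direction (probability column index to updated topic ID) is assumed from the description above.

```python
# Toy stand-in for topic_model.mapped_topics (values invented for illustration).
# Assumed direction: probability column index -> updated topic ID.
mapped_topics = {-1: -1, 0: 2, 1: 0, 2: 1}

# One document's probabilities, one entry per column of `probs`.
doc_probs = [0.10, 0.75, 0.15]

# Column with the highest probability for this document.
best_column = max(range(len(doc_probs)), key=doc_probs.__getitem__)

# Translate the column index into the topic ID the model reports.
best_topic = mapped_topics[best_column]
print(best_column, best_topic)  # 1 0
```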

Let me know if this works out!

firmai · 2021-06-17T06:33:02Z

Okay cool, thanks for the response

import numpy as np
mapper = topic_model.mapped_topics
# Column r of the result is taken from column mapper[r] of probs.
fixed = np.array([probs[:, mapper[r]] for r in range(probs.shape[1])]).T

apaulonis · 2021-07-13T21:22:42Z

I ran across this same problem today. The code graciously posted by @firmai above did not work to remap the probabilities correctly. I had to invert the mapped_topics dictionary to get the correct remapping.

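import numpy as np
# Invert mapped_topics so that each updated topic ID points back to its original probability column.
rev_map = topic_model_full.mapped_topics
mapper = {new: old for old, new in rev_map.items()}
probs_fix = np.array([probs[:, mapper[r]] for r in range(probs.shape[1])]).T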

Inverting the dictionary is feasible in this case because it appears that the mapping is 1-to-1.
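
Since the inversion is only safe when the mapping really is 1-to-1, one cautious option is to check that while inverting. The sketch below is an illustration with a hypothetical helper `invert_mapping` and a toy dictionary, not library code.

```python
def invert_mapping(mapping):
    """Return {value: key} for `mapping`, refusing to invert if values repeat."""
    inverse = {}
    for key, value in mapping.items():
        if value in inverse:
            raise ValueError(
                f"not 1-to-1: {inverse[value]} and {key} both map to {value}"
            )
        inverse[value] = key
    return inverse


# Toy stand-in for mapped_topics (values invented for illustration).
mapped_topics = {-1: -1, 0: 2, 1: 0, 2: 1}
inverse = invert_mapping(mapped_topics)
print(inverse)
```

If two original IDs ever shared one updated ID, the plain dict(zip(...)) inversion would silently keep only one of them, whereas this version raises an error.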

YuanyuanLi96 · 2021-07-18T23:35:03Z

Hi, have you checked whether your results are all correct? I tried @apaulonis's method, and there is still a small proportion of inconsistency between argmax(probs_fix) and the predictions returned by fit_transform.
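
One way to put a number on that inconsistency is sketched below. It assumes the remapped probabilities are available as rows of per-topic values and that documents predicted as -1 (outliers) should be skipped, since they have no column to compare against. The helper `mismatch_rate` and the toy data are hypothetical.

```python
def mismatch_rate(predictions, probs_rows):
    """Fraction of non-outlier documents whose argmax column differs from the prediction."""
    compared = 0
    mismatched = 0
    for topic, row in zip(predictions, probs_rows):
        if topic == -1:  # outliers have no column to compare against
            continue
        compared += 1
        argmax = max(range(len(row)), key=row.__getitem__)
        if argmax != topic:
            mismatched += 1
    return mismatched / compared if compared else 0.0


# Toy data: five documents, three topics; the last document disagrees.
predictions = [0, 1, -1, 2, 1]
probs_fix = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.3, 0.4],
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
]
rate = mismatch_rate(predictions, probs_fix)
print(rate)  # 0.25
```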

MaartenGr · 2021-07-19T08:28:31Z

I haven't checked the solution posted above, but I am currently working on a fix, which indeed still leaves a small proportion of documents inconsistent. This is most likely due to some issue with the soft-clustering procedure in HDBSCAN. You can find some issues mentioning this problem here. I might be mistaken, but there currently does not seem to be a fix for this.

PamPijnenborg21 · 2021-07-21T07:02:44Z

@MaartenGr can you please let us know in this channel when you map the probabilities correctly to the updated topic numbers? I would like to get the probability of a document being assigned to a topic. Indeed, when I set nr_topics='auto', the number of topics is 135, while the probabilities matrix has 217 columns. So it seems the probabilities matrix is not updated with the newly assigned topics.

MaartenGr · 2021-07-21T08:39:21Z

@PamPijnenborg21 You can already find a fix for this issue in this pull request. It should resolve the issue somewhat but there might be a small percentage of discrepancies still out there due to the issue mentioned above.

firmai closed this as completed Jun 17, 2021

MaartenGr mentioned this issue Jun 26, 2021

Mismatching between topics and probabilities #157

Closed

Hannah-key mentioned this issue Apr 25, 2022

Still different results from argmax(probs) to topics #518

Closed
