Topic-word matrix #144

Closed
gouthamgithub opened this issue Jun 14, 2021 · 4 comments

@gouthamgithub

Thank you for BERTopic. I have recently started experimenting with this model.

I am trying to evaluate the diversity of topics using the Kullback-Leibler divergence metric. Is there a way I can get the topic-word distribution (the distribution of the words of the vocabulary for each topic, with dimensions |num topics| x |vocabulary|)?

@MaartenGr
Owner

Sorry for the late response!

Although topic-word distributions exist, they do not fully represent the quality of the generated topic representations. What happens is the following: after generating topic clusters, c-TF-IDF is used to find the most interesting words per topic compared to all other topics. The resulting c-TF-IDF matrix can then be used to get the topic-word distribution.

However, after generating the c-TF-IDF matrix, MMR (Maximal Marginal Relevance) is used not only to diversify the topic representations but also to increase their coherence. This means that the c-TF-IDF matrix is then not fully representative of the topic representation. Moreover, we only extract the top n words from the c-TF-IDF matrix; using all of them would make little sense in the context of MMR.

Having said all that, if you want to extract the topic-word distribution purely based on the c-TF-IDF (which does not reflect the true quality of BERTopic), then I would advise you to access the topic-word distribution with model.c_tf_idf.
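As a rough sketch of how that matrix could feed a KL-divergence comparison: the snippet below row-normalizes a |num topics| x |vocabulary| matrix so that each row sums to 1, then compares two topics. The helper names `topic_word_distribution` and `kl_divergence` are hypothetical and not part of BERTopic, the toy array only stands in for model.c_tf_idf, and the smoothing constant `eps` is an assumption added to avoid taking the log of zero.

```python
import numpy as np


def topic_word_distribution(c_tf_idf):
    """Row-normalize a (num_topics x vocabulary) matrix of non-negative scores."""
    # Accept either a scipy sparse matrix or a dense array.
    dense = c_tf_idf.toarray() if hasattr(c_tf_idf, "toarray") else np.asarray(c_tf_idf)
    dense = dense.astype(float)
    row_sums = dense.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # leave all-zero rows as zeros instead of dividing by 0
    return dense / row_sums


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) with a small smoothing term (an assumption, not a BERTopic setting)."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))


# Toy stand-in for model.c_tf_idf: 3 topics, 4 vocabulary words.
toy_c_tf_idf = np.array([
    [0.5, 0.3, 0.0, 0.2],
    [0.0, 0.1, 0.6, 0.3],
    [1.0, 1.0, 1.0, 1.0],
])

dist = topic_word_distribution(toy_c_tf_idf)
kl_01 = kl_divergence(dist[0], dist[1])
print(dist.shape, kl_01)
```

With a fitted model, the same helper could be called on model.c_tf_idf directly. Keep in mind the caveat above: this reflects the c-TF-IDF scores only, not the MMR-adjusted topic representations.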

@gouthamgithub
Author

Understood, thank you so much!

@firmai

firmai commented Jun 16, 2021

I am having a hard time mapping the probabilities to the topics. The probabilities have one less row than the topics, but I don't understand the ordering. Are the probabilities ordered by frequency or by topic number? For example, in the image below I am trying to find the covid documents, but fail to do so. If you have no quick advice, I can produce a sample.

[image]

@firmai

firmai commented Jun 16, 2021

I should specify my purpose here: unlike Top2Vec, your solution allows me to search for any keyword, which returns a topic. It would be great if I could use that topic to bring back a list of documents, and if some makeshift piece of code could be developed for that purpose.
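A makeshift version of that lookup might look like the following. It assumes you kept the per-document topic assignments returned when fitting the model (one topic id per document, in the same order as the documents), and that the topic id comes from a keyword search (in BERTopic that would presumably be `find_topics`). The helper `documents_for_topic` is hypothetical, and the documents and topic ids below are made up for illustration.

```python
def documents_for_topic(docs, topics, topic_id):
    """Return every document whose assigned topic equals topic_id."""
    return [doc for doc, assigned in zip(docs, topics) if assigned == topic_id]


# Made-up example data; in practice `topics` would be the per-document
# assignments from fitting the model on `docs`.
docs = [
    "covid cases are rising again",
    "the stock market fell today",
    "new covid vaccine approved",
    "an unrelated short note",
]
topics = [0, 1, 0, -1]  # -1 is the outlier topic

# Suppose the keyword search for "covid" pointed to topic 0.
covid_docs = documents_for_topic(docs, topics, 0)
print(covid_docs)
```

This sidesteps the probability matrix entirely and relies only on the hard topic assignment per document.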
