Topic-word matrix #144
Sorry for the late response! Although there are topic-word distributions, they do not fully represent the quality of the generated topic representations. What happens is the following. After generating topic clusters, c-TF-IDF is used to find the words that are most interesting for each topic compared to all other topics. The resulting c-TF-IDF matrix can then be used to get the topic-word distribution. However, after generating the c-TF-IDF matrix, MMR (Maximal Marginal Relevance) is used not only to diversify the topic representations but also to increase their coherence. This means that the c-TF-IDF matrix on its own is not fully representative of the final topic representation. Moreover, we only extract the top n words from the c-TF-IDF matrix, since passing all of them through MMR would make little sense. Having said all that, if you want to extract the topic-word distribution purely based on c-TF-IDF (which does not reflect the true quality of BERTopic), then I would advise you to access the topic-word distribution with
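To make the c-TF-IDF step described above more concrete, here is a minimal, dependency-free sketch. It is not BERTopic's implementation: the helper names `c_tf_idf` and `to_distribution` are made up for illustration, tokenisation is a plain whitespace split, and the weighting (term frequency within a topic times `log(1 + A / f)`, with `A` the average number of words per topic and `f` the word's total frequency) is an assumption about how class-based TF-IDF is commonly defined. The accessor for the real matrix is not shown in this thread, so the sketch builds its own toy matrix instead.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Toy class-based TF-IDF: one row per topic, one column per vocabulary word."""
    # Treat all documents of a topic as one whitespace-tokenised "class document".
    counts = [Counter(" ".join(docs).lower().split()) for docs in docs_per_topic]
    vocab = sorted(set().union(*counts))
    # A: average number of words per topic; total_freq: frequency of each word over all topics.
    avg_words = sum(sum(c.values()) for c in counts) / len(counts)
    total_freq = {w: sum(c[w] for c in counts) for w in vocab}
    matrix = []
    for c in counts:
        n_words = sum(c.values())
        matrix.append(
            [(c[w] / n_words) * math.log(1 + avg_words / total_freq[w]) for w in vocab]
        )
    return vocab, matrix


def to_distribution(row):
    """Rescale a non-negative weight row so that it sums to 1."""
    total = sum(row)
    return [x / total for x in row] if total > 0 else row


# Hypothetical toy input: two topics, each given as a list of documents.
vocab, weights = c_tf_idf([["covid vaccine covid"], ["football match goal"]])
topic_word = [to_distribution(row) for row in weights]  # |topics| x |vocabulary|
```

Each row of `topic_word` is then a distribution over the whole vocabulary, based purely on c-TF-IDF weights and without the MMR step.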
Understood, thank you so much!
I am having a hard time mapping the probabilities to the topics. There is one less row than there are topics, and I don't understand the ordering. Are the probabilities ordered by topic frequency or by topic number? For example, below I am trying to find the covid documents, but fail to do so. If you have no quick advice, I can produce a sample.
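On the ordering question, here is a minimal sketch under one loudly stated assumption: that column `i` of the per-document probability matrix belongs to topic `i`, and that the outlier topic `-1` has no column, which would account for the one missing entry. This is not confirmed anywhere in the thread, so it is worth checking against the topic overview of your own fitted model. The function name `docs_for_topic` and the threshold are hypothetical.

```python
def docs_for_topic(probabilities, topic_id, threshold=0.5):
    """Return indices of documents whose probability for topic_id meets the threshold.

    Assumes column i corresponds to topic i and that topic -1 (outliers) has no column.
    """
    if topic_id < 0:
        raise ValueError("the outlier topic -1 is assumed to have no probability column")
    return [i for i, row in enumerate(probabilities) if row[topic_id] >= threshold]


# Hypothetical probabilities for three documents over two topics.
probabilities = [
    [0.90, 0.05],
    [0.10, 0.80],
    [0.60, 0.30],
]
matches = docs_for_topic(probabilities, topic_id=0)
```

With these toy values, documents 0 and 2 are returned for topic 0.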
I should specify my purpose here: unlike Top2Vec, your solution allows me to search for any keyword, which returns a topic. It would be great if I could then use that topic to bring back a list of documents, so some makeshift piece of code for that purpose would be very welcome.
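One makeshift way to do this is to skip the probabilities and filter on the per-document topic assignments instead. The sketch below assumes you kept the list of topic IDs returned when fitting the model, in the same order as the documents, and that the keyword search has already given you a topic ID. The function name `documents_in_topic` and the sample data are hypothetical.

```python
def documents_in_topic(docs, topics, topic_id):
    """Return every document whose assigned topic equals topic_id."""
    if len(docs) != len(topics):
        raise ValueError("docs and topics must be the same length and in the same order")
    return [doc for doc, assigned in zip(docs, topics) if assigned == topic_id]


# Hypothetical data: documents plus the topic ID assigned to each one.
docs = ["covid cases rise", "new football season", "covid vaccine rollout", "misc text"]
topics = [0, 1, 0, -1]

# Suppose the keyword search for "covid" returned topic 0 as the best match.
covid_docs = documents_in_topic(docs, topics, topic_id=0)
```

Filtering on assignments sidesteps the column-ordering question entirely, at the cost of ignoring documents that only partially belong to the topic.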
Thank you for BERTopic. I have recently started experimenting with this model.
I am trying to evaluate the diversity of topics using the Kullback-Leibler divergence. Is there a way I can get the topic-word distribution, i.e. the distribution over the words of the vocabulary for each topic, with dimensions |num topics| x |vocabulary|?
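For reference, once a |num topics| x |vocabulary| matrix of non-negative weights is available, the divergence itself could be computed roughly as follows. This is a sketch rather than an established evaluation recipe: the small `eps` smoothing (to avoid `log(0)` for words with zero weight in a topic) and averaging over all ordered topic pairs are assumptions, and both function names are made up for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two weight vectors over the same vocabulary."""
    # Smooth and renormalise so zero-weight words cause neither log(0) nor division by zero.
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sum_p, sum_q = sum(p), sum(q)
    return sum(
        (a / sum_p) * math.log((a / sum_p) / (b / sum_q)) for a, b in zip(p, q)
    )


def mean_pairwise_kl(distributions):
    """Average KL divergence over all ordered pairs of distinct topics."""
    n = len(distributions)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    return sum(kl_divergence(distributions[i], distributions[j]) for i, j in pairs) / len(pairs)


# Hypothetical topic-word distributions over a four-word vocabulary.
topic_word = [
    [0.70, 0.20, 0.10, 0.00],
    [0.05, 0.05, 0.30, 0.60],
    [0.25, 0.25, 0.25, 0.25],
]
diversity = mean_pairwise_kl(topic_word)
```

Because KL divergence is asymmetric, both orderings of each topic pair are included in the average; a higher value suggests the topics put their weight on more distinct words.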