
Is there a way to retrieve the words used to generate the tf-idf? #331

Closed
sgdantas opened this issue Nov 17, 2021 · 4 comments

@sgdantas

Hey, I saw this issue and I wanted to get the P(word|topic)
#144

You suggested accessing it using model.c_tf_idf, but I still need the words that were used to generate the sparse matrix.
By looking at the source code, I saw where that's defined, but it doesn't seem easy to access.

Is there a "standard" way to get it?

Thanks!

@MaartenGr
Owner

The model.c_tf_idf matrix is generated from the CountVectorizer used to model the frequencies of words. You can use model.vectorizer_model to access that vectorizer and extract the words through model.vectorizer_model.get_feature_names().
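For example, something along these lines (a minimal sketch, assuming `model` is your already-fitted BERTopic instance):

```python
# Minimal sketch: map the columns of the c-TF-IDF matrix back to their words.
# Assumes `model` is an already-fitted BERTopic instance.
words = model.vectorizer_model.get_feature_names()  # get_feature_names_out() on newer scikit-learn
c_tf_idf = model.c_tf_idf                           # sparse matrix, shape (n_topics + 1, n_words)

# Column j of c_tf_idf corresponds to words[j]
print(c_tf_idf.shape, len(words))
```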

Hopefully, this helps!

@sgdantas
Author

That works, thanks Maarten!
I just have two last questions:

  • The sparse matrix is (num_topics + 1) x (num_words); is the last row related to topic -1?
  • I read in the docs that cleaning the text is not necessary, as we can often rely on the contextual embeddings to "get" the overall meaning of documents. However, when looking at the words, I see lots of numbers that don't seem very useful, and for some topics I see stopwords as the top_n_words. In this case, do you think it is worth cleaning the text a bit? The overall topic quality is OK, but I wonder if I can improve it.

Thanks a lot for all your hard work, and for making it available for everyone :)

@MaartenGr
Owner

No problem, glad to hear that it works!

The sparse matrix is (num_topics + 1) x (num_words); is the last row related to topic -1?

No, the first row is related to topic -1, then topic 0, then topic 1, and so on. So if you want to access the c-TF-IDF representation for topic 23, you will have to access topic_model.c_tf_idf[23+1].
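As a rough sketch of that offset (assuming `topic_model` is your fitted BERTopic instance and that you also want the words behind each column):

```python
import numpy as np

# Rough sketch of the row offset: row 0 holds topic -1, row 1 holds topic 0, etc.
# Assumes `topic_model` is an already-fitted BERTopic instance.
topic = 23
row = topic_model.c_tf_idf[topic + 1]
words = topic_model.vectorizer_model.get_feature_names()

# Top 10 words for this topic, ranked by their c-TF-IDF score
scores = np.asarray(row.todense()).ravel()
top_idx = scores.argsort()[::-1][:10]
print([(words[i], float(scores[i])) for i in top_idx])
```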

In this case, do you think it is worth cleaning the text a bit?

In general, it is not necessary to clean the text. However, as in your case, that does not mean it will never be helpful. In your use case, it seems that stopwords are finding their way into the topic representations, and I can definitely imagine not wanting them there.

There are two ways of approaching this. First, you can indeed clean the text up a bit. It might negatively influence the clustering quality, but I would not be too worried about that. This way, you are working on the text directly, which influences both the clustering and the topic representations. Second, you can change only the topic representation by passing a custom CountVectorizer when instantiating BERTopic. In that vectorizer, you can set stop_words so that stopwords are removed only when creating the topic representation, not when creating the embeddings. I believe the second option will be the quickest to implement and will most likely result in the biggest improvement.
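A minimal sketch of that second option (assuming `docs` is your own list of raw documents):

```python
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# Stopwords are removed only when building the topic representation,
# not when embedding the documents.
vectorizer_model = CountVectorizer(stop_words="english")
topic_model = BERTopic(vectorizer_model=vectorizer_model)

# `docs` is assumed to be your own list of raw documents
topics, probs = topic_model.fit_transform(docs)
```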

@sgdantas
Author

Awesome, thanks for the suggestions!
