
Is there a way to retrieve the words used to generate the tf-idf? #331

Closed
sgdantas opened this issue Nov 17, 2021 · 4 comments

@sgdantas

Hey, I saw this issue and I wanted to get the P(word|topic)
#144

You suggested accessing it using model.c_tf_idf, but I still need the words that were used to generate the sparse matrix.
By looking at the source code, I saw where that's defined, but it doesn't seem easy to access.

Is there a "standard" way to get it?

Thanks!

@MaartenGr
Owner

The model.c_tf_idf matrix is generated from the CountVectorizer used to model the frequencies of words. You can use model.vectorizer_model to access that vectorizer and extract the words through model.vectorizer_model.get_feature_names().
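For example, something along these lines (a minimal sketch, assuming `model` is your already-fitted BERTopic instance):

```python
# Minimal sketch: map the columns of the c-TF-IDF matrix back to their words.
# Assumes `model` is an already-fitted BERTopic instance.
words = model.vectorizer_model.get_feature_names()  # get_feature_names_out() on newer scikit-learn
c_tf_idf = model.c_tf_idf                           # sparse matrix, shape (n_topics + 1, n_words)

# Column j of c_tf_idf corresponds to words[j]
print(c_tf_idf.shape, len(words))
```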

Hopefully, this helps!

@sgdantas
Author

That works, thanks Maarten!
I just have two last questions:

  • The sparse matrix is (num_topics + 1) x (num_words); is the last row related to topic -1?
  • I read in the docs that cleaning the text is not necessary, as we can often rely on the contextual embeddings to "get" the overall meaning of documents. However, when looking at the words, I see lots of numbers that don't seem very useful, and for some topics I see stopwords as the top_n_words. In this case, do you think it is worth cleaning the text a bit? The overall topic quality is OK, but I wonder if I can improve it.

Thanks a lot for all your hard work, and for making it available for everyone :)

@MaartenGr
Owner

No problem, glad to hear that it works!

The sparse matrix is (num_topics + 1) x (num_words); is the last row related to topic -1?

No, the first row is related to topic -1, then topic 0, then topic 1, and so on. So if you want to access the c-TF-IDF representation for topic 23, you will have to access topic_model.c_tf_idf[23+1].
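As a rough sketch of that offset (assuming `topic_model` is your fitted BERTopic instance and that you also want the words behind each column):

```python
import numpy as np

# Rough sketch of the row offset: row 0 holds topic -1, row 1 holds topic 0, etc.
# Assumes `topic_model` is an already-fitted BERTopic instance.
topic = 23
row = topic_model.c_tf_idf[topic + 1]
words = topic_model.vectorizer_model.get_feature_names()

# Top 10 words for this topic, ranked by their c-TF-IDF score
scores = np.asarray(row.todense()).ravel()
top_idx = scores.argsort()[::-1][:10]
print([(words[i], float(scores[i])) for i in top_idx])
```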

In this case, do you think it is worth cleaning the text a bit?

In general, it is not necessary to clean the text. However, as in your case, that does not mean it will never be helpful. In your use case, it seems that stopwords are finding their way into the topic representations, and I can definitely imagine not wanting them there.

There are two ways of approaching this. First, you can indeed clean the text up a bit. It might negatively influence the clustering quality, but I would not be too worried about that. This way, you are working on the text directly, which influences both the clustering and the topic representations. Second, you can change only the topic representation by passing a custom CountVectorizer when instantiating BERTopic. In that vectorizer, you can set stop_words so that stopwords are removed only when creating the topic representation, not when creating the embeddings. I believe the second option will be the quickest to implement and will most likely result in the biggest improvement.
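A minimal sketch of that second option (assuming `docs` is your own list of raw documents):

```python
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# Stopwords are removed only when building the topic representation,
# not when embedding the documents.
vectorizer_model = CountVectorizer(stop_words="english")
topic_model = BERTopic(vectorizer_model=vectorizer_model)

# `docs` is assumed to be your own list of raw documents
topics, probs = topic_model.fit_transform(docs)
```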

@sgdantas
Author

Awesome, thanks for the suggestions!
