BERTopic vs. Top2Vec #372

Closed
ma-ji opened this issue Dec 15, 2021 · 2 comments

@ma-ji
ma-ji commented Dec 15, 2021

Hi @MaartenGr , thanks for making this great package!

I'm working on a multilingual topic modeling project (en+zh) and have surveyed some papers and packages on multilingual TM. I will probably choose between BERTopic and Top2Vec. Would you mind briefly describing the advantages of BERTopic over Top2Vec? My superficial understanding is that the two packages are very similar (and you also mentioned the latter in your blog post). So when should we use BERTopic, and when Top2Vec?

Thank you very much!

@MaartenGr
Owner

Thank you for the kind words! There are several differences between BERTopic and Top2Vec that might be interesting to you:

First, the embedding models that the two packages use typically differ. Top2Vec works exceptionally well with Doc2Vec, because it assumes that the document and word embeddings lie in the same vector space. BERTopic, in contrast, relaxes that assumption by separating the embedding stage from the topic creation stage. Depending on your use case, one might be preferred over the other.
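To make the separation of stages concrete, here is a minimal sketch of plugging a multilingual sentence encoder into BERTopic through its `embedding_model` argument. The helper name `build_multilingual_topic_model` is hypothetical, and the choice of `paraphrase-multilingual-MiniLM-L12-v2` is an assumption for an en+zh corpus rather than a recommendation from this thread.

```python
def build_multilingual_topic_model(embedding_name="paraphrase-multilingual-MiniLM-L12-v2"):
    """Return an unfitted BERTopic model backed by a multilingual encoder.

    Hypothetical helper: the embedding stage is chosen here, independently
    of how topics are later created from the clustered documents.
    """
    # Imported lazily so the sketch can be read without the optional
    # dependencies (bertopic, sentence-transformers) installed.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer(embedding_name)
    return BERTopic(embedding_model=embedder)
```

With a list of raw documents `docs`, fitting would then look like `topics, probs = build_multilingual_topic_model().fit_transform(docs)`. Swapping the encoder only changes the first stage; the topic creation stage is untouched.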

Second, Top2Vec takes the words that are closest to the centroid of a cluster, which very nicely creates coherent and interpretable topic representations. BERTopic does this a bit differently: it focuses on the cluster as a whole and tries to model the topic representation from the entire cluster. This allows the topic representations to be a bit more diverse and disregards the notion of centroids. Both methods have their advantages and disadvantages.
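The difference between the two representations can be sketched on toy data. The snippet below is a simplified illustration, not either library's implementation: the 2-D vectors, the tiny corpus, and the function names are all made up, and the scoring follows the class-based TF-IDF idea (term frequency within the cluster, weighted by `log(1 + A / f_t)`, where `A` is the average number of words per cluster and `f_t` is the term's frequency across all clusters).

```python
import math
from collections import Counter

# Toy joint embedding space (made-up 2-D vectors for illustration only).
word_vecs = {
    "goal": (0.9, 0.1),
    "match": (0.8, 0.2),
    "team": (0.7, 0.3),
    "election": (0.1, 0.9),
    "vote": (0.2, 0.8),
}
doc_vecs_in_cluster = [(0.85, 0.15), (0.75, 0.25)]

# Toy tokenized documents, grouped by the cluster they were assigned to.
clusters = {
    "sports": [["team", "goal", "match", "the"], ["team", "team", "goal", "the"]],
    "politics": [["vote", "election", "the"], ["vote", "the", "the", "party"]],
}


def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def centroid_top_words(doc_vectors, word_vectors, k):
    """Centroid-style: rank words by similarity to the mean document vector."""
    dims = len(doc_vectors[0])
    centroid = tuple(sum(v[i] for v in doc_vectors) / len(doc_vectors) for i in range(dims))
    ranked = sorted(word_vectors, key=lambda w: cosine(word_vectors[w], centroid), reverse=True)
    return ranked[:k]


def ctfidf_top_words(clusters, target, k):
    """Cluster-style: rank words by a simplified class-based TF-IDF score."""
    # Treat all documents of a cluster as one long "class document".
    class_counts = {
        name: Counter(tok for doc in docs for tok in doc)
        for name, docs in clusters.items()
    }
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    avg_words = sum(total_counts.values()) / len(class_counts)
    scores = {
        term: tf * math.log(1 + avg_words / total_counts[term])
        for term, tf in class_counts[target].items()
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


print(centroid_top_words(doc_vecs_in_cluster, word_vecs, 2))
print(ctfidf_top_words(clusters, "sports", 2))
```

The centroid ranking depends only on the geometry of the embedding space, whereas the class-based ranking depends only on word counts inside each cluster, so a frequent word such as "the" is pushed down because it appears in every cluster.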

When to use one over the other highly depends on your use case. Without making definitive statements, BERTopic might perform better than Top2Vec on specific types of text, and vice versa. Even then, I would advise testing both! There is no guarantee that one always works better for a certain use case. If Top2Vec trumps BERTopic for your specific use case, then definitely go for Top2Vec.
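One possible way to put numbers on such a comparison is a topic diversity score: the fraction of unique words among the top words of all topics. This is only a sketch of one metric, assuming each model's output has already been reduced to a list of top words per topic; the function name and the two toy outputs are hypothetical, and a score like this would normally be read alongside a coherence measure and manual inspection.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    1.0 means no word is shared between topics; lower values mean the
    topics overlap more.
    """
    top_words = [word for words in topics for word in words[:top_k]]
    if not top_words:
        raise ValueError("topics must contain at least one word")
    return len(set(top_words)) / len(top_words)


# Made-up outputs standing in for two fitted models on the same corpus.
model_a_topics = [["team", "goal", "match"], ["vote", "election", "party"]]
model_b_topics = [["team", "goal", "vote"], ["vote", "election", "team"]]

print(topic_diversity(model_a_topics, top_k=3))
print(topic_diversity(model_b_topics, top_k=3))
```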

Having said that, if there is no difference in performance, then you might go for BERTopic, as it allows you to perform variations of topic modeling, such as dynamic topic modeling and guided topic modeling, which are currently not found in Top2Vec.

Top2Vec, in contrast, has features such as generating word clouds and searching documents by keywords that are not included in BERTopic.

I understand that this is a bit of a non-answer, but hopefully it gives you an idea of the strengths and weaknesses of both models so that you can make the choice. But definitely try out both! It may turn out that BERTopic is horrible for your dataset and Top2Vec or LDA-like models are much better!

@ma-ji
Author

ma-ji commented Dec 16, 2021

This is very informative and educational, thank you so much! I will definitely cite and acknowledge your help in the paper. I also appreciate your prompt response, and I will report my findings back to you once I have them.

On a separate note and from a social scientist's perspective, I would recommend summarizing these suggestions in a paper, which will be very helpful for circulation in the social science community.

ma-ji closed this as completed Dec 16, 2021