BERTopic vs. Top2Vec #372

ma-ji · 2021-12-15T22:55:40Z

Hi @MaartenGr , thanks for making this great package!

I'm working on a multilingual topic modeling project (en+zh) and have surveyed some papers and packages on multilingual TM. I will probably choose between BERTopic and Top2Vec. Would you mind briefly describing the advantages of BERTopic over Top2Vec? My superficial understanding is that the two packages are very similar (and you also mentioned the latter in your blog post). So when should we use BERTopic, and when Top2Vec?

Thank you very much!

MaartenGr · 2021-12-16T07:14:00Z

Thank you for the kind words! There are several differences between BERTopic and Top2Vec that might be interesting to you:

First, the two typically use different embedding models. Top2Vec works exceptionally well with Doc2Vec, as it assumes that the document and word embeddings lie in the same vector space. BERTopic, in contrast, relaxes that assumption by separating the embedding stage from the topic creation stage. Depending on your use case, one might be preferred over the other.

Second, Top2Vec takes the words that are closest to the centroid of a cluster, which very nicely creates coherent and interpretable topic representations. BERTopic takes a different approach: it focuses on the cluster as a whole and models the topic representation from all documents in that cluster. This allows the topic representations to be a bit more diverse and disregards the notion of centroids. Both methods have their advantages and disadvantages.
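
To make that second difference concrete, here is a minimal sketch on toy data. It is not the actual implementation of either library: the 2-D vectors, the cluster contents, and the function names are all made up for illustration, and the document vector is just a mean of word vectors standing in for a learned embedding. The cluster-level weighting assumes the class-based TF-IDF form `tf * log(1 + A / f)`, where `A` is the average number of words per cluster and `f` is the term's frequency across all clusters.

```python
import math
from collections import Counter

# Toy, hypothetical data: two clusters of already-tokenized documents.
clusters = {
    "pets": [["cat", "dog", "pet", "food"], ["dog", "walk", "pet"]],
    "cars": [["car", "engine", "road"], ["car", "road", "trip", "food"]],
}

# Hypothetical 2-D embeddings in a space shared by words and documents.
word_vecs = {
    "cat": (0.9, 0.1), "dog": (0.8, 0.2), "pet": (0.85, 0.15),
    "walk": (0.6, 0.4), "food": (0.5, 0.5), "car": (0.1, 0.9),
    "engine": (0.15, 0.85), "road": (0.2, 0.8), "trip": (0.4, 0.6),
}


def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def doc_vector(tokens):
    """Mean of word vectors; a stand-in for a learned document embedding."""
    n = len(tokens)
    return tuple(sum(word_vecs[t][i] for t in tokens) / n for i in range(2))


def centroid_topic_words(docs, top_n=3):
    """Centroid style: rank the whole vocabulary by similarity to the centroid."""
    vecs = [doc_vector(d) for d in docs]
    centroid = tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(2))
    ranked = sorted(
        word_vecs, key=lambda w: cosine(word_vecs[w], centroid), reverse=True
    )
    return ranked[:top_n]


def ctfidf_topic_words(clusters, top_n=3):
    """Cluster-as-a-whole style: weight each term by its in-cluster frequency,
    discounted by how common the term is across all clusters."""
    counts = {
        name: Counter(t for doc in docs for t in doc)
        for name, docs in clusters.items()
    }
    total = Counter()
    for c in counts.values():
        total.update(c)
    avg_words = sum(total.values()) / len(counts)  # A in the formula above
    result = {}
    for name, c in counts.items():
        scores = {t: tf * math.log(1 + avg_words / total[t]) for t, tf in c.items()}
        result[name] = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return result


print(centroid_topic_words(clusters["pets"]))
print(ctfidf_topic_words(clusters))
```

Note that the centroid variant only works because words and documents share one vector space, while the frequency-based variant never looks at the embeddings once the clusters exist. In this toy setup, the word "food" appears in both clusters and is therefore down-weighted by the cluster-level scoring.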

When to use one over the other highly depends on your use case. Without making definitive statements, BERTopic might outperform Top2Vec on specific types of text, and vice versa. Even then, I would advise testing both! There is no guarantee that one always works better for a certain use case. If Top2Vec trumps BERTopic for your specific use case, then definitely go for Top2Vec.

Having said that, if there is no difference in performance, then you might go for BERTopic, as it allows you to perform variations of topic modeling, such as dynamic topic modeling and guided topic modeling, which are currently not available in Top2Vec.

Top2Vec, in contrast, has features such as generating word clouds and searching documents by keywords that are not included in BERTopic.

I understand that this is a bit of a non-answer but hopefully, it gives you an idea of the strengths and weaknesses of both models for you to make the choice. But definitely try out both! It may turn out that BERTopic is horrible for your dataset and Top2Vec or LDA-like models are much better!

ma-ji · 2021-12-16T15:55:02Z

This is very informative and educational, thank you so much! I will definitely cite and acknowledge your help in the paper. I also appreciate your prompt response! I will report my findings back to you once I have them.

On a separate note and from a social scientist's perspective, I would recommend summarizing these suggestions in a paper, which will be very helpful for circulation in the social science community.

ma-ji closed this as completed Dec 16, 2021
