What if I have too many documents labelled in -1 cluster in bertopic? #1298

Closed
bawa-apaar opened this issue May 25, 2023 · 1 comment

Comments

@bawa-apaar

bawa-apaar commented May 25, 2023

Hi Maarten

Firstly, thank you for this amazing library. I'm generating topics on a multilingual dataset (mainly Russian and English) and reducing the number of topics to 140. After generating the topics, I analyze their quality using the coherence score, as you suggested in this thread. I get a coherence score of around 0.7xx across different hyperparameter settings. After manually checking, most of the topics make sense and are interesting.

My issue is that my dataset contains 196802 posts, and 110550 of them are labeled -1, i.e. no topic is assigned. More than 50% of the posts end up in -1, which is not expected. Can you please suggest what I can do in such a scenario?

My code for the topic model is:

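from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP

# Dimensionality reduction and multilingual sentence embeddings
umap_models = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=101)
sentence_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

# Reduce to 140 topics; min_topic_size sets the minimum cluster size
topic_model = BERTopic(umap_model=umap_models, calculate_probabilities=False,
                       embedding_model=sentence_model,
                       nr_topics=140,
                       min_topic_size=50)

# posts is the list of post texts
topics, _ = topic_model.fit_transform(posts)
topic_model.get_topic_info()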
[Image: Screenshot 2023-05-25 at 11 56 43 AM]

Kindly advise. Thank you!

@MaartenGr
Owner

Fortunately, for the creation of topics and the assignment of documents, it is not necessarily an issue that many documents are assigned to the -1 class, as long as the topics that you get are good. Instead, you can use .reduce_outliers to assign them to a non-outlier topic. It supports a number of different strategies. My advice would be to try a few of them out and see which one works best for you.
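
As a rough sketch of "try a few of them out", the helpers below run `.reduce_outliers` with several strategies and measure how many documents remain in -1. The names `outlier_share` and `compare_outlier_strategies` are made up for this illustration and are not part of BERTopic; the strategy names are assumed to be the ones accepted by your installed version.

```python
from collections import Counter


def outlier_share(topics):
    """Return the fraction of documents assigned to the outlier topic -1."""
    if not topics:
        return 0.0
    return Counter(topics)[-1] / len(topics)


def compare_outlier_strategies(topic_model, docs, topics,
                               strategies=("c-tf-idf", "distributions")):
    """Run reduce_outliers once per strategy and collect the new assignments.

    topic_model is expected to be an already fitted BERTopic model.
    The original `topics` list is left untouched.
    """
    results = {}
    for strategy in strategies:
        new_topics = topic_model.reduce_outliers(docs, topics, strategy=strategy)
        results[strategy] = list(new_topics)
    return results


# Hypothetical usage with the fitted model from the question:
# candidates = compare_outlier_strategies(topic_model, posts, topics)
# for name, new_topics in candidates.items():
#     print(name, outlier_share(new_topics))
# topic_model.update_topics(posts, topics=candidates["c-tf-idf"])
```

Calling `update_topics` with the chosen assignment refreshes the topic representations so that they reflect the reassigned documents. Which strategy gives the most sensible reassignment is something to check manually on a sample of former -1 posts.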

Another option is to use a clustering algorithm that does not produce outliers, like k-Means. I have also heard users say that the dimensionality reduction algorithm PaCMAP tends to reduce the number of outliers created, so that might also be an option.
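
A minimal sketch of the k-Means option, assuming scikit-learn is installed: the clustering model is passed through BERTopic's `hdbscan_model` argument. The function name `build_kmeans_topic_model` and its defaults (140 clusters, seed 101, taken from the question) are chosen for illustration only.

```python
def build_kmeans_topic_model(n_clusters=140, embedding_model=None, umap_model=None):
    """Build a BERTopic model that clusters with k-Means instead of HDBSCAN.

    k-Means assigns every document to a cluster, so no -1 topic is produced.
    The number of topics is fixed by n_clusters.
    """
    # Imported here so the function can be defined without the packages installed.
    from bertopic import BERTopic
    from sklearn.cluster import KMeans

    cluster_model = KMeans(n_clusters=n_clusters, random_state=101)
    return BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=cluster_model,  # any clustering model can be passed here
    )


# Hypothetical usage with the objects from the question:
# topic_model = build_kmeans_topic_model(140, sentence_model, umap_models)
# topics, _ = topic_model.fit_transform(posts)
```

The trade-off is that every post, including noisy ones, is forced into some topic, so the topics may become less clean than with HDBSCAN followed by outlier reduction.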
