What if I have too many documents labelled in -1 cluster in bertopic? #1298

bawa-apaar · 2023-05-25T15:50:00Z

Hi Maarten

Firstly, thank you for this amazing library. I'm generating topics on a multilingual dataset (mainly Russian and English) and reducing the number of topics to 140. After generating the topics, I'm analyzing their quality using the coherence score, as you suggested in this thread. I'm getting a coherence score around 0.7xx across different hyperparameter settings. After manual checking, most of the topics make sense and are interesting.

My issue is that my dataset contains 196802 posts, and 110550 of them are being labeled as -1, i.e. no topic is assigned. More than 50% of the posts end up in -1, which is not expected. Can you please suggest what I can do in such a scenario?

My code for topic model is:

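from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP

# Dimensionality reduction and multilingual embedding model
umap_models = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=101)
sentence_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

# Reduce to 140 topics; clusters need at least 50 documents
topic_model = BERTopic(umap_model=umap_models, calculate_probabilities=False,
                       embedding_model=sentence_model,
                       nr_topics=140,
                       min_topic_size=50)

# posts is the list of documents
topics, _ = topic_model.fit_transform(posts)
topic_model.get_topic_info()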

Kindly advise. Thank you!

MaartenGr · 2023-05-25T16:08:04Z

Fortunately, for the creation of topics and the assignment of documents, it is not necessarily an issue that many documents are assigned to the -1 class, as long as the topics you get are good. Instead, you can use .reduce_outliers to assign them to a non-outlier topic using a variety of methods. There are many different strategies available. My advice would be to try a few of them and see which one works best for you.
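A minimal sketch of that workflow, assuming a fitted `topic_model` together with the `posts` and `topics` from the code in the question. The helper names `outlier_share` and `reassign_outliers` are hypothetical; `reduce_outliers` and `update_topics` are BERTopic methods, and `"c-tf-idf"` is only one of the available `strategy` values (others include `"distributions"` and `"embeddings"`).

```python
def outlier_share(topics):
    """Return the fraction of documents assigned to the outlier topic (-1)."""
    if len(topics) == 0:
        return 0.0
    return sum(1 for t in topics if t == -1) / len(topics)


def reassign_outliers(topic_model, docs, topics, strategy="c-tf-idf"):
    """Reassign -1 documents via reduce_outliers, then refresh the topic words."""
    new_topics = topic_model.reduce_outliers(docs, topics, strategy=strategy)
    # Topic representations still reflect the old assignments until updated.
    topic_model.update_topics(docs, topics=new_topics)
    return new_topics


# Usage (assumes the fitted model and variables from the question):
# new_topics = reassign_outliers(topic_model, posts, topics)
# print(outlier_share(topics), outlier_share(new_topics))
```

Comparing the outlier share before and after for a few strategies, and spot-checking some reassigned posts, is one way to judge which strategy works best on a given dataset.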

Another option is to use a clustering algorithm that does not produce outliers, like k-Means. I have also heard users say that the dimensionality reduction algorithm PaCMAP tends to reduce the number of outliers created, so that might be an option as well.
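As a rough sketch of the k-Means route, assuming scikit-learn is installed: BERTopic accepts a scikit-learn clusterer through its `hdbscan_model` argument, and k-Means assigns every document to a cluster, so no -1 topic is produced. The function name `build_kmeans_topic_model` and its defaults (140 clusters, seed 101, taken from the question) are illustrative choices, not a recommendation.

```python
def build_kmeans_topic_model(n_topics=140,
                             embedding_model="paraphrase-multilingual-mpnet-base-v2",
                             seed=101):
    """Build a BERTopic model that clusters with k-Means instead of HDBSCAN."""
    # Imported lazily so the function can be defined without the libraries loaded.
    from bertopic import BERTopic
    from sklearn.cluster import KMeans

    # k-Means needs the number of clusters up front and never labels outliers.
    cluster_model = KMeans(n_clusters=n_topics, random_state=seed)
    return BERTopic(embedding_model=embedding_model, hdbscan_model=cluster_model)


# Usage (assumes `posts` is the list of documents from the question):
# topic_model = build_kmeans_topic_model()
# topics, _ = topic_model.fit_transform(posts)
```

The trade-off is that documents which fit no topic well are forced into one anyway, so the resulting topics may be noisier than with HDBSCAN.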

MaartenGr closed this as completed Sep 27, 2023
