Addressing UMAP Instability in BERTopic #831
drob-xx started this conversation in Show and tell
Because UMAP is stochastic, we know that its output varies from one run to the next. BERTopic users often ask in the issues queue why different runs of BERTopic produce different output. I call this behavior UMAP’s “instability”, since the ground we build our models on is shaky. But just how much does UMAP vary from run to run, and how should a topic modeler understand, and compensate (or not) for, this instability?
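To make the stochasticity concrete, here is a minimal sketch (random data as a stand-in for real embeddings, parameters matching BERTopic’s UMAP defaults) showing that two runs with identical settings still produce different embeddings:

```python
import numpy as np
from umap import UMAP

# Random vectors as a stand-in for 384-dimensional sentence embeddings.
X = np.random.rand(500, 384)

# Two reductions with identical parameters but no fixed seed.
params = dict(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
emb_a = UMAP(**params).fit_transform(X)
emb_b = UMAP(**params).fit_transform(X)

# The coordinates differ, so the clustering HDBSCAN builds on top of them
# can differ too.
print(np.allclose(emb_a, emb_b))  # almost always False
```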
I’ve run a series of tests and think that what I’ve found might be of interest to others. It feels like there are still a large number of unanswered questions - the main one being: how much does the nature of the data being reduced affect the instability? I don’t have a general answer for that question but what I’ve seen so far is dramatic enough to share preliminary results. I’m hoping others will consider these issues in their own work and share their experiences.
I began by using synthetic data as a stand-in for the 384-feature embeddings BERTopic produces by default. I expected to see a more or less linear drop-off as the number of records increased. I used a progression of 100, 500, 1000, and 2000 records, but the results were somewhat unexpected, with performance from low to high at 100, 1000, 2000, and 500. Because I was unsure how representative of real text the synthetic data might be, I switched to a 2,200-document corpus of BBC news articles segmented into 5 categories.
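For reference, the embedding step for the real data looks roughly like this (a sketch; the BBC corpus isn’t bundled here, so a placeholder list stands in for the article texts):

```python
from sentence_transformers import SentenceTransformer

# Placeholder for the ~2,200 BBC article texts (not included here).
docs = ["Example BBC article text ..."] * 2200

# BERTopic's default English embedding model produces 384-dimensional vectors.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)
print(embeddings.shape)  # (2200, 384)
```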
I ran experiments on sets of 10 and 20 models and compared every run in each set against every other, producing 45 unique pairs for the 10x10 grid and 190 for the 20x20. The results with this data were not dissimilar to those with the synthetic data. I’m still interested in how the amount of data affects UMAP behavior. However, it also seemed relevant to simply look at how one might deal with any amount of variance regardless of the corpus size.
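The pairwise comparison itself is just every unique combination of runs. A sketch, using the adjusted Rand index as a stand-in for the notebook’s own similarity metric and random labels in place of real cluster assignments:

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical stand-in: one array of cluster assignments per UMAP+HDBSCAN run.
rng = np.random.default_rng(0)
run_labels = [rng.integers(0, 5, size=2000) for _ in range(10)]

pairs = list(combinations(range(len(run_labels)), 2))
print(len(pairs))  # 45 unique pairs for 10 runs; 190 for 20 runs

# One agreement score per pair of runs.
scores = {(i, j): adjusted_rand_score(run_labels[i], run_labels[j])
          for i, j in pairs}
```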
So while I don’t have an answer to the bigger question, and I don’t feel I’m in a position to make any general claims about UMAP instability, I do have some experience and opinions about how to deal with this issue. BERTopic users can decide that instability simply doesn’t matter that much and move on. But if you are curious or concerned about how this issue might affect your work, please read on.
Here is a heatmap output from a run of 10 models using 2,000 documents from the BBC set:
I have prepared a notebook and repo that people can use to reproduce my work. A summary of the data shows that the mean similarity is 84%.
The implication is pretty clear. With this dataset, using out-of-the-box parameters, any given run is going to drift, on average, by about 15%. Again, this may not be an issue for you, but if it is, the question is “what can you do about it?” At this point my answer is to optimize HDBSCAN for each run of UMAP. I’ve found that by doing this (at least with this data) I can substantially stabilize the output.
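Here is one way that per-run tuning could look. The parameter grid and the use of HDBSCAN’s relative validity (DBCV) score are my own assumptions for the sketch, not necessarily the notebook’s exact procedure:

```python
import numpy as np
from umap import UMAP
from hdbscan import HDBSCAN

# Stand-in for document embeddings; one fresh UMAP reduction per "run".
X = np.random.rand(2000, 384)
reduced = UMAP(n_neighbors=15, n_components=5,
               min_dist=0.0, metric="cosine").fit_transform(X)

# Grid-search HDBSCAN parameters for this particular reduction, keeping the
# combination with the best relative validity (DBCV) score.
best = None
for mcs in (10, 25, 50, 100):
    for ms in (5, 10, 25):
        clusterer = HDBSCAN(min_cluster_size=mcs, min_samples=ms,
                            gen_min_span_tree=True).fit(reduced)
        score = clusterer.relative_validity_
        if best is None or score > best[0]:
            best = (score, mcs, ms, clusterer)

print("best DBCV:", best[0],
      "min_cluster_size:", best[1], "min_samples:", best[2])
```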
It is worth comparing this approach to another suggestion for dealing with the issue: simply “lock in” a UMAP output by setting the random_state parameter. While it is true that this stabilizes UMAP completely from run to run, all you are doing is accepting that the 15% or so of model disagreement is tolerable and moving on. That’s a reasonable approach, but it isn’t solving the underlying problem.
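For completeness, the lock-in approach is just a fixed seed on the UMAP instance passed to BERTopic (a sketch; `docs` is assumed to hold your corpus):

```python
from umap import UMAP
from bertopic import BERTopic

# Reproducible, but the quirks of this particular embedding are frozen in.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)
topic_model = BERTopic(umap_model=umap_model)
# topics, probs = topic_model.fit_transform(docs)
```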
Another question is “how are BERTopic models practically affected by this issue?” In this case, the answer I came to was “quite a lot” (in my opinion). I chose two of the above models to experiment with further, 4 and 9, because they were on the low end of overall agreement. I ran BERTopic on each and then used reduce_topics to bring the number of topics in line with the provided labels (6: the 5 categories plus the -1 outlier topic). When I ran each model and produced the document embeddings visualization I got:
(document embeddings visualizations for models 4 and 9; see the notebook)
Please take a look at my notebook for a more detailed analysis, but as you can see here, while there is some agreement between the models, there is also a lot of difference. In fact, using the similarity metric, they are only about 55% similar overall.
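For anyone who wants to reproduce the shape of this experiment, the fit-and-reduce step looks roughly like this (a sketch; 20 newsgroups stands in for the BBC corpus, which isn’t bundled here, and the exact reduce_topics signature varies a little across BERTopic versions):

```python
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

# 20 newsgroups as a runnable stand-in for the BBC corpus used in the post.
docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data[:2000]

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Collapse to 6 topics: the 5 labelled categories plus the -1 outlier topic.
topic_model.reduce_topics(docs, nr_topics=6)

# fig = topic_model.visualize_documents(docs)  # plots like those referenced above
```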
After tuning, I was able to get these two models to 99% agreement, with 9 documents uncategorized, and the visualization looks like this:
I encourage you to check out the notebook I’ve prepared and would love to answer any questions and hear other people’s thoughts.