Revisiting HDBSCAN tuning and topic clustering #582

Open
drob-xx opened this issue Jun 23, 2022 · 25 comments

@drob-xx

drob-xx commented Jun 23, 2022

This is a long post, and I apologize in advance. I also want to make clear that none of this should be read as criticism of BERTopic or the choices about where the product has focused its attention. In fact, after a year of using the software and having taken a deep dive into the alternatives, I'm convinced that BERTopic is a unique project and the only end-to-end topic modeling package available. In particular, as I've mentioned before, the approach of clustering embeddings combined with c-TF-IDF is really unique and powerful. That being said, I think it is worth going back and revisiting some of the assumptions about the clustering and outliers.

I have poked around these issues in #556, where I commented about the one-way/two-way relationship between the embedded data and the actual topic model. This may sound relatively abstract - but we see it time and time again when users ask how to reduce -1 categorizations and how to get better individual document assignments from a fit model. To inform this issue I've prepared a notebook and data that I hope clearly step through a case that is representative and compelling.

https://github.com/drob-xx/Tune_BERTopic_HDBSCAN

I have tried to make this as lightweight and usable as possible. I hope people have time to take a look.

Essentially my thesis is this:

  • Without explicitly exploring and setting at least min_samples and min_cluster_size, HDBSCAN tuning is a complete crapshoot. Virtually any selection made without testing will result in extremely sub-optimal settings for any dataset (see the sketch after this list).
  • While many datasets should be thought of as having many 'outliers', there is also an argument that there are no 'outliers', just datasets that fall on a spectrum from homogeneous, cohesive, and coherent to heterogeneous, incoherent, and chaotic.
  • From what I'm seeing, assuming a reasonably coherent dataset - in my example above, English-language general news articles of similar length from a small number of reputable, professional news outlets - the number of 'outliers' should be minimal and the overall coherence quite high. While I realize this may not always be the case, I would argue that if topic modeling is a valid pursuit, these are the kind of results we want.
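
For concreteness, here is the sketch mentioned above - the two HDBSCAN parameters at issue (the values are arbitrary, and reduced_embeddings is assumed to hold the UMAP output):

from hdbscan import HDBSCAN

clusterer = HDBSCAN(min_cluster_size=100,  # smallest grouping still treated as a cluster
                    min_samples=25)        # higher = more conservative density estimate, more outliers
labels = clusterer.fit_predict(reduced_embeddings)  # label -1 marks the 'outlier' documents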

I think I understand the overall approach BERTopic is taking, and it is entirely reasonable and in many cases more than sufficient. If I were to summarize, I would say that BERTopic is optimized for a good out-of-the-box experience, and the clustering only has to be 'good enough' to get a reasonable number of cluster members on top of which c-TF-IDF can run and produce a good set of topic vocabularies. In this view, the job is to get good-enough clusters. However, there are downsides to this approach (though of course software is always about trade-offs).

I've spent a good amount of time in this area. I think it might be possible to build a lightweight configurator/optimizer which would help users get a better out-of-the-box experience, and which would certainly help users wanting to reduce outliers, fit their data better, and more closely associate the model with their data. It is entirely possible that I've missed something, misunderstand what is going on, or am simply off on a tangent. I'm very open to constructive criticism, as my only interest is in becoming a better practitioner.

I will close with a couple of charts which illustrate just how difficult it is to select reasonable parameters for a given dataset without testing. These charts are relatively dense for those unfamiliar with the format, but I think they are worth paying attention to. I'm happy to explain anything here or answer questions.

This first chart shows the result of running HDBSCAN against the UMAP reductions for the data in my example. All the BERTopic and model settings are default; the only value changing is min_cluster_size. This is the result of 62 randomized values between 5 and 600.

We can see what I call 'natural' cluster bands: one of under 10 clusters, a group between 70 and 160, and one at around 280. We can also see that there is real variation in the number of outliers created.

[W&B chart, 6/23/2022 12:35 PM]

Zooming in on the bottom grouping we see:

[W&B chart, 6/23/2022 12:42 PM]

There is no clear relationship between min_cluster_size and the actual number of clusters identified. While in general larger values produce fewer clusters, there are lots of cases where this is not true - and there is no way to know without running a bunch of tests.

Adding min_samples complicates the picture further. This chart shows min_samples (expressed here as a percentage of min_cluster_size) having a significant, and difficult to predict, effect on the output:

[W&B chart, 6/23/2022 12:48 PM]

All of this may seem overwhelming. Also, from what I've learned, the nature of topic modeling is quite perplexing. On the one hand, it sits at the crux of machine learning and human cognition in a way that is very visceral. There is something about categorizing documents into human-understandable topics that seems to me to be very powerful. On the other hand, when one dives deeply into older technologies like LDA, it quickly becomes apparent that while useful, the techniques are very flawed. Within the industry, as far as I can tell, topic modeling is somewhat niche. The lack of ground-truth measures and the difficulty of easily measuring performance have understandably dampened interest. It seems to me that modeling based on embeddings is much, much more powerful than the older approaches and deserves more attention. Please see the scatter plots in my notebook showing how logically documents are placed in relation to one another based on the embeddings. The trick is getting the clustering algorithm to 'see' the patterns, and as I point out in my notebook, it is doable, but requires some work.

I'm just getting started in this area and hope that this is interesting and not a digression.

@MaartenGr
Owner

Awesome, thank you for sharing this! It really is great that you went out of your way to research this in detail. I think it is an excellent starting point for this discussion, and we'll see where we end up.

Before I go into detail about what you posted, I think it is important to give you a general idea of where I stand on this issue. As I see it, there are three paths you can take. First, as you mentioned, finding a lightweight solution for optimizing HDBSCAN. One of the things that makes this difficult, I think, is performance, as HDBSCAN can take some time on larger datasets.

Second is separating the creation of topics from assigning them to documents. We can argue that we do not need all documents to create accurate topic representations. At some point, wherever that is, we have enough documents to describe a topic. After describing the topic, we can focus on assigning the right topics to the right documents. However, as we have discussed before, this comes with various problems, as techniques like calculate_probabilities=True are quite slow.

Third, which is what has been done thus far in development, is allowing for modularity in the usage of sub-models. Although HDBSCAN works quite well for finding clusters, other clustering algorithms can be used that might better represent the data and the assumptions that you have. This modularity also means that if a model is released that generally outperforms HDBSCAN, we can simply swap it in. As a result, and this relates to the first path, I have lately focused on not integrating HDBSCAN too tightly into BERTopic in order to preserve its modularity.

Having said that, and as you mentioned, HDBSCAN performs so well that it would be a waste not to tune it, especially if tuning can have a significant effect on the results.

Before I go a bit deeper into what you posted and the code you shared, I have two quick questions:

  • You mention that the dataset contains 30k documents, but News0.csv seems to be 591 documents long and News1.csv 3310 documents long. Did I make a mistake somewhere?
  • Would you mind updating the Google Colab example to contain a UMAP model with random_state=42 (see the sketch below)? That way, we keep the output fixed, which makes it a bit easier to talk about the results.
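
For reference, fixing the seed might look something like this (a minimal sketch; the non-seed values here are, if I recall correctly, BERTopic's defaults):

from umap import UMAP
from bertopic import BERTopic

# A fixed random_state makes the UMAP reduction deterministic, so successive
# runs (and our discussion of them) refer to the same output
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric='cosine', random_state=42)
topic_model = BERTopic(umap_model=umap_model)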

Essentially my thesis is this:

  • Without explicitly exploring and setting at least min_samples and min_cluster_size, HDBSCAN tuning is a complete crapshoot. Virtually any selection made without testing will result in extremely sub-optimal settings for any dataset.
  • While many datasets should be thought of as having many 'outliers', there is also an argument that there are no 'outliers', just datasets that fall on a spectrum from homogeneous, cohesive, and coherent to heterogeneous, incoherent, and chaotic.
  • From what I'm seeing, assuming a reasonably coherent dataset - in my example above, English-language general news articles of similar length from a small number of reputable, professional news outlets - the number of 'outliers' should be minimal and the overall coherence quite high. While I realize this may not always be the case, I would argue that if topic modeling is a valid pursuit, these are the kind of results we want.

I think I understand the overall approach BERTopic is taking, and it is entirely reasonable and in many cases more than sufficient. If I were to summarize, I would say that BERTopic is optimized for a good out-of-the-box experience, and the clustering only has to be 'good enough' to get a reasonable number of cluster members on top of which c-TF-IDF can run and produce a good set of topic vocabularies. In this view, the job is to get good-enough clusters. However, there are downsides to this approach (though of course software is always about trade-offs).

Agreed. BERTopic was definitely optimized for a good out-of-the-box experience, in order to get a good first representation of the topics. In many cases, for example with large datasets, some fine-tuning is necessary to get the output that you are looking for. And as you mentioned, it is in that fine-tuning where the main difficulty lies. It would be great to have a near-perfect out-of-the-box experience without hindering the user experience!

In my experience, and I might be wrong here, topic modeling is often used in fields where its users are not primarily programmers or do not have much experience with Python. As a result, we want to optimize for as little tuning as necessary to get good results. Power users can make use of the modularity to fine-tune where necessary, but having clear steps available for fine-tuning would be nice.

I've spent a good amount of time in this area. I think it might be possible to build a lightweight configurator/optimizer which would help users get a better out-of-the-box experience, and which would certainly help users wanting to reduce outliers, fit their data better, and more closely associate the model with their data. It is entirely possible that I've missed something, misunderstand what is going on, or am simply off on a tangent. I'm very open to constructive criticism, as my only interest is in becoming a better practitioner.

To me, for this to succeed, being lightweight is a necessity. During development, there have been features that were difficult to implement due to performance constraints. Large datasets (in the millions) make it difficult to keep such a solution lightweight - but then again, I might just be pessimistic. Having said that, I do think you are on the right track, and I am curious what this would look like in practice.

Zooming in on the bottom grouping we see:

[W&B chart, 6/23/2022 12:42 PM]

There is no clear relationship between min_cluster_size and the actual number of clusters identified. While in general larger values produce fewer clusters, there are lots of cases where this is not true - and there is no way to know without running a bunch of tests.

I might be mistaken here, but it seems that in the visualization, the higher the min_cluster_size, the number of clusters identified either decreases or stays the same. That is quite a strong relationship, right?

All of this may seem overwhelming. Also, from what I've learned, the nature of topic modeling is quite perplexing. On the one hand, it sits at the crux of machine learning and human cognition in a way that is very visceral. There is something about categorizing documents into human-understandable topics that seems to me to be very powerful. On the other hand, when one dives deeply into older technologies like LDA, it quickly becomes apparent that while useful, the techniques are very flawed. Within the industry, as far as I can tell, topic modeling is somewhat niche. The lack of ground-truth measures and the difficulty of easily measuring performance have understandably dampened interest. It seems to me that modeling based on embeddings is much, much more powerful than the older approaches and deserves more attention. Please see the scatter plots in my notebook showing how logically documents are placed in relation to one another based on the embeddings. The trick is getting the clustering algorithm to 'see' the patterns, and as I point out in my notebook, it is doable, but requires some work.

Agreed - this makes the usage of topic modeling quite difficult, as choosing these evaluation metrics itself requires significant domain knowledge. Making these processes easier would help lower the barrier to entry for this kind of technique.

Quite a lot of text but I do think this is a very worthwhile investigation, especially with the amount of research you have already done!

@drob-xx
Author

drob-xx commented Jun 24, 2022

Really happy that you think this is a worthwhile conversation.

First, as you mentioned, finding a lightweight solution for optimizing HDBSCAN

Yes, it must be lightweight if it is ever to be something that the more casual user can handle. I see the issue here not so much as running the experiments, but interpreting them and turning that information into actual parameters. I've started playing with sklearn's model_selection tools - they provide an interface for defining and running experiments, and basic data structures for saving and retrieving the results. One of the reasons I posted this was to gauge the level of interest and decide whether I wanted to go down the sklearn path or not. I'm not entirely convinced that one needs sklearn or similar to do this - but now I'll try to figure that out. If there are other tools, I'm very interested to hear about them.
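
As a rough illustration of what I mean by defining and running experiments (a sketch only - umap_embeddings is assumed to hold the reduced vectors, and the ranges mirror the charts above):

from sklearn.model_selection import ParameterSampler
from hdbscan import HDBSCAN

param_grid = {
    'min_cluster_size': list(range(5, 600)),
    'min_samples_pct': [0.1, 0.25, 0.5, 0.75, 1.0],
}
results = []
for params in ParameterSampler(param_grid, n_iter=62, random_state=42):
    mcs = params['min_cluster_size']
    ms = max(1, int(mcs * params['min_samples_pct']))  # min_samples as a % of min_cluster_size
    labels = HDBSCAN(min_cluster_size=mcs, min_samples=ms).fit_predict(umap_embeddings)
    results.append((mcs, ms, labels.max() + 1, int((labels == -1).sum())))  # clusters, outliers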

Second is separating the creation of topics from assigning them to documents. We can argue that we do not need all documents to create accurate topic representations.

Yes, this is what I've gathered from your overall comments and approach. It also seems entirely valid to me if the goal is to create the topic model and move forward, as opposed to going back and forth. There is some threshold beyond which the value of including more documents in the cluster is minimal. Having said that, if the relative cost of getting a better set of representative documents is low, then why not? I think it is also important to keep in mind the flip side - we don't know what we don't know. In other words, what my initial investigations seem to show is that fairly minor changes in HDBSCAN's parameters can mean losing an entire set of documents that the user may not even know existed. Specifically, I've seen cases where settings will differentiate between 7 and 8 clusters. Depending on the circumstances, you could have one 7-cluster setting which integrates the 8th cluster into one of the 7, or you could have settings which exclude that cluster entirely. There is no way to know by just looking at the numbers. Having said all this, I totally agree that there should be some 'basic' ways to help the low-investment user avoid disaster while not overburdening them or confusing the issue unnecessarily.

Third, which is what has been done thus far in development, is allowing for modularity in the usage of sub-models.

Yes, fully understood and agreed. I've started with HDBSCAN because it is the default that 'ships' with the product and, from what I can tell, does a commendable job. In the short term, my goal is to make sure I understand the problem space well enough to figure out the parameters of a reasonable, ideally generalizable, solution. Long term, anything that works for a lot of use cases will have to parallel the modular approach that you have already established.

You mention that the dataset contains 30k documents, but News0.csv seems to be 591 documents long, and News1.csv is 3310 documents long. Did I make a mistake somewhere?

Not sure exactly what happened here, although I suspect it was an issue with the indices on the DataFrames I was using. At any rate, I've removed those files from the repository (size issues) and put up versions on Kaggle. There are now links to the files in the notebook. Please let me know if you have any other issues - I think that taking the time to run the notebook will yield fruit.

Would you mind updating the Google Colab example to contain a UMAP model with random_state=42? That way, we keep the output fixed which makes it a bit easier to talk about the results.

Done. Let me know if there is a problem with the way I did it or if you have any other changes that should be made.

I might be mistaken here, but it seems that in the visualization, the higher the min_cluster_size, the number of clusters identified either decreases or stays the same. That is quite a strong relationship, right?

Overall that is correct; however, the devil is in the details. Here is a zoom into part of the first dataset:

[W&B chart, 6/24/2022 9:07 AM]

It is certainly the case from this graph that, for this data, as min_cluster_size increases the number of clusters decreases. With this information a casual user might just say, "OK, I'll try 150 and 200 and see what happens." I don't have the precise output, but from this graph we can surmise that 150 will get them 7 or 6 clusters and 200 will likely get them 4 clusters. They will get two results (7 or 6, and 4), but they won't know that two other options exist. Of course, they can then run the models and compare outputs (although in actuality we don't have very good tools for really comparing the quality of these three different models).

Also, what might not be apparent is: why 7 or 6? I know this dataset pretty well now, and if I'm not mistaken the 6-cluster model discards an entire cluster, which is why it has more outliers than the 7-cluster model. Also, with those three cluster sizes in hand, the user will not know that there is a very interesting setting of 172 which provides 5 clusters and the fewest total outliers. Yet if they stumbled on 177, they would get 5 clusters and 850 outliers (which I'm pretty sure is a very different configuration than the other 5-cluster option).

Lastly, and importantly, if you take a look at the second series, it integrates min_samples, which makes things much less straightforward while holding out the promise of a level of tuning that is very powerful. We can certainly say that 'most casual users' will suffice with the default, which is to set min_samples to 100% of min_cluster_size - however, this leaves a lot of interesting permutations off the table. Easily providing the flexibility to use these parameters more fully, without totally confusing or frustrating the user, is the task at hand. When I started this line of inquiry I was thinking, 'OK - if I think my data should have 7 topics, is there a more efficient, accurate way of doing this than using BERTopic.reduce_topics()?'

I will continue to trample the grass down this particular path, and we'll see what happens. In closing, I'd like to return to one of my previous requests - a discussion forum. I fully understand and appreciate your reasoning around not starting one. It may well turn into a distraction and make it more difficult to manage the project. However, I think there may be opportunities that are missed without one. One of the things needed to make any generalizable solution a reality is more use cases and datasets (and brains working on the problem). While it is true that anyone can come to this issue, read, and comment, I think few people will ever know it exists. I myself try to follow the new issues for BERTopic and often search through the issues for answers to my questions. However, I know I miss a lot, and I'm sure very few people go through these in detail. Even among those who do, fewer still will understand this as a dialog about the subject writ large that they should freely participate in - that's just the nature of 'issues' vs. 'discussions'. Lastly, I've been hard-pressed in the last year to find a community of active topic modelers to cluster into. There is a new NLP-based Discord, but traffic is scant and the number of topic modelers minuscule. The web needs a hangout for topic modelers, and I think that topic modeling with transformers / clustering / c-TF-IDF may well be the future and that BERTopic is leading the way. A discussion forum may be a way to create some critical mass.

@MaartenGr
Owner

@drob-xx Apologies for the late reply!

Also, what might not be apparent is: why 7 or 6? I know this dataset pretty well now, and if I'm not mistaken the 6-cluster model discards an entire cluster, which is why it has more outliers than the 7-cluster model. Also, with those three cluster sizes in hand, the user will not know that there is a very interesting setting of 172 which provides 5 clusters and the fewest total outliers. Yet if they stumbled on 177, they would get 5 clusters and 850 outliers (which I'm pretty sure is a very different configuration than the other 5-cluster option).

I can imagine the difficulty is also which level of analysis to look at. In practice, the reason why typically lies at the mathematical level: how exactly does HDBSCAN do its clustering, and what does its input look like? Unfortunately, it does not always translate that easily from data to output. It would definitely be nice, though, if there was some more intuition to this for many of the users. I think this might also be a reason why many use k-Means instead - it's quite a bit more straightforward.

Lastly, and importantly, if you take a look at the second series, it integrates min_samples, which makes things much less straightforward while holding out the promise of a level of tuning that is very powerful. We can certainly say that 'most casual users' will suffice with the default, which is to set min_samples to 100% of min_cluster_size - however, this leaves a lot of interesting permutations off the table. Easily providing the flexibility to use these parameters more fully, without totally confusing or frustrating the user, is the task at hand. When I started this line of inquiry I was thinking, 'OK - if I think my data should have 7 topics, is there a more efficient, accurate way of doing this than using BERTopic.reduce_topics()?'

Yes, I very much agree with this. The out-of-the-box experience should be good enough, but making it easier for users to optimize parameters would be nice. Having said that, I would expect optimization to also be necessary for non-HDBSCAN models, so some generalization in the procedure would be worthwhile.

I will continue to trample the grass down this particular path, and we'll see what happens. In closing, I'd like to return to one of my previous requests - a discussion forum. I fully understand and appreciate your reasoning around not starting one. It may well turn into a distraction and make it more difficult to manage the project. However, I think there may be opportunities that are missed without one. One of the things needed to make any generalizable solution a reality is more use cases and datasets (and brains working on the problem). While it is true that anyone can come to this issue, read, and comment, I think few people will ever know it exists. I myself try to follow the new issues for BERTopic and often search through the issues for answers to my questions. However, I know I miss a lot, and I'm sure very few people go through these in detail. Even among those who do, fewer still will understand this as a dialog about the subject writ large that they should freely participate in - that's just the nature of 'issues' vs. 'discussions'. Lastly, I've been hard-pressed in the last year to find a community of active topic modelers to cluster into. There is a new NLP-based Discord, but traffic is scant and the number of topic modelers minuscule. The web needs a hangout for topic modelers, and I think that topic modeling with transformers / clustering / c-TF-IDF may well be the future and that BERTopic is leading the way. A discussion forum may be a way to create some critical mass.

You do make a convincing argument with respect to the discussions page! As you mentioned, it does not take away my initial worries about the time that would also need to be spent going through the discussions, but it might be worthwhile to at least open it up and try it out. If it is used in the way that you imagine, then it was a worthwhile exercise, and if not, we can just as easily close it up again. It would be great if there was a place to focus a bit more on discussing use cases, demos, parameter tuning, etc.

@drob-xx
Author

drob-xx commented Jul 5, 2022

Apologies for the late reply!

@MaartenGr Not an issue :)

Having said that, I would expect optimization to also be necessary for non-HDBSCAN models, so some generalization in the procedure would be worthwhile.

Haha. First things first - let me get my head around HDBSCAN (if that is possible). But I hear you that it might be necessary to provide a solution for (every?) other clustering mechanism.

I've taken a first pass at a proof of concept.

This notebook generates a BERTopic model, then runs HDBSCAN 60 times using randomized parameters. There is a graph to show that output, then a couple of steps to choose some models to visualize. Let me know if you have time to look at this - I think it is pretty self-explanatory, but I'm happy to go over it in detail if you care to.

  • It is multi-step but pretty straightforward
  • There are no dependencies other than plotly - and even that isn't strictly necessary
  • The only 'trick' is getting the original BERT embeddings to create a nice TSNE viz of the embeddings data.

At this point I would be interested in working on this as a possible feature for BERTopic (or utility tool). I totally understand if it is out of scope for you however. If you don't think that's a good direction right now please let me know so I can figure out what I'm doing next :)

it might be worthwhile to at least open it up and try it out. If it is used in the way that you imagine, then it was a worthwhile exercise, and if not, we can just as easily close it up again. It would be great if there was a place to focus a bit more on discussing use cases, demos, parameter tuning, etc.

Very cool - thanks for being willing to give it a try - we'll see what happens....

@cedivad

cedivad commented Jul 6, 2022

HDBSCAN is important – but don't forget UMAP! I'm still trying to optimise my parameters. I tried about 15 runs with different parameters yesterday and none was particularly successful, so I'm moving on to playing with UMAP – and I found this nice page:

https://pair-code.github.io/understanding-umap/

Btw – if you're running a lot of intensive runs – I've found it practical to use cuML UMAP for the reduction, save that model, and then run the clustering many times on multiple CPU servers. Don't run the entire BERTopic fit, though; just run topic_model_gpu._cluster_embeddings(umap_embeddings, documents). In my case this saves about 80% of the time.
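
Roughly, the split looks like this (a sketch: embeddings, documents, and topic_model_gpu are assumed to already exist, and _cluster_embeddings is a private BERTopic method, so its exact signature may change between versions):

import numpy as np
from cuml.manifold import UMAP as cuUMAP

# On the GPU machine: reduce once and save the result
umap_embeddings = cuUMAP(n_neighbors=2000, n_components=5).fit_transform(embeddings)
np.save('umap_embeddings.npy', umap_embeddings)

# On each CPU server: sweep only the clustering step, never re-running UMAP
umap_embeddings = np.load('umap_embeddings.npy')
documents_df, probs = topic_model_gpu._cluster_embeddings(umap_embeddings, documents)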

@drob-xx
Author

drob-xx commented Jul 6, 2022

@cedivad I'm collecting use-cases to better understand the breadth of this issue. Would you be willing to share some of your project details so I can get a better idea of what is needed?

@cedivad

cedivad commented Jul 6, 2022

I'm happy to share anything you need, but I don't think it will be of much use. I'm working with a heterogeneous collection of ~300M threads from narkive.com.

My latest insight is that I need n_neighbors to be a significant percentage of my UMAP training size (it was previously set to just 150 on a 4M dataset and is now 2000 on a 400k dataset). I'm following your insight above and trying to generate as many runs as possible to see if I can find the correct metaparameters for my job.

I could share my scripts? If the size of your collection is big enough, the speedup over a raw BERTopic call is significant, and iterating quickly matters while we are learning. But I'm guessing simplicity is king on smaller datasets.

@drob-xx
Author

drob-xx commented Jul 6, 2022

@cedivad If it's ok with you I've taken the liberty of moving this convo to #600

@drob-xx
Author

drob-xx commented Jul 16, 2022

@MaartenGr bump

@MaartenGr
Owner

@drob-xx Apologies for the late reply!

I've taken a first pass at a proof of concept.

Thanks for working on this! Optimization is still a tricky subject, and it definitely would be nice to have some optimizer function that can handle this. I do think the development effort should not be underestimated. Since the core of BERTopic is focused on modularity, we want something like this to be extensible to any other model. As a result, the scope grows rather quickly: support for non-HDBSCAN models, optimization of dimensionality reduction models, efficient scalability (optimization for millions of documents), a scikit-learn implementation (e.g., GridSearch), etc.

I believe this issue becomes even more pronounced when looking at the core question: what are we trying to evaluate? If we are focused on optimizing HDBSCAN, then that inhibits the focus on modularity. If we are not focused on optimizing HDBSCAN, then that opens up a slew of evaluation metrics that would have to be implemented.

Having said all that, if we could condense what you did in the notebook into a very minimal function, then I think it would be worthwhile to add it to the Tips & Tricks. That way, it is not officially supported in BERTopic, but users can still do some optimization. That prevents opening up the necessity to support each and every model, evaluation metric, use case, etc.

@drob-xx
Author

drob-xx commented Jul 18, 2022

@MaartenGr I think I'm in sync with your observations, concerns, and approach. At this point I think it would be prudent to limit the scope to tuning HDBSCAN only. My main interest right now is in optimizing the clustering to achieve the best results possible. What I think will work is an optimizer class along the lines of the proof of concept, which lets the user submit their model, experiment with different HDBSCAN settings, and choose the settings which work best for their data / use case.

I would like to put together an alpha next to get your (and hopefully others') feedback. I don't think this will work as a bunch of snippets - I'm pretty sure it should be a cohesive set of functions (a class) and would have to reside on GitHub somewhere. At this point I'm using TSNE to create the 2D embeddings for visualizing the clustering, but of course TSNE is relatively slow. I like it because it produces a more balanced, symmetrical projection which I think is easier for less technical users to understand. However, there is no reason I can't use UMAP.

Any thoughts/preferences/questions before I take a pass at a ver .1?

@MaartenGr
Owner

At this point I'm using TSNE to create the 2D embeddings for visualizing the clustering, but of course TSNE is relatively slow. I like it because it produces a more balanced, symmetrical projection which I think is easier for less technical users to understand. However, there is no reason I can't use UMAP.

I think it might be best to match the 2D visualization's dimensionality reduction algorithm with the one used in optimization. For example, if you are using PCA in BERTopic, then using PCA for the 2D reduction will give you more accurate information than a different algorithm. If you were using t-SNE for the 2D visualization whilst BERTopic is using PCA, then I am not sure how informative the 2D visualization is when it comes to optimizing PCA.

Any thoughts/preferences/questions before I take a pass at a ver .1?

I am not yet sure where it should land within BERTopic. I think it might be best to approach it as a separate optimization library first, seeing as it would otherwise open up too many evaluation techniques that would need to be implemented in BERTopic.

Also, now that I think about it, would it make sense to actually use sklearn.model_selection.GridSearchCV as a base for this? It might make things much easier and cleaner, as it would allow you to define custom scoring functions.
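
Something along these lines, perhaps (a rough sketch: clustering has no held-out labels, so cv is reduced to a single degenerate split, and the custom scorer uses HDBSCAN's relative_validity_, a DBCV approximation available when gen_min_span_tree=True; umap_embeddings is assumed precomputed):

import numpy as np
from sklearn.model_selection import GridSearchCV
from hdbscan import HDBSCAN

def dbcv_score(estimator, X, y=None):
    # Custom scoring function: higher relative validity = better clustering
    return estimator.relative_validity_

idx = np.arange(len(umap_embeddings))
search = GridSearchCV(
    HDBSCAN(gen_min_span_tree=True),
    param_grid={'min_cluster_size': [50, 100, 150, 200],
                'min_samples': [5, 10, 25, 50]},
    scoring=dbcv_score,
    cv=[(idx, idx)],  # 'train' and 'test' on the same data - no real CV for clustering
)
search.fit(umap_embeddings)
print(search.best_params_)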

@drob-xx
Author

drob-xx commented Jul 20, 2022

All good. Interesting point on the relation between the visualization reduction and the reduction used to pre-process for clustering. Basically, I think each of these points could be a long conversation.

I hear you on using GridSearchCV - I have more to say on that topic at an appropriate time. I'm not using it right now, but there is no reason it couldn't be slotted in without much (if any) disruption.

Overall, based on what I understand your approach and concerns to be, I think the best thing is for me to produce some more code so that we have something more concrete to look at. In terms of:

I am not yet sure where it should land within BERTopic. I think it might be best to approach it as a separate optimization library first, seeing as it would otherwise open up too many evaluation techniques that would need to be implemented in BERTopic.

Totally understood. That is a call for you to make down the road. Right now it is enough for me to continue to know that you will take a look and provide feedback as this progresses. I should have something more within a week or two.

@drob-xx
Author

drob-xx commented Jul 23, 2022

@MaartenGr I have completed a very preliminary version of 'TopicTuner' - it is a thin wrapper around a BERTopic instance. You can see everything here:

https://github.com/drob-xx/TopicTuner

And there is a notebook that steps through the features. This pass has all the functionality that I currently think should be included, and it works on the 20 Newsgroups docs. There's no reason to think it wouldn't work with any BERTopic instance. It is a bit fragile, so don't bang on it too hard yet. The code itself needs quite a bit of work, but I would like to pin down the functionality before continuing to work on it. Let me know what you think.

@dimitry12

This is a very insightful conversation. Thank you @drob-xx for starting it.

Second is separating the creation of topics from assigning them to documents. We can argue that we do not need all documents to create accurate topic representations. At some point, wherever that is, we have enough documents to describe a topic. After describing the topic, we can focus on assigning the right topics to the right documents. However, as we have discussed before, this comes with various problems, as techniques like calculate_probabilities=True are quite slow.

This is where BERTopic really clicked for me, thank you @MaartenGr. If the goal of BERTopic is to produce topic descriptions, then HDBSCAN marking lots of documents as outliers isn't a problem. HDBSCAN's labels are not the goal/output, only c-TF-IDF is.

Yet, in practice we need to assign documents to topics: either the same documents as in the "training" set or new unseen documents. We have a choice:

  1. HDBSCAN's labels. But these aren't good, and were never intended to be good, for this task. In my opinion, the goal of HDBSCAN tuning is getting to a high/granular "natural" count of topics, not minimizing outliers.
  2. HDBSCAN's membership_vectors (aka topic-document probabilities table), which is widely used by this community.
  3. c-TF-IDF itself.

What about c-TF-IDF vs. membership_vectors for assigning documents to topics? I haven't evaluated this yet, mostly because I am doing topic modeling on an image dataset, working with CLIP embeddings instead of BERT embeddings. I tend to think membership_vectors are better because they account for HDBSCAN's exemplars (from which c-TF-IDF is ultimately derived) as well as extra data (the tree) which isn't reflected in c-TF-IDF.
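
For reference, option 2 in code (a minimal sketch: umap_embeddings is assumed, and the clusterer must be fit with prediction_data=True for the soft-membership call to work):

import numpy as np
import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=100,
                            prediction_data=True).fit(umap_embeddings)
# One row per document, one column per cluster: soft topic memberships
soft_memberships = hdbscan.all_points_membership_vectors(clusterer)
# Assign every document (outliers included) to its strongest topic
best_topic = np.argmax(soft_memberships, axis=1)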

@drob-xx
Author

drob-xx commented Jul 28, 2022

@dimitry12 Thanks for joining the conversation. I think we should be careful to draw a fine distinction between what I'm suggesting in this thread - "let's get the best HDBSCAN results we can, because that will necessarily produce better topic models" - and "how do we classify documents within topic categories", which is what I believe you are getting at. I'm not saying there is a problem with discussing these two issues - they are related and important - I just want to flag that they are different issues.

I haven't used membership_vectors very much because they are computationally expensive, so I can't speak directly to their accuracy vis-à-vis document-topic assignments. However, I've played around with a myriad of different schemes to take the product of the topic model - both the vocabularies as well as summarized vector representations of the topics - and I have found them to be quite imprecise. In other words, they seem to generally catch most of the documents but are quite poor around the edges. My current assumption is that there isn't a good way of 'reverse-engineering' the c-TF-IDF output to achieve "better" topic assignments. I'm curious to hear what @MaartenGr has to say (as usual :) ).

@drob-xx
Author

drob-xx commented Aug 26, 2022

@MaartenGr I'm revisiting this, although it seems you are not very interested in exploring this particular path. I've created a basic class that wraps BERTopic and allows for quickly evaluating HDBSCAN parameters. I used it to generate a default BERTopic model and then tuned the model using my technique down to 10 topics. I then compared the output with a base BERTopic model with nr_topics=10. The results are below.

As I've acknowledged previously, there is a good argument for doing reductions the way BERTopic currently does them. However, I think the results below speak for themselves: there are advantages, and use cases, which would greatly benefit from an alternative - a reliable, fast, and intuitive way to tune HDBSCAN to produce an optimized result. I think my approach does that, although it still needs work. Let me know what you think.

This is a TSNE projection of a BERTopic nr_topics=10 version of the 20 Newsgroups dataset:

[screenshot]

And again with -1 docs removed:

[screenshot]

And here is a 'tuned' 10 topic projection:

[screenshot]

And with the (small number) of -1 docs removed:

[screenshot]

The documents by topic are:

BERTopic nr_topics=10

-1    11979
 0     1882
 1      776
 2      684
 3      486
 4      462
 5      459
 6      425
 7      418
 8      405
 9      355

And in the tuned model:

5    5989
-1    2048
 0    1834
 4    1814
 7    1664
 9    1481
 2     916
 3     792
 1     744
 6     627

@MaartenGr
Owner

I'm revisiting this, although it seems you are not very interested in exploring this particular path.

My apologies for this late reply. I can understand that my absence in this thread has been read as disinterest; do note that this is not the case. Due to personal reasons, I have less time than I used to for diving deep into these kinds of features. Hopefully, I will have more time in the future to go through the notebook and code you provided, but I cannot be sure when that will be.

@drob-xx
Author

drob-xx commented Aug 30, 2022

@MaartenGr Thanks for the response. I understand that your time is limited. There always seem to be a lot of questions about how to reduce the number of uncategorized HDBSCAN results. Besides greatly improving that, this approach is also very helpful in determining the number of topics for a given corpus. The intent of the code I've produced is to encapsulate quite a bit of functionality and complexity to make the process as simple as possible.

However, there is one area that is problematic due to the architecture of BERTopic. As you are well aware, each run of UMAP will produce slightly different results. I've been drilling down on just how different they are, but the bottom line is that a set of HDBSCAN parameters optimized for one UMAP run will approximate, but not match optimally, the settings for another UMAP run. The result of all this is that the only way I can figure out to drive a given set of HDBSCAN parameters back into BERTopic is to make calls using the internal, non-public interface to BERTopic - never a great idea.

As far as I can tell, there is no way to run HDBSCAN while stopping UMAP from re-running using the public interfaces. It would be great if this were configurable sometime in the future.

@MaartenGr
Owner

However, there is one area that is problematic due to the architecture of BERTopic. As you are well aware, each run of UMAP will produce slightly different results. I've been drilling down on just how different they are, but the bottom line is that a set of HDBSCAN parameters optimized for one UMAP run will approximate, but not match optimally, the settings for another UMAP run. The result of all this is that the only way I can figure out to drive a given set of HDBSCAN parameters back into BERTopic is to make calls using the internal, non-public interface to BERTopic - never a great idea.

As far as I can tell, there is no way to run HDBSCAN while stopping UMAP from re-running using the public interfaces. It would be great if this were configurable sometime in the future.

You can calculate the umap_embeddings before passing them to BERTopic and then define an EmptyUMAP class that essentially does nothing except return those embeddings. You can find a bit more about that here. I believe that process makes optimization of something like HDBSCAN, or any cluster model, a bit easier (depending on the evaluation metric).
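
A minimal sketch of that idea (the EmptyUMAP implementation here is illustrative, and embeddings is assumed to hold the precomputed document embeddings):

from umap import UMAP
from bertopic import BERTopic

umap_embeddings = UMAP(n_neighbors=15, n_components=5, metric='cosine',
                       random_state=42).fit_transform(embeddings)

class EmptyUMAP:
    """Pass-through 'reducer' that hands BERTopic a precomputed reduction."""
    def __init__(self, precomputed):
        self.precomputed = precomputed
    def fit(self, X, y=None):
        return self              # nothing to fit
    def transform(self, X):
        return self.precomputed  # ignore X, return the cached reduction
    def fit_transform(self, X, y=None):
        return self.precomputed

# HDBSCAN parameters can now be swept without UMAP ever re-running
topic_model = BERTopic(umap_model=EmptyUMAP(umap_embeddings))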

@drob-xx
Author

drob-xx commented Oct 19, 2022

Just FYI, I have completed what I think is a reasonably feature-complete version of TopicTuner - a thin class that can be used to tune HDBSCAN for BERTopic. This version allows you to export a BERTopic instance once you have arrived at reasonable parameters. I will cross-post in Discussions.

@MaartenGr
Owner

Thank you for sharing! Hopefully, this helps users reduce the outliers in their HDBSCAN models. I hope to find some time to explore this in the upcoming weeks.

@drob-xx
Author

drob-xx commented Oct 24, 2022

From everything I've seen, tuning HDBSCAN will dramatically alter the number of outliers compared to default or user-guessed parameters. I would love to see examples where it didn't. I provided a simple demo which shouldn't take more than 10 minutes or so to run through on a Colab GPU instance. Please let me know if anything I've written is not clear.

@MaartenGr
Owner

I just ran the provided notebook and the results look interesting. It definitely seems like a nice way for optimizing HDBSCAN based on the number of outliers that you might or might not want! I especially like the options for visualizing and interacting with the results. I think it's important to allow for as much human evaluation as possible.

After playing around with visualizeEmbeddings, I am wondering about the risks of minimizing the number of outliers. Intuitively, decreasing outliers may result in documents ending up in topics where they might not belong. Which, from a quick exploration, seems to happen when we decrease outliers significantly.

Similarly, tuning the model this way opens up a number of other evaluation metrics that users might want to look into. For example, what is the effect of reducing outliers on the topics themselves? What is its effect on things like topic coherence, human evaluation, or even the quality of the clusters? Modularity then also comes into play, as it is one of the main drivers behind BERTopic with respect to the choice of models, evaluation metrics, etc.

Also, after running the code below, I got an error:

from topictuner import TopicModelTuner as TMT
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

tmt = TMT()
tmt.createEmbeddings(docs)  # embed the documents
tmt.reduce()                # run the UMAP reduction

# Random search over min_cluster_size 120-179 and min_samples percentages
lastRunResultsDF = tmt.randomSearch([*range(120,180)], [.1, .25, .5, .75, 1])

tmt.visualizeSearch(lastRunResultsDF).show()

tmt.summarizeResults(lastRunResultsDF).sort_values(by=['number_uncategorized'])  # This is where I get the error below

The error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_250/2549367941.py in <module>
----> 1 tmt.summarizeResults(lastRunResultsDF).sort_values(by=['number_uncategorized'])

/kaggle/working/TopicTuner/topictuner.py in summarizeResults(self, summaryDF)
    372       searches run for this model.
    373       '''
--> 374       if summaryDF == None :
    375         summaryDF = self.ResultsDF
    376       resultSummaryDF = pd.DataFrame()

/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in __nonzero__(self)
   1536     def __nonzero__(self):
   1537         raise ValueError(
-> 1538             f"The truth value of a {type(self).__name__} is ambiguous. "
   1539             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1540         )

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

@drob-xx
Author

drob-xx commented Nov 10, 2022

@MaartenGr

I just ran the provided notebook and the results look interesting.

Cool. Thanks for taking a look.

It definitely seems like a nice way for optimizing HDBSCAN based on the number of outliers that you might or might not want!

In what I've written so far, I've focused on outlier optimization. However, I've found this very useful for selecting the number of topics as well.

I am wondering about the risks of minimizing the number of outliers. Intuitively, decreasing outliers may result in documents ending up in topics where they might not belong.

I think this is a very important issue. So far I've found almost nothing in the literature about HDBSCAN tuning beyond the official documentation. One exception is How To Cluster In High Dimensions, where the author states (in the summary at the end):

However, the HDBSCAN stands out here as an algorithm with only one hyperparameter which is easy enough to optimize via minimization of the amount of unassigned cells.

Beyond this, I don't think I've seen much on how one can objectively determine whether an HDBSCAN clustering is good or not without labeled data to compare it to. You wrote:

Which, from a quick exploration, seems to happen when we decrease outliers significantly.

Can you point me to specific examples where you are seeing this? I really haven't been able to identify any such cases - and the reason I've been pushing on this is that I'm really interested to see what others are experiencing.

Similarly, tuning the model this way opens up a number of other evaluation metrics that users might want to look into.

Yes, I started down this path trying to figure out evaluation metrics for topic modeling. You turned me on to OCTIS some time ago, and I played with a series of metrics but found them all wanting (that's another long discussion 😃). On my (long) list of things to look into is how this all plays against the established metrics. I just finished a preliminary dive into UMAP instability (#831) and would be interested in your take on that - there is a lot of overlap.

What is its effect on things like topic coherence, human evaluation, or even the quality of the clusters?

How would you measure "quality of the clusters"? I'm anxious to see examples where minimizing outliers has a deleterious effect.

Modularity then also comes into play, as it is one of the main drivers behind BERTopic with respect to the choice of models, evaluation metrics, etc.

Yes, you've been very clear on this. For right now I've opted to write a complementary package. My next goal is to figure out whether there is any traction with this approach (beyond my own interest). If there is, I will go back and revisit ways of integrating more tightly with BERTopic, if that makes sense. I got away from the scikit-learn estimator paradigm, partly because I didn't understand it when I started working on this, but also because their approach doesn't work all that well with tightly correlated parameters like min_cluster_size and min_samples, where min_samples is directly related to min_cluster_size. I believe it would be fairly trivial to implement those interfaces for what I'm doing. However, I realize that you have concerns beyond HDBSCAN in terms of these issues.

Also, after running the code below, I got an error:

Fixed now 😃.
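
(For anyone hitting the same thing: the ambiguous-truth-value error came from comparing a DataFrame to None with ==, which pandas broadcasts element-wise. The fix was, roughly, an identity check:)

if summaryDF is None:          # was: if summaryDF == None
    summaryDF = self.ResultsDF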

I completed a version of TopicTuner a couple of weeks ago that allows a complete "BERTopic round trip" out of the box, and since then I have added some convenience features and cleaned up the interface a bit. I just finished a round on UMAP instability (#831), and I'm not sure where my wanderings will take me next. It'd be great to work on stuff that others would benefit from as I continue to scratch my own itches. Cheers.
