Request: Zeroshot option to assign unassigned documents to outliers rather than reclustering #1958
Comments
Thanks for the suggestion! I want to avoid adding another parameter to the init of BERTopic, as that would further complicate using the model. Having said that, I think you can already do what you suggested as follows:
Thanks Maarten. That's more or less what I'm doing at the moment, except that zeroshot doesn't actually assign the probabilities, so
Are you using
Not sure if I understand what you mean. Do you mean calculating the probabilities already during zero-shot topic modeling? That should indeed be straightforward.
Hi Maarten,
I have a use case at the moment where I'm using zero shot topic modeling to assign documents to a list of known clusters. I'm not really interested in finding other unknown clusters in the data, but I do know that there will be some documents that don't match anything and I would just like them to be outliers.
At the moment, the zeroshot workflow sends any documents that don't match a zeroshot topic above a certain threshold into a pool that is then run through the standard BERTopic pipeline. That's useful in some cases, but not in others. One issue I have encountered is that if only a few docs don't fit into a topic (e.g. 4), UMAP can't handle it and produces an error (the same error as in #1900, when I was trying to visualise only 4 topics).
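To make the threshold routing concrete, here is a minimal sketch in plain NumPy. It is not BERTopic's actual implementation: the function name `route_by_similarity`, the use of cosine similarity, and the toy embeddings are assumptions made for illustration, and the `min_similarity` argument only mirrors the role of `zeroshot_min_similarity`. Documents whose best match falls below the threshold are labelled `-1` (outliers) instead of being pooled for reclustering.

```python
import numpy as np

def route_by_similarity(doc_embeddings, topic_embeddings, min_similarity=0.7):
    """Hypothetical helper: assign each document to its closest predefined
    topic, or to -1 (outlier) when the best cosine similarity is too low."""
    # Normalise rows so the dot product equals cosine similarity.
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    topics = topic_embeddings / np.linalg.norm(topic_embeddings, axis=1, keepdims=True)

    similarity = docs @ topics.T              # shape: (n_docs, n_topics)
    best_topic = similarity.argmax(axis=1)    # index of the closest topic
    best_score = similarity.max(axis=1)       # its similarity score

    # Below-threshold documents become outliers rather than being reclustered.
    assignments = np.where(best_score >= min_similarity, best_topic, -1)
    return assignments, best_score

# Toy data (made up): two predefined topics, three documents.
topic_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_embeddings = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]])

assignments, scores = route_by_similarity(
    doc_embeddings, topic_embeddings, min_similarity=0.9
)
print(assignments)  # the third document sits between both topics and becomes -1
```

Because the outlier branch never hands the leftover documents to UMAP, a pool of only a handful of unmatched documents would not trigger the error described above.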
Could we have an option in zeroshot to determine where to direct documents that fall below `zeroshot_min_similarity`? Either to outliers or to reclustering?

Cheers