Request: Zeroshot option to assign unassigned documents to outliers rather than reclustering #1958

Open
zilch42 opened this issue Apr 30, 2024 · 3 comments

@zilch42
Contributor

zilch42 commented Apr 30, 2024

Hi Maarten,

I have a use case at the moment where I'm using zero shot topic modeling to assign documents to a list of known clusters. I'm not really interested in finding other unknown clusters in the data, but I do know that there will be some documents that don't match anything and I would just like them to be outliers.

At the moment, the workflow for zeroshot is that any documents that don't match a zeroshot topic to a certain threshold go into a pool to be run through the standard BERTopic pipeline. That's useful in some cases, but not others. One issue that I have encountered is that if only a few docs don't fit into a topic (e.g. 4), UMAP can't handle it and produces an error (the same error as in #1900, when I was trying to visualise only 4 topics).

Could we have an option in zeroshot to determine where to direct documents that fall below zeroshot_min_similarity? Either to outliers or to reclustering?

Cheers

@MaartenGr
Owner

Thanks for the suggestion!

I want to prevent adding any parameter to the init of BERTopic as that would further complicate using the model. Having said that, I think you can already do what you suggested as follows:

  1. Train a zero-shot topic model and set zeroshot_min_similarity=0 to make sure that all documents are assigned to the zero-shot topics. This will prevent clustering.
  2. Use the resulting probabilities (.probabilities_) to select only the documents that exceed your specified threshold, i.e. the threshold you would normally use in zero-shot topic modeling. Retain the topic label of documents that exceed the threshold and set the label of documents that do not exceed it to -1. In essence, you are creating .topics_.
  3. Finally, use manual BERTopic to model your newly created topics.
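
Step 2 above could be sketched roughly as follows. This is a minimal illustration, not BERTopic's own code: it assumes you already have a per-document, per-topic similarity (or probability) array of shape (n_docs, n_topics), and the function name assign_topics_with_outliers and the example values are hypothetical.

```python
import numpy as np


def assign_topics_with_outliers(similarities, min_similarity):
    """Keep the best-matching topic per document, or -1 if it falls below the threshold.

    `similarities` is assumed to be an (n_docs, n_topics) array of
    document-to-topic scores; this helper is hypothetical and not part of BERTopic.
    """
    similarities = np.asarray(similarities, dtype=float)
    best_topic = similarities.argmax(axis=1)   # index of the closest topic
    best_score = similarities.max(axis=1)      # score of that topic
    # Documents whose best score is below the threshold become outliers (-1).
    topics = np.where(best_score >= min_similarity, best_topic, -1)
    return topics.tolist()


# Made-up scores for three documents and two zero-shot topics.
similarities = np.array([
    [0.91, 0.10],
    [0.20, 0.35],
    [0.15, 0.88],
])
topics = assign_topics_with_outliers(similarities, min_similarity=0.7)
print(topics)  # [0, -1, 1]
```

The resulting list plays the role of .topics_ and could then be handed to a manually configured BERTopic model as in step 3.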

@zilch42
Contributor Author

zilch42 commented May 2, 2024

Thanks Maarten,

That's more or less what I'm doing at the moment, except that zeroshot doesn't actually assign the probabilities, so topic_model.probabilities_ is nan and I'm recalculating the zeroshot topic embeddings and the cosine similarities myself. That's not a big deal as it doesn't take long, but it would make sense for the max cosine similarity to be saved in probabilities_, as that is basically what they are. It's probably a one-liner to add if you'd like a PR.
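
For reference, the manual recalculation described here might look something like the sketch below. It assumes the document embeddings and the zero-shot topic label embeddings are already available as NumPy arrays from the same embedding model; the function name cosine_similarity_matrix and the toy vectors are hypothetical, and this is not necessarily how BERTopic computes the values internally.

```python
import numpy as np


def cosine_similarity_matrix(doc_embeddings, topic_embeddings):
    """Return an (n_docs, n_topics) matrix of cosine similarities.

    Assumes every embedding has non-zero norm; hypothetical helper,
    not part of BERTopic.
    """
    docs = np.asarray(doc_embeddings, dtype=float)
    topics = np.asarray(topic_embeddings, dtype=float)
    # L2-normalise each row so the dot product equals the cosine similarity.
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    topics = topics / np.linalg.norm(topics, axis=1, keepdims=True)
    return docs @ topics.T


# Toy 2-d embeddings standing in for real model output.
doc_embeddings = np.array([[1.0, 0.0], [2.0, 1.0], [0.0, 2.0]])
topic_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])

sims = cosine_similarity_matrix(doc_embeddings, topic_embeddings)
max_similarity = sims.max(axis=1)   # the value that could be stored per document
best_topic = sims.argmax(axis=1)    # the zero-shot topic each document is closest to
```

Here max_similarity is the per-document value that would be compared against zeroshot_min_similarity.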

@MaartenGr
Owner

> That's more or less what I'm doing at the moment, except that zeroshot doesn't actually assign the probabilities so topic_model.probabilities_ is nan so I'm recalculating the zeroshot topic embeddings and the cosine similarities myself.

Are you using .transform for that? That way, you wouldn't have to do anything outside of BERTopic.

> That's not a big deal as it doesn't take long, but it would make sense for the max cosine similarity to be saved in probabilities_ as that is basically what they are. It's probably a one liner to add if you'd like a PR.

I'm not sure I understand what you mean. Do you mean calculating the probabilities already during zero-shot topic modeling? That should indeed be straightforward.
