Zero-Shot #1982

Open
sucduit opened this issue May 9, 2024 · 2 comments

sucduit commented May 9, 2024

I have a question about zero-shot topic modeling. I used zero-shot BERTopic to do topic mining for my dissertation, and I need to explain the process in more detail. Are zero-shot classification and HDBSCAN run concurrently, or does zero-shot classification precede HDBSCAN clustering? I asked GPT-4. At first it said to run HDBSCAN first and then use zero-shot to label the documents. Then I gave it the flowchart, and it said to do zero-shot first and then HDBSCAN. After a few more questions, GPT-4 described the process as "Simultaneous Processing Paths", with zero-shot and HDBSCAN as two paths. I would very much appreciate a more detailed explanation of the process, since the committee may ask such questions. Thanks again.

@MaartenGr (Owner)

As a general tip, GPT-4, albeit an amazing LLM, is not necessarily the best tool for fact-based information, even if you supply it with the source material. As you noticed, there is a risk that GPT-4 gives the wrong answer without realizing it. When it comes to facts, I would advise always checking the source material first, as it is important to be able to read the docs as well as the underlying code.

Having said that, you can find more about the technique in the documentation:

This method works as follows. First, we create a number of labels for our predefined topics and embed them using any embedding model. Then, we compare the embeddings of the documents with the predefined labels using cosine similarity. If they pass a user-defined threshold, the zero-shot topic is assigned to a document. If it does not, then that document, along with others, will be put through a regular BERTopic model.

In other words, the zero-shot topics are assigned first and precede the HDBSCAN clustering. Then, both models are merged.
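
To make the order of operations concrete, here is a minimal sketch of the assignment step described above. It is not BERTopic's actual code: the function names, the toy 2-D embeddings, the 0.85 threshold, and the use of `>=` for "passing" the threshold are all assumptions made for illustration. In BERTopic itself, the predefined labels and the threshold are supplied through the `zeroshot_topic_list` and `zeroshot_min_similarity` parameters.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def split_by_zeroshot(doc_embeddings, label_embeddings, min_similarity):
    """Hypothetical helper: assign each document to its most similar
    predefined label if the similarity reaches the threshold.

    Returns (assigned, leftover):
      assigned: dict mapping document index -> label index
      leftover: list of document indices that did not pass the threshold
                and would go on to the regular clustering step
    """
    assigned, leftover = {}, []
    for i, doc in enumerate(doc_embeddings):
        sims = [cosine_similarity(doc, label) for label in label_embeddings]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] >= min_similarity:  # assumption: ">=" counts as passing
            assigned[i] = best
        else:
            leftover.append(i)
    return assigned, leftover


# Toy embeddings (made up): two predefined labels, three documents.
label_embeddings = [[1.0, 0.0], [0.0, 1.0]]
doc_embeddings = [[0.9, 0.1], [0.1, 0.95], [0.7, 0.7]]

assigned, leftover = split_by_zeroshot(
    doc_embeddings, label_embeddings, min_similarity=0.85
)
print(assigned)  # documents matched to a zero-shot topic
print(leftover)  # documents left for the regular BERTopic model
```

Only the documents in `leftover` would then be clustered (with HDBSCAN by default), which is why the zero-shot step comes first rather than running in parallel.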


sucduit commented May 10, 2024 via email
