
KeyBERT with llm and embedding model: endless loop during extract_keywords #190

Closed
SolanaO opened this issue Dec 1, 2023 · 4 comments

SolanaO commented Dec 1, 2023

I cannot get any output when I use both an LLM and an embedding model in KeyBERT. It runs for more than 10 minutes on an input of 10-15 tokens (an arXiv paper title, for example). I have no issues with KeyBERT if I don't specify the embedding model. I am using Google Colab Pro with an A100 or V100 GPU and high RAM.

Sample code, based on the Medium blog post and the GitHub code; I omitted the imports, prompt, and docs samples for clarity:

model_name_or_path = "TheBloke/zephyr-7B-beta-GPTQ"
llm = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                           device_map="auto",
                                           trust_remote_code=False,
                                           revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

generator = pipeline(
    model=llm,
    tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1
)

# Load it in KeyLLM
llm_tg = TextGeneration(generator, prompt=prompt)

# THIS GETS STUCK
kw_model = KeyBERT(llm=llm_tg, model='BAAI/bge-small-en-v1.5')
keywords = kw_model.extract_keywords(docs, threshold=0.5)

# THIS WORKS WELL
key_model = KeyBERT(llm=llm_tg)
keywords = key_model.extract_keywords(docs, threshold=0.5)
@MaartenGr (Owner)

Strange, not what I would expect. Could you share the full code? That actually makes it easier for me to reproduce the issue and perhaps find out what is going wrong here.

Before I test things out, it might be related to the threshold parameter. Since your two use cases use different embedding models, they generally cannot use the exact same threshold. Perhaps increasing that value will speed things up.
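
If you want to sanity-check the threshold for a given embedding model, you can also precompute the embeddings yourself, inspect the pairwise cosine similarities, and pass the embeddings to KeyLLM directly. A minimal sketch, reusing the docs and llm_tg from your code (the embeddings= argument here is the one described in the blog post):

from keybert import KeyLLM
from sentence_transformers import SentenceTransformer, util

# Embed the documents once with the model you plan to use
st_model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings = st_model.encode(docs, convert_to_tensor=True)

# Inspect the pairwise cosine similarities before picking a threshold
print(util.cos_sim(embeddings, embeddings))

# Hand the precomputed embeddings to KeyLLM
kw_model = KeyLLM(llm_tg)
keywords = kw_model.extract_keywords(docs, embeddings=embeddings, threshold=0.75)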

SolanaO (Author) commented Dec 1, 2023

First, thank you for your prompt reply. I include the full code below. I did a few experiments, and yes, the issue is related to the threshold. Now, here is the interesting thing: I tested several thresholds, going down from 0.7, which works. Keywords are generated for a threshold of 0.57 (about 2-3 seconds for the docs in the sample below), but not for a threshold of 0.56. A strange, interesting fact!

!pip3 install transformers optimum accelerate
!pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  
!pip3 install keybert
!pip install -U sentence-transformers


from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from keybert.llm import TextGeneration
from keybert import KeyLLM, KeyBERT
from sentence_transformers import SentenceTransformer


prompt = """
<|system|>
I have the following document:
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.

You are a helpful assistant that extracts the keywords that are present in a document and separates them with commas.
Make sure to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"</s>

<|user|>
Please give me the keywords that are present in the following document:
- [DOCUMENT]

You are a helpful assistant that extracts the keywords that are present in a document and separates them with commas.
Make sure to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"</s>

<|assistant|>
"""

docs = ['Von Neumann Quantum Logic vs. Classical von Neumann Architecture?',
 'Minimum Description Length and Compositionality',
 'Effect of different packet sizes on RED performance',
 'RED behavior with different packet sizes',
 'Predicting the expected behavior of agents that learn about agents: the\n  CLRI framework']


model_name_or_path = "TheBloke/zephyr-7B-beta-GPTQ"
llm = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                           device_map="auto",
                                           trust_remote_code=False,
                                           revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

generator = pipeline(
    model=llm,
    tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1
)


# Load it in KeyLLM
llm_tg = TextGeneration(generator, prompt=prompt)

# GETS STUCK for threshold <=0.56, works for threshold 0.57 or greater
kw_model = KeyBERT(llm=llm_tg, model='BAAI/bge-small-en-v1.5')
keywords = kw_model.extract_keywords(docs, threshold=0.5)

# WORKS WELL
key_model = KeyBERT(llm=llm_tg)
keywords = key_model.extract_keywords(docs, threshold=0.5) 

@MaartenGr (Owner)

> I tested several thresholds, going down from 0.7, which works. Keywords are generated for a threshold of 0.57 (about 2-3 seconds for the docs in the sample below), but not for a threshold of 0.56.

Thanks for sharing the code that currently works; it will definitely help those having the same issue! That is definitely interesting. It seems that this particular embedding model has a specific distribution of similarities when cosine similarity is applied, so the workable threshold range depends on the model.
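
If you want to see exactly where that 0.56/0.57 boundary comes from, sorting the off-diagonal pairwise similarities of your sample docs should show it. A quick diagnostic sketch, assuming the same docs list as above:

import numpy as np
from sentence_transformers import SentenceTransformer, util

st_model = SentenceTransformer('BAAI/bge-small-en-v1.5')
emb = st_model.encode(docs, normalize_embeddings=True)

# Pairwise cosine similarities; the diagonal is each doc with itself
sims = util.cos_sim(emb, emb).numpy()

# Sort the unique off-diagonal values; any pair sitting just below
# 0.57 would explain the sharp cutoff between 0.56 and 0.57
upper = sims[np.triu_indices_from(sims, k=1)]
print(np.sort(upper))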

SolanaO (Author) commented Dec 1, 2023

Intriguing indeed! Thanks again for your help!

SolanaO closed this as completed Dec 1, 2023