
KeyBERT with llm and embedding model: endless loop during extract_keywords #190

Closed
SolanaO opened this issue Dec 1, 2023 · 4 comments

SolanaO commented Dec 1, 2023

I cannot get any output when I use both an LLM and an embedding model in KeyBERT. It runs for more than 10 minutes on an input of 10-15 tokens (an arXiv paper title, for example). I have no issues with KeyBERT if I don't specify the embedding model. I am using Google Colab Pro with an A100 or V100 GPU and high RAM.

Sample code, based on the Medium blog post and the GitHub code; I omitted the imports, prompt, and docs samples for clarity:

model_name_or_path = "TheBloke/zephyr-7B-beta-GPTQ"
llm = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                           device_map="auto",
                                           trust_remote_code=False,
                                           revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

generator = pipeline(
    model=llm,
    tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1
)

# Load it in KeyLLM
llm_tg = TextGeneration(generator, prompt=prompt)

# THIS GETS STUCK
kw_model = KeyBERT(llm=llm_tg, model='BAAI/bge-small-en-v1.5')
keywords = kw_model.extract_keywords(docs, threshold=0.5)

# THIS WORKS WELL
key_model = KeyBERT(llm=llm_tg)
keywords = key_model.extract_keywords(docs, threshold=0.5)
@MaartenGr (Owner)

Strange, not what I would expect. Could you share the full code? That actually makes it easier for me to reproduce the issue and perhaps find out what is going wrong here.

Before I test things out, it might be related to the threshold parameter. Since your two use cases use different embedding models, they generally cannot use the exact same threshold. Perhaps increasing that value will speed things up.
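
If you want to sanity-check the threshold for a given embedding model, you can also precompute the embeddings yourself, inspect the pairwise cosine similarities, and pass the embeddings to KeyLLM directly. A minimal sketch, reusing the docs and llm_tg from your code (the embeddings= argument here is the one described in the blog post):

from keybert import KeyLLM
from sentence_transformers import SentenceTransformer, util

# Embed the documents once with the model you plan to use
st_model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings = st_model.encode(docs, convert_to_tensor=True)

# Inspect the pairwise cosine similarities before picking a threshold
print(util.cos_sim(embeddings, embeddings))

# Hand the precomputed embeddings to KeyLLM
kw_model = KeyLLM(llm_tg)
keywords = kw_model.extract_keywords(docs, embeddings=embeddings, threshold=0.75)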

SolanaO (Author) commented Dec 1, 2023

First, thank you for your prompt reply. I include the full code below. I did a few experiments, and yes, the issue is related to the threshold. Now, here is the interesting thing: I tested several thresholds, going down from 0.7, which works. Keywords are generated for a threshold of 0.57 (about 2-3 seconds for the docs in the sample below), but not for a threshold of 0.56. A strange, interesting fact!

!pip3 install transformers optimum accelerate
!pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  
!pip3 install keybert
!pip install -U sentence-transformers


from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from keybert.llm import TextGeneration
from keybert import KeyLLM, KeyBERT
from sentence_transformers import SentenceTransformer


prompt = """
<|system|>
I have the following document:
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.

You are a helpful assistant that extracts the keywords that are present in a document and separates them with commas.
Make sure to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"</s>

<|user|>
Please give me the keywords that are present in the following document:
- [DOCUMENT]

You are a helpful assistant that extracts the keywords that are present in a document and separates them with commas.
Make sure to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"</s>

<|assistant|>
"""

docs = ['Von Neumann Quantum Logic vs. Classical von Neumann Architecture?',
 'Minimum Description Length and Compositionality',
 'Effect of different packet sizes on RED performance',
 'RED behavior with different packet sizes',
 'Predicting the expected behavior of agents that learn about agents: the\n  CLRI framework']


model_name_or_path = "TheBloke/zephyr-7B-beta-GPTQ"
llm = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                           device_map="auto",
                                           trust_remote_code=False,
                                           revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

generator = pipeline(
    model=llm,
    tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1
)


# Load it in KeyLLM
llm_tg = TextGeneration(generator, prompt=prompt)

# GETS STUCK for threshold <=0.56, works for threshold 0.57 or greater
kw_model = KeyBERT(llm=llm_tg, model='BAAI/bge-small-en-v1.5')
keywords = kw_model.extract_keywords(docs, threshold=0.5)

# WORKS WELL
key_model = KeyBERT(llm=llm_tg)
keywords = key_model.extract_keywords(docs, threshold=0.5) 

@MaartenGr (Owner)

> I tested several thresholds, going down from 0.7, which works. Keywords are generated for a threshold of 0.57 (about 2-3 seconds for the docs in the sample below), but not for a threshold of 0.56.

Thanks for sharing the code that currently works; it will definitely help those having the same issue! That is definitely interesting. It seems that this particular embedding model has a specific distribution of similarities when cosine similarity is applied, so the workable threshold range depends on the model.
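
If you want to see exactly where that 0.56/0.57 boundary comes from, sorting the off-diagonal pairwise similarities of your sample docs should show it. A quick diagnostic sketch, assuming the same docs list as above:

import numpy as np
from sentence_transformers import SentenceTransformer, util

st_model = SentenceTransformer('BAAI/bge-small-en-v1.5')
emb = st_model.encode(docs, normalize_embeddings=True)

# Pairwise cosine similarities; the diagonal is each doc with itself
sims = util.cos_sim(emb, emb).numpy()

# Sort the unique off-diagonal values; any pair sitting just below
# 0.57 would explain the sharp cutoff between 0.56 and 0.57
upper = sims[np.triu_indices_from(sims, k=1)]
print(np.sort(upper))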

SolanaO (Author) commented Dec 1, 2023

Intriguing indeed! Thanks again for your help!

SolanaO closed this as completed Dec 1, 2023