# Keyword Extraction with KeyLLM and KeyBERT

## Installing the Required Libraries

In [None]:
!pip install --upgrade git+https://github.com/UKPLab/sentence-transformers
!pip install keybert ctransformers[cuda]
!pip install --upgrade git+https://github.com/huggingface/transformers

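## Loading the Model
Load the model and offload 50 of its layers to the GPU, which shifts memory use from RAM to VRAM. If you run into out-of-memory errors, lower the `gpu_layers` parameter.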

In [None]:
from ctransformers import AutoModelForCausalLM

# Set gpu_layers to the number of layers to offload to GPU. 
# Set to 0 if no GPU acceleration is available on your system.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",  # name of the model file inside the repo
    model_type="mistral",
    gpu_layers=50,
    hf=True
)
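
The advice to lower `gpu_layers` on memory errors can be automated with a small retry loop. The following is a minimal sketch, not part of ctransformers: `load_with_fallback` is a hypothetical helper, and it assumes that a failed load surfaces as a Python `MemoryError` or `RuntimeError`, which may not hold on every system (some out-of-memory conditions terminate the process instead).

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


def load_with_fallback(
    loader: Callable[[int], T], layer_options: Iterable[int]
) -> Tuple[T, int]:
    """Try each gpu_layers value in order and return the first model that loads.

    `loader` is any callable that takes a gpu_layers value and returns a model.
    """
    last_error = None
    for gpu_layers in layer_options:
        try:
            return loader(gpu_layers), gpu_layers
        except (MemoryError, RuntimeError) as error:
            # Assumption: a failed load raises one of these exceptions.
            last_error = error
    raise RuntimeError("no gpu_layers setting worked") from last_error


# Hypothetical usage with the loader from this notebook:
# model, used_layers = load_with_fallback(
#     lambda n: AutoModelForCausalLM.from_pretrained(
#         "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
#         model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
#         model_type="mistral",
#         gpu_layers=n,
#         hf=True,
#     ),
#     [50, 35, 20, 0],
# )
```

Ending the list with `0` falls back to CPU-only inference, matching the comment in the cell above.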

With the model loaded through ctransformers, we can build the pipeline, including the tokenizer, with the transformers library.

In [None]:
from transformers import AutoTokenizer, pipeline

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

# Pipeline
generator = pipeline(
    model=model, 
    tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1  # values above 1 discourage repeating tokens already generated
)
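
To make the effect of `repetition_penalty` concrete, here is a toy sketch of the usual rescaling rule: scores of tokens that were already generated are divided by the penalty when positive and multiplied by it when negative. `apply_repetition_penalty` is a hypothetical helper that works on plain Python lists; it illustrates the idea and is not the transformers implementation itself.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Return a copy of `logits` with already-generated tokens made less likely."""
    penalized = list(logits)
    for token_id in set(generated_ids):
        score = penalized[token_id]
        # Positive scores shrink; negative scores become more negative.
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized


# Tokens 0 and 1 were already generated, token 2 was not.
logits = [2.2, -1.0, 0.5]
print(apply_repetition_penalty(logits, generated_ids=[0, 1]))
```

With a penalty of 1.0 the scores are unchanged, which is why values slightly above 1 act as a mild nudge against repetition.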

In [None]:
prompt = """
I have the following document:
* The website mentions that it only takes a couple of days to deliver but I still have not received mine

Extract 5 keywords from that document.
"""
response = generator(prompt)
print(response[0]["generated_text"])

## Enriching the Prompt for Better Output

In [None]:
example_prompt = """
<s>[INST]
I have the following document:
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.

Please give me the keywords that are present in this document and separate them with commas.
Make sure to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] website, mentions, deliver, couple, days, received</s>"""

In [None]:

keyword_prompt = """
[INST]
I have the following document:
- [DOCUMENT]

Please give me the keywords that are present in this document and separate them with commas.
Make sure to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]
"""

In [None]:
prompt = example_prompt + keyword_prompt
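
The combined prompt still contains the `[DOCUMENT]` tag, which is expected to be replaced with each input document before the text reaches the generator. The sketch below illustrates that substitution with plain string replacement. `fill_prompt` is a hypothetical helper rather than a KeyBERT function, and the two prompts are shortened stand-ins for the ones defined above.

```python
# Shortened stand-ins for example_prompt and keyword_prompt.
example_prompt = (
    "<s>[INST] Document: - I still have not received my order.\n"
    "Return only the keywords. [/INST] order, received</s>"
)
keyword_prompt = (
    "\n[INST] Document: - [DOCUMENT]\n"
    "Return only the keywords. [/INST]\n"
)
prompt = example_prompt + keyword_prompt


def fill_prompt(template: str, document: str) -> str:
    """Insert one document at the [DOCUMENT] tag of the template."""
    return template.replace("[DOCUMENT]", document)


filled = fill_prompt(prompt, "I received my package!")
print(filled)
```

Because the worked example comes first and the open `[INST]` block comes last, the model sees one completed turn before it is asked to answer for the new document.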

In [None]:
from keybert.llm import TextGeneration
from keybert import KeyLLM

# Load it in KeyLLM
llm = TextGeneration(generator, prompt=prompt)
kw_model = KeyLLM(llm)

In [None]:
documents = [
"The website mentions that it only takes a couple of days to deliver but I still have not received mine.",
"I received my package!",
"Whereas the most powerful LLMs have generally been accessible only through limited APIs (if at all), Meta released LLaMA's model weights to the research community under a noncommercial license."
]

keywords = kw_model.extract_keywords(documents)
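
Since the prompt asks for comma-separated keywords, the raw model response has to be split into a list at some point. The following sketch shows one way such post-processing could look; `parse_keywords` is a hypothetical helper, the sample string is hand-written rather than real model output, and KeyLLM's own parsing may differ in detail.

```python
def parse_keywords(generated: str) -> list:
    """Split a comma-separated response into a list of unique, trimmed keywords."""
    keywords = []
    for part in generated.split(","):
        keyword = part.strip()
        # Skip empty fragments and duplicates while keeping the original order.
        if keyword and keyword not in keywords:
            keywords.append(keyword)
    return keywords


print(parse_keywords(" website, delivery , days,, delivery"))
# → ['website', 'delivery', 'days']
```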

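## Extracting Keywords More Efficiently with KeyLLM
- First, embed all documents, converting them into numerical representations.
- Second, find which documents are most similar to one another. We assume that highly similar documents share the same keywords, so keywords do not need to be extracted for every document.
- Third, extract keywords from only one document per cluster and assign them to all documents in that cluster.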
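
The three steps can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not KeyLLM's actual clustering: `group_similar` and `assign_keywords` are hypothetical helpers, the grouping is a simple greedy pass over cosine similarities, and the two-dimensional vectors are toy embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def group_similar(embeddings, threshold):
    """Greedy grouping: a document joins the first group whose
    representative (its first member) is at least `threshold` similar."""
    groups = []
    for index, vector in enumerate(embeddings):
        for group in groups:
            if cosine(embeddings[group[0]], vector) >= threshold:
                group.append(index)
                break
        else:
            groups.append([index])
    return groups


def assign_keywords(groups, extract):
    """Call `extract` once per group and share its result with every member."""
    keywords = {}
    for group in groups:
        shared = extract(group[0])
        for index in group:
            keywords[index] = shared
    return keywords


# Toy embeddings: documents 0 and 1 are similar, document 2 is unrelated.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
groups = group_similar(toy_embeddings, threshold=0.5)

calls = []  # records which documents would be sent to the LLM


def fake_extract(index):
    calls.append(index)
    return [f"keywords-of-doc-{index}"]


result = assign_keywords(groups, fake_extract)
print(groups)  # → [[0, 1], [2]]
print(calls)   # → [0, 2]
```

Only two of the three documents trigger an extraction call, which is where the savings come from when the extractor is an LLM.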

In [None]:
from keybert import KeyLLM
from sentence_transformers import SentenceTransformer

# Extract embeddings
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings = model.encode(documents, convert_to_tensor=True)

# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
keywords = kw_model.extract_keywords(
    documents, 
    embeddings=embeddings, 
    threshold=.5
)

Raising `threshold` to around 0.95 groups only near-identical documents, whereas setting it to around 0.5 groups documents about the same topic.
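
The effect of the threshold can be seen on a few toy vectors. The sketch below lists the document pairs whose cosine similarity reaches a given threshold. `similar_pairs` is a hypothetical helper and the vectors are made up for illustration, so the exact values say nothing about real embeddings.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def similar_pairs(embeddings, threshold):
    """Return index pairs whose cosine similarity is at least `threshold`."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(embeddings), 2)
        if cosine(a, b) >= threshold
    ]


# Documents 0 and 1 are near-duplicates, 2 is loosely related, 3 points elsewhere.
toy = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8], [0.0, 1.0]]

print(similar_pairs(toy, 0.95))  # → [(0, 1)]
print(similar_pairs(toy, 0.5))   # → [(0, 1), (0, 2), (1, 2), (2, 3)]
```

A strict threshold keeps only the near-duplicate pair, while a loose one links loosely related documents as well, so fewer documents need their own extraction call at the cost of less specific keywords.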

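The approach above judges document similarity from embeddings that we computed ourselves. Below, the LLM is passed to KeyBERT instead: KeyBERT embeds the documents with the given model and extracts candidate keywords, which the LLM can then use when producing the final keywords.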

In [None]:
from keybert import KeyLLM, KeyBERT

# Load the LLM in KeyBERT
kw_model = KeyBERT(llm=llm, model='BAAI/bge-small-en-v1.5')

# Extract keywords
keywords = kw_model.extract_keywords(documents, threshold=0.5)