We will start by installing the packages that we are going to use:

In [None]:
%%capture
!pip install --upgrade git+https://github.com/UKPLab/sentence-transformers
!pip install keybert ctransformers[cuda]  # KeyBERT plus ctransformers for quantized models
!pip install --upgrade git+https://github.com/huggingface/transformers

### **Loading Mistral 7B**

In [None]:
!pip show ctransformers


Name: ctransformers
Version: 0.2.27
Summary: Python bindings for the Transformer models implemented in C/C++ using GGML library.
Home-page: https://github.com/marella/ctransformers
Author: Ravindra Marella
Author-email: mv.ravindra007@gmail.com
License: MIT
Location: /usr/local/lib/python3.10/dist-packages
Requires: huggingface-hub, py-cpuinfo
Required-by: 


[Mistral AI](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) publishes some of the most capable open-source models. Larger models such as https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF, or paid models such as GPT-4, could be used here for better outputs, but I have used the smaller, quantized Mistral 7B Instruct model for faster outputs. The documentation for these options is at https://maartengr.github.io/KeyBERT/guides/llms.html#openai

In [None]:
from ctransformers import AutoModelForCausalLM

# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
    hf=True
)
# The more layers we put on the GPU, the more VRAM it will use.

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]
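
Since `gpu_layers` should be 0 when no GPU is available, the value can be chosen at runtime instead of being hard-coded. The sketch below is only a heuristic: `pick_gpu_layers` is a hypothetical helper (not part of ctransformers), and it assumes that finding `nvidia-smi` on the `PATH` means a CUDA GPU is usable, which is not guaranteed.

```python
import shutil


def pick_gpu_layers(requested: int = 50) -> int:
    """Return `requested` if an NVIDIA driver tool is found, else 0 (CPU only).

    Hypothetical helper: the presence of `nvidia-smi` suggests, but does not
    prove, that GPU offloading will work.
    """
    return requested if shutil.which("nvidia-smi") else 0


gpu_layers = pick_gpu_layers(50)
print(f"gpu_layers={gpu_layers}")
```

The resulting value could then be passed as `gpu_layers=gpu_layers` in the `from_pretrained` call above.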

Next, we create a Transformers pipeline:

In [None]:
from transformers import AutoTokenizer, pipeline

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

# Pipeline
generator = pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1
)

tokenizer_config.json:   0%|          | 0.00/1.47k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

## **Prompt Engineering**

In [None]:
response = generator("What is 8+2?")
print(response[0]["generated_text"])


What is 8+2?
A: 10


The model can handle a simple example.
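
Note that the generated text starts with the prompt itself, followed by the completion. When only the answer is needed, the echoed prompt can be removed. The helper below, `strip_prompt`, is a hypothetical, library-independent illustration; the Transformers text-generation pipeline also offers a `return_full_text` option that may serve the same purpose.

```python
def strip_prompt(prompt: str, generated_text: str) -> str:
    """Remove the echoed prompt from the start of the generated text."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # If the prompt was not echoed verbatim, return the text unchanged.
    return generated_text.strip()


# Mirrors the output shown above: prompt first, then the answer.
full_text = "What is 8+2?\nA: 10"
print(strip_prompt("What is 8+2?", full_text))  # prints "A: 10"
```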

In [None]:
prompt = """
I have the following document:
* The website mentions that it only takes a couple of days to deliver but I still have not received mine

Extract 5 keywords from that document.
"""
response = generator(prompt)
print(response[0]["generated_text"])


I have the following document:
* The website mentions that it only takes a couple of days to deliver but I still have not received mine

Extract 5 keywords from that document.

**Answer:**
1. Website
2. Mentions
3. Deliver
4. Couple
5. Days


It does okay, but this is where more advanced prompt engineering comes in. As with most Large Language Models, Mistral 7B expects a certain prompt format. This is tremendously helpful when we want to show it what a "correct" interaction looks like.

The prompt template is as follows:

<br>
<div>
<img src="https://github.com/MaartenGr/KeyBERT/assets/25746895/aba167b1-93e6-44ab-a39b-4aab85c858c0" width="850"/>
</div>

It needs to have two components:
* `Example prompt` - This will be used to show the LLM what a "good" output looks like
* `Keyword prompt` - This will be used to ask the LLM to extract the keywords
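
To make the two components concrete, here is a minimal sketch of how a list of example turns and a final instruction could be assembled into the `<s>[INST] ... [/INST] answer</s>[INST] ... [/INST]` layout. The function name `build_instruct_prompt`, the exact whitespace, and the toy example strings are assumptions for illustration only; the hand-written prompts in this notebook are what is actually used.

```python
def build_instruct_prompt(examples, final_instruction):
    """Assemble an instruct-style prompt from example turns.

    examples: list of (instruction, answer) pairs shown to the model
    final_instruction: the instruction the model should answer
    """
    prompt = "<s>"  # beginning-of-sequence marker, written once
    for instruction, answer in examples:
        # Each completed example turn is closed with </s>.
        prompt += f"[INST] {instruction.strip()} [/INST] {answer.strip()}</s>"
    # The final turn is left open so the model generates the answer.
    prompt += f"[INST] {final_instruction.strip()} [/INST]"
    return prompt


demo = build_instruct_prompt(
    examples=[("Give keywords for: my order has not arrived", "order, arrived")],
    final_instruction="Give keywords for: [DOCUMENT]",
)
print(demo)
```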

In [None]:
example_prompt = """
<s>[INST]
I have the following document:
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.

Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] meat, beef, eat, eating, emissions, steak, food, health, processed, chicken</s>"""

The second component, the `keyword_prompt`, will essentially be a repeat of the `example_prompt` but with two changes:

* It will not have an output yet. That will be generated by the LLM.
* We use KeyBERT's `[DOCUMENT]` tag to indicate where the input document will go.

We can use the `[DOCUMENT]` tag to insert a document at a location of our choice. This lets us change the structure of the prompt if needed, without the document being tied to a specific position.
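
Conceptually, filling in the tag amounts to a string substitution. The sketch below illustrates the idea with a hypothetical helper, `fill_document`, and a shortened template; it is an assumption about the general mechanism, not a copy of KeyBERT's internal code.

```python
def fill_document(template: str, document: str, tag: str = "[DOCUMENT]") -> str:
    """Insert `document` wherever `tag` appears in the prompt template."""
    if tag not in template:
        raise ValueError(f"Template does not contain the {tag} tag")
    return template.replace(tag, document)


# Shortened, illustrative template; the tag can sit anywhere in the prompt.
template = "[INST]\nI have the following document:\n- [DOCUMENT]\n[/INST]"
filled = fill_document(template, "Flamingos live in saline lakes.")
print(filled)
```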

In [None]:
keyword_prompt = """
[INST]

I have the following document:
- [DOCUMENT]

Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]
"""

Lastly, we combine the two prompts to create our final template:

In [None]:
prompt = example_prompt + keyword_prompt
print(prompt)


<s>[INST]
I have the following document:
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.

Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] meat, beef, eat, eating, emissions, steak, food, health, processed, chicken</s>
[INST]

I have the following document:
- [DOCUMENT]

Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]



Keyword extraction with KeyLLM: we simply ask the LLM to extract keywords from a document.

Extracting keywords through an LLM in this way is simple and makes it easy to test your LLM and its capabilities.

To use KeyLLM, we start by loading our LLM through `keybert.llm.TextGeneration` and give it the prompt template that we created before.

In [None]:
from keybert.llm import TextGeneration
from keybert import KeyLLM

# Load it in KeyLLM
llm = TextGeneration(generator, prompt=prompt)
kw_model = KeyLLM(llm)

With the KeyLLM instance prepared, we can extract keywords from a document:

In [None]:
documents = [
"Massive flocks of greater and lesser flamingos are often associated with the saline and alkaline lakes of Kenya and Tanzania. While greater flamingos can inhabit both saltwater and freshwater habitats, lesser flamingos are found in saline waters, and the species is considered “near threatened” by the International Union for Conservation of Nature. India has the largest population of lesser flamingos outside the African continent, mostly in the salt deserts of the western state of Gujarat. There are few historical records of flamingos in Mumbai; one from 1891 suggests they were an occasional bird of passage in the region.",
]

keywords = kw_model.extract_keywords(documents)
keywords

[['flamingos',
  'Kenya',
  'Tanzania',
  'saline',
  'alkaline',
  'lakes',
  'greater',
  'lesser',
  'flamingos',
  'saltwater',
  'freshwater',
  'habitats',
  'near threatened',
  'International Union for Conservation of Nature',
  'India',
  '']]
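
The raw output above contains a repeated keyword (`'flamingos'`) and an empty string. A small post-processing step can tidy this up. `clean_keywords` below is a hypothetical helper, not part of KeyBERT, and treating keywords case-insensitively when removing duplicates is an assumption.

```python
def clean_keywords(keywords):
    """Drop empty strings and duplicates while preserving the original order."""
    seen = set()
    cleaned = []
    for keyword in keywords:
        keyword = keyword.strip()
        key = keyword.lower()  # assumption: compare case-insensitively
        if keyword and key not in seen:
            seen.add(key)
            cleaned.append(keyword)
    return cleaned


# A shortened version of the output above, one inner list per document.
raw = [["flamingos", "Kenya", "flamingos", "near threatened", ""]]
cleaned = [clean_keywords(doc_keywords) for doc_keywords in raw]
print(cleaned)  # [['flamingos', 'Kenya', 'near threatened']]
```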

# 🏆 Efficient Keyword Extraction with `KeyBERT` & `KeyLLM`

Before, we passed the documents directly to KeyLLM to essentially do a zero-shot extraction of keywords. We can further extend this example by leveraging KeyBERT.

Since KeyBERT generates keywords and embeds the documents, we can leverage that not only to simplify the pipeline but also to suggest a number of keywords to the LLM.

These suggested keywords can help the LLM decide on the keywords to use. Moreover, it allows for everything within KeyBERT to be used with KeyLLM!

The combined KeyBERT and KeyLLM pipeline looks as follows:

<div>
<img src="https://github.com/MaartenGr/BERTopic/assets/25746895/01b4b831-7dd3-4ea9-be81-6dff4cc9a32b" width="450"/>
</div>



For this efficient keyword extraction with both `KeyBERT` and `KeyLLM`, we create a KeyBERT model and assign it the LLM together with an embedding model:

In [None]:
from keybert import KeyBERT

# Create a KeyBERT model that uses our LLM together with an embedding model
kw_model = KeyBERT(llm=llm, model='BAAI/bge-small-en-v1.5')

# Extract keywords
keywords = kw_model.extract_keywords(documents, threshold=0.5)

In [None]:
keywords

[['flamingos',
  'Kenya',
  'Tanzania',
  'saline',
  'alkaline',
  'lakes',
  'greater',
  'lesser',
  'flamingos',
  'saltwater',
  'freshwater',
  'habitats',
  'near threatened',
  'International Union for Conservation of Nature',
  'India',
  '']]