<a href="https://colab.research.google.com/github/Alfred9/Exploring-LLMs/blob/main/Keyword%20Extraction%20with%20Mistral%207B/Keyword_Extraction_with_Mistral_7B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introducing `KeyLLM`: Keyword Extraction with Mistral 7B**




---
        
**NOTE**: We will want to use a GPU to run both Mistral 7B and KeyBERT for this use case. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---

Let's start by installing a number of packages that we are going to use throughout this example:

In [None]:
%%capture
!pip install --upgrade git+https://github.com/UKPLab/sentence-transformers
!pip install keybert ctransformers[cuda]
!pip install --upgrade git+https://github.com/huggingface/transformers

In [None]:
from ctransformers import AutoModelForCausalLM

# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
    hf=True
)

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

After having loaded the model itself, we want to create a 🤗 Transformers pipeline.

The main benefit of doing so is that these pipelines appear in many tutorials and are often used as a backend by other packages. So far, `ctransformers` is not supported natively as widely as `transformers`.

Loading the Mistral tokenizer with `ctransformers` is not yet possible as the model is quite new. Instead, we use the tokenizer from the original repository.

In [None]:
from transformers import AutoTokenizer, pipeline

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

# Pipeline
generator = pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,
    repetition_penalty=1.1
)
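
The pipeline above sets `repetition_penalty=1.1`. As a rough illustration of what such a penalty does, the sketch below rescales the scores of tokens that were already generated, using the commonly described rule: divide positive scores by the penalty and multiply negative scores by it. The function name and the toy scores are hypothetical, and the real logic runs on tensors inside the generation library, so treat this as a simplified picture rather than the actual implementation.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.1):
    """Return a copy of `logits` where already-seen tokens are made less likely."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        # Positive scores shrink, negative scores move further below zero.
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted


# Toy scores for a three-token vocabulary (made-up numbers).
logits = [2.2, -1.0, 0.5]
adjusted = apply_repetition_penalty(logits, seen_token_ids=[0, 1], penalty=1.1)
print(adjusted)
```

Tokens 0 and 1 were already generated, so both become less likely, while token 2 keeps its score. A penalty of 1.0 would leave every score unchanged.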

# 📄 **Prompt Engineering**


Let's see if this works with a very basic example:

In [None]:
response = generator("What is 5+1?")
print(response[0]["generated_text"])

What is 5+1?
A: 6


Perfect! It can handle a very basic question. For the purpose of keyword extraction, let's explore whether it can handle a bit more complexity.

In [None]:
prompt = """
I have the following document:
* The website mentions that it only takes a couple of days to deliver but I still have not received mine

Extract 5 keywords from that document.
"""
response = generator(prompt)
print(response[0]["generated_text"])


I have the following document:
* The website mentions that it only takes a couple of days to deliver but I still have not received mine

Extract 5 keywords from that document.

**Answer:**
1. Website
2. Mentions
3. Deliver
4. Couple
5. Days
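
Both outputs so far repeat the prompt before the answer, because the pipeline returns the prompt and the completion as a single string. A minimal sketch of separating the two is shown below; `strip_prompt` is a hypothetical helper, and the example string is copied from the first output above rather than produced by a live model call.

```python
def strip_prompt(prompt: str, generated_text: str) -> str:
    """Return only the model's continuation, without the echoed prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the full text if the prompt was not echoed verbatim.
    return generated_text.strip()


prompt = "What is 5+1?"
generated_text = "What is 5+1?\nA: 6"  # copied from the output shown earlier
answer = strip_prompt(prompt, generated_text)
print(answer)
```

The 🤗 Transformers text-generation pipeline also accepts `return_full_text=False`, which should have the same effect without manual slicing.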


In [None]:
example_prompt = """
<s>[INST]
I have the following document:
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.

Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] meat, beef, eat, eating, emissions, steak, food, health, processed, chicken</s>"""

In [None]:
keyword_prompt = """
[INST]

I have the following document:
- [DOCUMENT]

Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]
"""

In [None]:
prompt = example_prompt + keyword_prompt
print(prompt)


<s>[INST]
I have the following document:
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.

Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] meat, beef, eat, eating, emissions, steak, food, health, processed, chicken</s>
[INST]

I have the following document:
- [DOCUMENT]

Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]



In [None]:
from keybert.llm import TextGeneration
from keybert import KeyLLM

# Load it in KeyLLM
llm = TextGeneration(generator, prompt=prompt)
kw_model = KeyLLM(llm)
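
The prompt passed to `TextGeneration` contains the `[DOCUMENT]` placeholder, which has to be filled in once per input document before the text is sent to the model. The sketch below illustrates that substitution with plain string replacement. It is an assumption about the general mechanism, not KeyBERT's actual code; `fill_prompt` and the shortened template are hypothetical stand-ins.

```python
# Shortened stand-in for the few-shot prompt assembled above.
prompt_template = (
    "[INST]\n"
    "I have the following document:\n"
    "- [DOCUMENT]\n\n"
    "Please give me the keywords that are present in this document "
    "and separate them with commas.\n"
    "[/INST]"
)


def fill_prompt(template: str, document: str) -> str:
    """Insert a single document into the prompt template."""
    return template.replace("[DOCUMENT]", document)


example_documents = [
    "The website mentions that it only takes a couple of days to deliver "
    "but I still have not received mine.",
    "I received my package!",
]

# One fully formed prompt per document.
prompts = [fill_prompt(prompt_template, doc) for doc in example_documents]
print(prompts[1])
```

Because every document gets its own prompt, the number of model calls grows with the number of documents.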

In [None]:
documents = [
"The website mentions that it only takes a couple of days to deliver but I still have not received mine.",
"I received my package!",
"Whereas the most powerful LLMs have generally been accessible only through limited APIs (if at all), Meta released LLaMA's model weights to the research community under a noncommercial license."
]

keywords = kw_model.extract_keywords(documents)
keywords

[['website',
  'mention',
  'days',
  'deliver',
  'receive',
  'coupler',
  'still',
  'have',
  'not',
  'received',
  'mine.'],
 ['package',
  'received',
  'delivery',
  'shipment',
  'order',
  'product',
  'item',
  'box',
  'mail',
  'courier'],
 ['LLM',
  'API',
  'accessibility',
  'release',
  'license',
  'research',
  'community',
  'model',
  'weights',
  'Meta',
  'power',
  'availability',
  'commercial',
  'noncommercial',
  'language',
  'models',
  'development',
  'collaboration',
  'innovation',
  'openness',
  'sharing',
  'knowledge',
  'resources']]
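
The raw keywords above are noisy: `'mine.'` keeps its trailing period, and `'coupler'` does not occur in the first document at all. A minimal post-processing sketch follows, assuming the model answers with a comma-separated string. Both helper functions are hypothetical and not part of KeyBERT, and the substring check is a deliberately crude way to test whether a keyword is present.

```python
import string


def parse_keywords(response: str) -> list:
    """Split a comma-separated response into cleaned, de-duplicated keywords."""
    seen = set()
    keywords = []
    for raw in response.split(","):
        # Trim whitespace and leading/trailing punctuation such as a final period.
        keyword = raw.strip().strip(string.punctuation).strip()
        key = keyword.lower()
        if keyword and key not in seen:
            seen.add(key)
            keywords.append(keyword)
    return keywords


def keep_present(keywords: list, document: str) -> list:
    """Keep only keywords that literally occur in the document (case-insensitive)."""
    text = document.lower()
    return [kw for kw in keywords if kw.lower() in text]


# Response reconstructed from the first keyword list shown above.
response = "website, mention, days, deliver, receive, coupler, still, have, not, received, mine."
document = (
    "The website mentions that it only takes a couple of days to deliver "
    "but I still have not received mine."
)

keywords_clean = parse_keywords(response)
keywords_present = keep_present(keywords_clean, document)
print(keywords_present)
```

Filtering on literal presence removes `coupler` here, but it would also drop legitimate keywords that the model paraphrased, so whether to apply it depends on the use case.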

In [None]:
from keybert import KeyLLM
from sentence_transformers import SentenceTransformer

# Extract embeddings; a separate name avoids overwriting the Mistral `model` loaded earlier
embedding_model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings = embedding_model.encode(documents, convert_to_tensor=True)

In [None]:
# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
keywords = kw_model.extract_keywords(documents, embeddings=embeddings, threshold=.5)

In [None]:
keywords

[['website',
  'mention',
  'days',
  'deliver',
  'receive',
  'coupler',
  'still',
  'have',
  'not',
  'received',
  'mine.'],
 ['website',
  'mention',
  'days',
  'deliver',
  'receive',
  'coupler',
  'still',
  'have',
  'not',
  'received',
  'mine.'],
 ['LLM',
  'API',
  'accessibility',
  'release',
  'license',
  'research',
  'community',
  'model',
  'weights',
  'Meta',
  'power',
  'availability',
  'commercial',
  'noncommercial',
  'language',
  'models',
  'development',
  'collaboration',
  'innovation',
  'openness',
  'sharing',
  'knowledge',
  'resources']]
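
In the output above, the first two documents receive identical keywords, which is consistent with them being treated as similar at `threshold=.5`, so that one set of keywords is shared between them. The sketch below illustrates the idea with a simple greedy grouping over cosine similarity. The toy vectors and both functions are hypothetical, and the library's own clustering may work differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def group_by_threshold(vectors, threshold):
    """Greedy grouping: a document joins the first group whose
    representative (its first member) is at least `threshold` similar."""
    groups = []
    for index, vector in enumerate(vectors):
        for group in groups:
            if cosine_similarity(vectors[group[0]], vector) >= threshold:
                group.append(index)
                break
        else:
            # No sufficiently similar group exists, so start a new one.
            groups.append([index])
    return groups


# Toy 3-dimensional "embeddings" (made-up numbers, not real model output).
toy_embeddings = [
    [1.0, 0.2, 0.0],  # delivery complaint
    [0.9, 0.4, 0.1],  # package received
    [0.0, 0.1, 1.0],  # LLaMA release
]

groups = group_by_threshold(toy_embeddings, threshold=0.5)
print(groups)
```

With one LLM call per group instead of one per document, raising the threshold produces more groups and therefore more calls, while lowering it shares keywords across less similar documents.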

In [None]:
from keybert import KeyBERT

# Load the LLM in KeyBERT, which embeds the documents with the given model
kw_model = KeyBERT(llm=llm, model='BAAI/bge-small-en-v1.5')

# Extract keywords
keywords = kw_model.extract_keywords(documents, threshold=.5)

In [None]:
keywords

[['website',
  'mention',
  'days',
  'deliver',
  'receive',
  'coupler',
  'still',
  'have',
  'not',
  'received',
  'mine.'],
 ['website',
  'mention',
  'days',
  'deliver',
  'receive',
  'coupler',
  'still',
  'have',
  'not',
  'received',
  'mine.'],
 ['LLM',
  'API',
  'accessibility',
  'release',
  'license',
  'research',
  'community',
  'model',
  'weights',
  'Meta',
  'power',
  'availability',
  'commercial',
  'noncommercial',
  'language',
  'models',
  'development',
  'collaboration',
  'innovation',
  'openness',
  'sharing',
  'knowledge',
  'resources']]
