<a href="https://colab.research.google.com/github/Alfred9/Exploring-LLMs/blob/main/Keyword%20Extraction%20with%20Mistral%207B/Keyword_Extraction_with_Mistral_7B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Keyword Extraction with Mistral 7B**





Keyword extraction is a core natural language processing (NLP) task: it identifies the terms in a text that best describe its content, which supports information retrieval, summarization, and categorization. In this notebook, we extract keywords with the Mistral 7B Instruct model, loaded as a quantized GGUF file through ctransformers and connected to KeyBERT's KeyLLM.

Let's start by installing the packages used throughout this example: sentence-transformers, KeyBERT, ctransformers (with CUDA support), and transformers.

In [1]:
%%capture
!pip install --upgrade git+https://github.com/UKPLab/sentence-transformers
!pip install keybert ctransformers[cuda]
!pip install --upgrade git+https://github.com/huggingface/transformers

In [2]:
from ctransformers import AutoModelForCausalLM

# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
    hf=True
)

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]
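
The comment in the loading cell says to set `gpu_layers` to 0 when no GPU acceleration is available. A minimal sketch of making that choice automatically is shown below. It assumes that finding `nvidia-smi` on the PATH is a good enough hint that a CUDA GPU is present, which is only a rough heuristic, and `pick_gpu_layers` is a hypothetical helper, not part of ctransformers.

```python
import shutil


def pick_gpu_layers(requested: int = 50) -> int:
    """Return `requested` if the nvidia-smi tool is on PATH, otherwise 0 (CPU only).

    Heuristic only: the presence of nvidia-smi does not guarantee that the
    installed ctransformers build can actually use the GPU.
    """
    has_nvidia_tool = shutil.which("nvidia-smi") is not None
    return requested if has_nvidia_tool else 0


gpu_layers = pick_gpu_layers()
print(f"gpu_layers={gpu_layers}")
```

The resulting value could then be passed as `gpu_layers=gpu_layers` when loading the model.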


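Now, let's set up a Hugging Face Transformers pipeline. The tokenizer is loaded from the original Mistral repository rather than from the GGUF repository used for the quantized weights.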

In [3]:
from transformers import AutoTokenizer, pipeline

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

# Pipeline
generator = pipeline(
    model=model, tokenizer=tokenizer,
    task='text-generation',
    max_new_tokens=50,       # cap the length of each completion
    repetition_penalty=1.1   # values above 1 discourage repeated tokens
)

# 📄 **Prompt Engineering**


Let's see if this works with a very basic example:

In [4]:
response = generator("List 5 interesting facts about Spain?")
print(response[0]["generated_text"])

List 5 interesting facts about Spain?

1. Spain is the fourth most populous country in Europe, with a population of over 46 million people.
2. The official language of Spain is Spanish, also known as Castilian. However, there are several regional


The model handles a simple question, although the answer is cut off mid-sentence because `max_new_tokens` is set to 50. For keyword extraction, let's see whether it can handle a slightly more complex prompt.

In [5]:
prompt = """
I have the following document:
* Successful immunotherapy relies on triggering complex responses involving T cell dynamics in tumors and the periphery. Characterizing these responses remains challenging using static human single-cell atlases or mouse models.

Extract 5 keywords from that document.
"""
response = generator(prompt)
print(response[0]["generated_text"])


I have the following document:
* Successful immunotherapy relies on triggering complex responses involving T cell dynamics in tumors and the periphery. Characterizing these responses remains challenging using static human single-cell atlases or mouse models.

Extract 5 keywords from that document.

**Answer:**
1. Immunotherapy
2. T cells
3. Dynamics
4. Triggering
5. Responses
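
The output above repeats the prompt before the answer and returns the keywords as a numbered list. A minimal sketch of turning such a response into a Python list is shown below. The helper names `strip_prompt` and `parse_numbered_keywords` are hypothetical, and the sample strings are shortened copies of the output above. The text-generation pipeline also accepts `return_full_text=False`, which should make the first helper unnecessary.

```python
import re


def strip_prompt(prompt_text: str, generated_text: str) -> str:
    """Return only the newly generated part of a text-generation result."""
    if generated_text.startswith(prompt_text):
        return generated_text[len(prompt_text):]
    return generated_text


def parse_numbered_keywords(completion: str) -> list[str]:
    """Collect the items from lines such as '1. Immunotherapy'."""
    keywords = []
    for line in completion.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+?)\s*$", line)
        if match:
            keywords.append(match.group(1))
    return keywords


# Shortened stand-ins for the prompt and the generated text shown above
prompt_text = "Extract 5 keywords from that document.\n"
generated = (
    prompt_text
    + "\n**Answer:**\n1. Immunotherapy\n2. T cells\n3. Dynamics\n4. Triggering\n5. Responses\n"
)

completion = strip_prompt(prompt_text, generated)
print(parse_numbered_keywords(completion))
```

Parsing free-form output like this is fragile, since the model may change its formatting from one run to the next.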


The model returns reasonable keywords, but it wraps them in a numbered list under an "**Answer:**" header, which is awkward to parse. Mistral 7B Instruct is trained on instructions wrapped in `[INST]` and `[/INST]` tags, so the next cells build a one-shot prompt in that template. The first part is a completed example that demonstrates the desired comma-separated output:

In [6]:
# The example answer only demonstrates the output format (comma-separated keywords);
# its keywords are unrelated to the example document.
example_prompt = """
<s>[INST]
I have the following document:
- Successful immunotherapy relies on triggering complex responses involving T cell dynamics in tumors and the periphery. Characterizing these responses remains challenging using static human single-cell atlases or mouse models.

Please give me the keywords that are present in this document and separate them with commas.
Make sure to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] meat, beef, eat, eating, emissions, steak, food, health, processed, chicken</s>"""

The second part is the actual request. KeyLLM replaces the `[DOCUMENT]` tag with each input document:

In [7]:
keyword_prompt = """
[INST]

I have the following document:
- [DOCUMENT]

Please give me the keywords that are present in this document and separate them with commas.
Make sure to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]
"""

Concatenating the two parts gives the full prompt:

In [8]:
prompt = example_prompt + keyword_prompt
print(prompt)


<s>[INST]
I have the following document:
- Successful immunotherapy relies on triggering complex responses involving T cell dynamics in tumors and the periphery. Characterizing these responses remains challenging using static human single-cell atlases or mouse models.

Please give me the keywords that are present in this document and separate them with commas.
Make sure to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] meat, beef, eat, eating, emissions, steak, food, health, processed, chicken</s>
[INST]

I have the following document:
- [DOCUMENT]

Please give me the keywords that are present in this document and separate them with commas.
Make sure to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]
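
The prompt above is assembled by hand from two strings. A minimal sketch of building the same kind of few-shot prompt from a list of examples is shown below. `build_keyword_prompt` is a hypothetical helper, and the example document and its keywords are made up for illustration so that the example answer matches its document.

```python
INSTRUCTION = (
    "Please give me the keywords that are present in this document "
    "and separate them with commas.\n"
    "Make sure to only return the keywords and say nothing else."
)


def build_keyword_prompt(examples, instruction=INSTRUCTION):
    """Assemble a few-shot prompt in the [INST] ... [/INST] template.

    examples: list of (document, keywords) pairs used as completed turns.
    The final turn keeps the [DOCUMENT] tag so it can be filled in later.
    """
    parts = []
    for index, (document, keywords) in enumerate(examples):
        bos = "<s>" if index == 0 else ""  # start-of-sequence token only once
        parts.append(
            f"{bos}[INST]\nI have the following document:\n- {document}\n\n"
            f"{instruction}\n[/INST] {', '.join(keywords)}</s>"
        )
    parts.append(
        "[INST]\nI have the following document:\n- [DOCUMENT]\n\n"
        f"{instruction}\n[/INST]\n"
    )
    return "\n".join(parts)


# Made-up example pair, for illustration only
examples = [
    (
        "The website mentions that it only takes a couple of days to deliver "
        "but I still have not received mine.",
        ["website", "deliver", "days", "received"],
    )
]

few_shot_prompt = build_keyword_prompt(examples)
print(few_shot_prompt)
```

Keeping the examples in a list makes it easy to try several example pairs, or none at all, without editing the template text.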



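With the prompt in place, we can pass the pipeline to KeyBERT's `KeyLLM` through its `TextGeneration` wrapper:

In [9]:
from keybert.llm import TextGeneration
from keybert import KeyLLM

# Wrap the pipeline and the prompt template, then load the wrapper into KeyLLM
llm = TextGeneration(generator, prompt=prompt)
kw_model = KeyLLM(llm)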

In [10]:
documents = [
"Successful immunotherapy relies on triggering complex responses involving T cell dynamics in tumors and the periphery. Characterizing these responses remains challenging using static human single-cell atlases or mouse models."
]

keywords = kw_model.extract_keywords(documents)
keywords

[['Successful',
  'Immunotherapy',
  'Triggering',
  'Complex',
  'Responses',
  'T cell',
  'Dynamics',
  'Tumors',
  'Periphery',
  'Characterizing',
  'Static',
  'Human',
  'Single-cell',
  'Atlases',
  'Mouse',
  'Models']]
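
The list above contains almost every content word of the document, including generic terms such as "Successful" and "Complex". A minimal post-processing sketch is shown below: it removes case-insensitive duplicates, drops keywords that do not occur in the document, and filters a caller-supplied set of generic terms. `clean_keywords` and the sample `generic_terms` set are assumptions for illustration, not part of KeyBERT.

```python
def clean_keywords(keywords, document, generic_terms=frozenset()):
    """Deduplicate keywords and keep only those found in the document.

    generic_terms: lowercase words to discard, chosen by the caller.
    """
    seen = set()
    cleaned = []
    document_lower = document.lower()
    for keyword in keywords:
        key = keyword.strip().lower()
        if not key or key in seen or key in generic_terms:
            continue
        if key not in document_lower:
            continue  # the model produced a keyword that is not in the text
        seen.add(key)
        cleaned.append(keyword.strip())
    return cleaned


document = (
    "Successful immunotherapy relies on triggering complex responses "
    "involving T cell dynamics in tumors and the periphery."
)
raw_keywords = ["Immunotherapy", "immunotherapy ", "T cell", "Complex", "Chicken"]

print(clean_keywords(raw_keywords, document, generic_terms={"complex"}))
```

The substring check is deliberately simple; it would also accept a keyword that only matches part of a longer word.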

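KeyLLM can also take document embeddings. Documents whose embeddings are more similar than a given threshold are grouped, so the LLM does not have to process every near-duplicate document separately. With the single document used here there is nothing to group, so the keywords stay the same.

In [11]:
from sentence_transformers import SentenceTransformer

# Extract embeddings; a separate variable name avoids overwriting the LLM stored in `model`
embedding_model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings = embedding_model.encode(documents, convert_to_tensor=True)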

In [12]:
# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords; threshold is the minimum embedding similarity for documents to be grouped
keywords = kw_model.extract_keywords(documents, embeddings=embeddings, threshold=.5)

In [13]:
keywords

[['Successful',
  'Immunotherapy',
  'Triggering',
  'Complex',
  'Responses',
  'T cell',
  'Dynamics',
  'Tumors',
  'Periphery',
  'Characterizing',
  'Static',
  'Human',
  'Single-cell',
  'Atlases',
  'Mouse',
  'Models']]
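
To make the role of `threshold` more concrete, here is a simplified sketch of grouping documents by cosine similarity. It is not KeyBERT's implementation, which to our understanding relies on community detection from sentence-transformers. The greedy strategy, the `group_similar` helper, and the two-dimensional toy embeddings are all assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_similar(vectors, threshold):
    """Greedily group vector indices.

    Each vector joins the first group whose representative (its first member)
    has cosine similarity >= threshold; otherwise it starts a new group.
    """
    groups = []
    for index, vector in enumerate(vectors):
        for group in groups:
            if cosine(vectors[group[0]], vector) >= threshold:
                group.append(index)
                break
        else:
            groups.append([index])
    return groups


# Toy embeddings: the first two point in nearly the same direction
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(group_similar(toy_embeddings, threshold=0.5))  # [[0, 1], [2]]
```

With grouping like this, one LLM call per group could be enough, and its keywords could be reused for the other members of the group.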

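Alternatively, pass the LLM to `KeyBERT`, which computes the embeddings itself with the given embedding model:

In [14]:
from keybert import KeyBERT

# Load the LLM into KeyBERT together with an embedding model
kw_model = KeyBERT(llm=llm, model='BAAI/bge-small-en-v1.5')

# Extract keywords
keywords = kw_model.extract_keywords(documents, threshold=.5)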

In [15]:
keywords

[['Successful',
  'Immunotherapy',
  'Triggering',
  'Complex',
  'Responses',
  'T cell',
  'Dynamics',
  'Tumors',
  'Periphery',
  'Characterizing',
  'Static',
  'Human',
  'Single-cell',
  'Atlases',
  'Mouse',
  'Models']]
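
According to the KeyBERT documentation, when an LLM is combined with KeyBERT, the keywords found by the embedding model can be handed to the LLM as candidates through a `[CANDIDATES]` tag in the prompt, next to `[DOCUMENT]`. The prompt used in this notebook has no such tag. The sketch below only illustrates how such tags could be filled in. `fill_prompt`, `candidate_template`, and the sample values are hypothetical, and the real substitution happens inside KeyBERT.

```python
def fill_prompt(template: str, document: str, candidates: list[str]) -> str:
    """Replace the [DOCUMENT] and [CANDIDATES] tags in a prompt template."""
    filled = template.replace("[DOCUMENT]", document)
    return filled.replace("[CANDIDATES]", ", ".join(candidates))


# Hypothetical template that asks the model to refine candidate keywords
candidate_template = """[INST]
I have the following document:
- [DOCUMENT]

With the following candidate keywords:
[CANDIDATES]

Based on the document and the candidates, return the final keywords separated by commas and say nothing else.
[/INST]
"""

filled = fill_prompt(
    candidate_template,
    "T cell dynamics shape immunotherapy responses.",
    ["T cell dynamics", "immunotherapy"],
)
print(filled)
```

A template like this could be passed as the `prompt` argument of `TextGeneration` in place of the one built earlier, assuming the tags behave as documented.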