<a href="https://colab.research.google.com/github/DanielHolzwart/RAG-Question-answering-with-Mistral7b/blob/main/Question_Answering_with_Mistral7b_and_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%%capture
# optimum optimizes and accelerates the deployment of transformer models
!pip install optimum
# the model we are going to load (Mistral-7B) is quantized with GPTQ, which is why we need the auto-gptq dependency
!pip install auto-gptq
# supports running models at lower precision to save memory
!pip install bitsandbytes

Next we are going to load the TheBloke/Mistral-7B-Instruct-v0.2-GPTQ model. Mistral-7B is a 7-billion-parameter LLM designed for diverse applications. "Instruct" means that it has already been fine-tuned for instruction-following tasks, so it is optimized to generate responses to specific instructions or questions. GPTQ is a post-training quantization method: the Q indicates that this checkpoint stores quantized (reduced-precision) weights, which lowers the memory footprint.
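To get a feeling for why quantization matters here, the following back-of-the-envelope sketch estimates the memory needed for the weights alone. The parameter count of roughly 7.2 billion and the 4-bit setting are assumptions made for illustration (4-bit is a common GPTQ configuration); the helper `weight_memory_gb` is hypothetical and not part of any library.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumption: roughly 7.2 billion parameters
n_params = 7.2e9

fp16_gb = weight_memory_gb(n_params, 16)  # half precision
int4_gb = weight_memory_gb(n_params, 4)   # assumed 4-bit quantization

print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

The real checkpoint is somewhat larger than the 4-bit estimate, since quantization scales and any layers kept at higher precision add overhead, and activations need extra memory at inference time.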


In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

For the settings in the following code:
*   device_map="auto" automatically assigns model layers to the available hardware (Hugging Face's accelerate library is used under the hood for this)
*   trust_remote_code=False is a security measure that prevents the execution of possibly unsafe code shipped with the model repository
*   revision="main" selects the main branch of the model repository https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GPTQ






In [4]:
model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=False,
    revision="main")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of the model checkpoint at TheBloke/Mistral-7B-Instruct-v0.2-GPTQ were not used when initializing MistralForCausalLM: ['model.layers.0.mlp.down_proj.bias', 'model.layers.0.mlp.gate_proj.bias', 'model.layers.0.mlp.up_proj.bias', 'model.layers.0.self_attn.k_proj.bias', 'model.layers.0.self_attn.o_proj.bias', 'model.layers.0.self_attn.q_proj.bias', 'model.layers.0.self_attn.v_proj.bias', 'model.layers.1.mlp.down_proj.bias', 'model.layers.1.mlp.gate_proj.bias', 'model.layers.1.mlp.up_proj.bias', 'model.layers.1.self_attn.k_proj.bias', 'model.layers.1.self_attn.o_proj.bias', 'model.layers.1.self_attn.q_proj.bias', 'model.layers.1.self_attn.v_proj.bias', 'model.layers.10.mlp.down_proj.bias', 'model.layers.10.mlp.gate_proj.bias', 'model.layers.10.mlp.up_proj.bias', 'model.layers.10.self_attn.k_proj.bias', 'model.layers.10.self_attn.o_proj.bias', 'model.layers.10.self_attn.q_proj.bias', 'model.layers.10.self_attn.v_proj.bias', 'model.layers.11.mlp.down_proj.bias', 'model.layers.11

Initialize the tokenizer of Mistral7b.

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

A quick test question to see the output. Note that the prompt for Mistral7b must be wrapped in instruction markers: it starts with [INST] and ends with [/INST].

In [6]:
# model in evaluation mode (dropout modules are deactivated)
model.eval()

# craft prompt
comment = "How are you doing, Mistral7b?"
prompt=f'''[INST] {comment} [/INST]'''

# tokenize input
inputs = tokenizer(prompt, return_tensors="pt")
# pad_token_id is set explicitly only to avoid warning messages;
# for text generation from a single prompt it is not strictly needed
outputs = model.generate(
    input_ids=inputs["input_ids"].to("cuda"),
    attention_mask=inputs["attention_mask"].to("cuda"),
    max_new_tokens=140,
    pad_token_id=model.config.eos_token_id,
)

In [7]:
print(tokenizer.decode(outputs[0]))

<s> [INST] How are you doing, Mistral7b? [/INST] I'm just a computer program, I don't have the ability to feel emotions or do physical activities. I'm here to help answer any questions you might have to the best of my ability. Is there something specific you'd like to ask about?</s>
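Since the same wrapping is repeated for every question in this notebook, it can be factored into a small helper. This is only a sketch: `build_prompt` is a hypothetical name, and it reproduces the single-turn format used in the cell above rather than any official template.

```python
def build_prompt(instruction: str) -> str:
    """Wrap a user instruction in the [INST] ... [/INST] markers used in this notebook."""
    return f"[INST] {instruction.strip()} [/INST]"

example_prompt = build_prompt("How are you doing, Mistral7b?")
print(example_prompt)
```

An alternative worth considering is the tokenizer's `apply_chat_template` method from transformers, which builds the prompt from the template stored with the model instead of a hand-written string.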


Let's make the output look nicer.

In [8]:
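import textwrap

def print_mistral(text):
    # locate the instruction markers of the prompt
    index_inst1 = text.find("[INST]")
    index_inst2 = text.find("[/INST]")

    # the prompt sits between "[INST]" and "[/INST]"
    prompt = text[(index_inst1 + len("[INST]")):index_inst2].strip()
    # the answer ends at the end-of-sequence token; if generation stopped at
    # max_new_tokens there is no "</s>", so keep the text up to its end
    index_s = text.find("</s>")
    if index_s == -1:
        index_s = len(text)
    text_truncated = text[(index_inst2 + len("[/INST]")):index_s]

    question = f'Question: {prompt}'
    answer = 'Answer:' + '\n'.join(textwrap.wrap(text_truncated, width=80))

    print(f'{question} \n\n', answer)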

print_mistral(tokenizer.decode(outputs[0]))

Question: How are you doing, Mistral7b?  

 Answer: I'm just a computer program, I don't have the ability to feel emotions or do
physical activities. I'm here to help answer any questions you might have to the
best of my ability. Is there something specific you'd like to ask about?


The model seems to work. Let us ask a more difficult question: "What is RAG in terms of LLMs?"

In [9]:
comment = "What is RAG in terms of LLMs?"
prompt=f'''[INST] {comment} [/INST]'''

# tokenize input
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    input_ids=inputs["input_ids"].to("cuda"),
    attention_mask=inputs["attention_mask"].to("cuda"),
    max_new_tokens=140,
    pad_token_id=model.config.eos_token_id,
)

In [10]:
print_mistral(tokenizer.decode(outputs[0]))

Question: What is RAG in terms of LLMs?  

 Answer: In the context of Legal Project Management and Law Firms, RAG stands for Red,
Amber, Green. It is a system used to indicate the status or health of a legal
matter or project.  * Red (R) indicates that there is a significant problem or
risk that requires immediate attention. * Amber (A) indicates that there is a
potential problem or risk that needs to be monitored closely. * Green (G)
indicates that the matter or project is progressing as planned.  The RAG system
is used to help law firms and legal teams manage their caseloads effectively and
to identify and address potential issues before they become major problem


This is certainly not the answer we anticipated, even though we gave the model not only the question "What is RAG" but also the context "in terms of LLMs". Fine-tuning the model for question answering on machine learning books would likely produce better answers. However, instead of fine-tuning, we will use RAG to retrieve relevant information from the book "Natural Language Processing with Transformers" (highly recommended) and ask the model the same question while also providing the retrieved passages as context. Essentially, we hand the model the relevant material, and all it has to do is condense it into a precise answer.

For retrieval we are going to use llama-index. The well-known five-line starter example from its documentation (https://docs.llamaindex.ai/en/stable/) shows the basic scheme:

      from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

      documents = SimpleDirectoryReader("data").load_data()
      index = VectorStoreIndex.from_documents(documents)
      query_engine = index.as_query_engine()
      response = query_engine.query("Some question about the data should go here")
      print(response)

This lets us retrieve relevant context efficiently. Here, data is the folder containing the relevant documents.


In [11]:
%%capture
!pip install llama-index
!pip install llama-index-embeddings-huggingface

In [12]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

The default embedding model is an OpenAI embedding model. Accessing it via Settings.embed_model raises an error because we are working without an API key.

In [13]:
Settings.embed_model
Settings.llm

ValueError: 
******
Could not load OpenAI embedding model. If you intended to use OpenAI, please check your OPENAI_API_KEY.
Original error:
No API key found for OpenAI.
Please set either the OPENAI_API_KEY environment variable or openai.api_key prior to initialization.
API keys can be found or created at https://platform.openai.com/account/api-keys

Consider using embed_model='local'.
Visit our documentation for more embedding options: https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings.html#modules
******

In [14]:
# import any embedding model on HF hub
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# by default, the LLM is an OpenAI model; we disable it because we call Mistral7b ourselves
Settings.llm = None
Settings.chunk_size = 256
Settings.chunk_overlap = 25

LLM is explicitly disabled. Using MockLLM.
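To illustrate what chunk_size = 256 and chunk_overlap = 25 mean, here is a simplified sketch of fixed-size chunking with overlap. It is not llama-index's actual splitter, which to my understanding also takes sentence boundaries into account; the function `chunk_with_overlap` and the toy token list are assumptions made for illustration.

```python
def chunk_with_overlap(tokens, chunk_size=256, chunk_overlap=25):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

# toy "document" of 600 tokens, represented by their positions
toy_tokens = list(range(600))
toy_chunks = chunk_with_overlap(toy_tokens)
print([len(chunk) for chunk in toy_chunks])
```

The overlap means that a sentence cut at a chunk border still appears, at least partly, in the neighbouring chunk, which reduces the chance that relevant context is lost at the boundary.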


The next step is to define the path, load the documents with the SimpleDirectoryReader, and index them.

In [15]:
import os
# folder on the mounted Google Drive that contains the book
cv_path = os.path.join(os.getcwd(), 'drive/My Drive/2024-09-26 RAG with Mistral7b/NLP with Transformers')
documents = SimpleDirectoryReader(cv_path).load_data()
index = VectorStoreIndex.from_documents(documents)

In [16]:
# set number of docs to retrieve
top_k = 3

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=top_k,
)

In [17]:
# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)],
)
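Conceptually, the retriever and the postprocessor configured above rank the chunks by similarity to the query, keep the best top_k, and then drop everything below the cutoff. The following toy sketch mimics that behaviour with plain Python lists. The two-dimensional vectors and the functions `cosine_similarity` and `retrieve` are made up for illustration and are not the llama-index implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, chunk_vecs, top_k=3, similarity_cutoff=0.5):
    """Return (index, score) pairs of the top_k chunks that also pass the cutoff."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    top = scored[:top_k]
    return [(i, score) for i, score in top if score >= similarity_cutoff]

# hypothetical embeddings of four chunks and one query
toy_vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
hits = retrieve([1.0, 0.0], toy_vectors)
print([index for index, _ in hits])
```

In this toy example only two of the three top-ranked chunks survive the cutoff, which shows that a query can return fewer than top_k nodes.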

With everything set up, let us ask the same question as above. This time, instead of relying only on the very general hint "in terms of LLMs", we provide the model with the passages retrieved from the book.

In [18]:
query = "What is RAG?"
response = query_engine.query(query)

In [19]:
# reformat response. Every output block is a snippet of the book
context = "Context:\n"
# iterate over the nodes actually returned: the similarity cutoff can leave fewer than top_k
for node in response.source_nodes:
    context = context + node.text + "\n\n"

print(context)

Context:
There are two types of RAG models to choose from:
RAG-Sequence
Uses the same retrieved document to generate the complete answer. In particular,
the top k documents from the retriever are fed to the generator, which produces
an output sequence for each document, and the result is marginalized to obtain
the best answer.
Going Beyond Extractive QA | 205

Let’s now give RAG a spin by feeding in some queries about the Amazon Fire tablet
from before. To simplify the querying, we’ll write a simple function that takes the
query and prints out the top answers:
def generate_answers (query, top_k_generator =3):
    preds = pipe.run(query=query, top_k_generator =top_k_generator ,
                     top_k_retriever =5, filters={"item_id" :["B0074BW614" ]})
    print(f"Question: {preds['query']} \n")
    for idx in range(top_k_generator ):
        print(f"Answer {idx+1}: {preds['answers'][idx]['answer']}" )
OK, now we’re ready to give it a test:
generate_answers (query)
Question: Is it go

In [20]:
# craft prompt
comment = "What is RAG in terms of LLMs?"
prompt=f'''[INST] {comment} Answer this question by using the following context: {context} [/INST]'''

# tokenize input
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    input_ids=inputs["input_ids"].to("cuda"),
    attention_mask=inputs["attention_mask"].to("cuda"),
    max_new_tokens=140,
    pad_token_id=model.config.eos_token_id,
)

Note that the following output looks a bit messy because it also prints the context. However, the answer is much better now!

In [21]:
print_mistral(tokenizer.decode(outputs[0]))

Question: What is RAG in terms of LLMs? Answer this question by using the following context: Context:
There are two types of RAG models to choose from:
RAG-Sequence
Uses the same retrieved document to generate the complete answer. In particular,
the top k documents from the retriever are fed to the generator, which produces
an output sequence for each document, and the result is marginalized to obtain
the best answer.
Going Beyond Extractive QA | 205

Let’s now give RAG a spin by feeding in some queries about the Amazon Fire tablet
from before. To simplify the querying, we’ll write a simple function that takes the
query and prints out the top answers:
def generate_answers (query, top_k_generator =3):
    preds = pipe.run(query=query, top_k_generator =top_k_generator ,
                     top_k_retriever =5, filters={"item_id" :["B0074BW614" ]})
    print(f"Question: {preds['query']} \n")
    for idx in range(top_k_generator ):
        print(f"Answer {idx+1}: {preds['answers'][idx]['an
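Because the decoded text echoes the whole prompt, including the retrieved context, the answer ends up far down in the output. One possible way to show only the answer is sketched below. `extract_answer` is a hypothetical helper, and the sample string is invented for the demonstration rather than taken from the model.

```python
def extract_answer(decoded: str) -> str:
    """Return only the text generated after the last [/INST] marker."""
    marker = "[/INST]"
    start = decoded.rfind(marker)
    answer = decoded[start + len(marker):] if start != -1 else decoded
    # cut at the end-of-sequence token if the model produced one
    end = answer.find("</s>")
    if end != -1:
        answer = answer[:end]
    return answer.strip()

# invented example string in the same layout as the decoded model output
sample = "<s> [INST] What is RAG? Context: ... [/INST] RAG combines a retriever with a generator.</s>"
print(extract_answer(sample))
```

A similar effect could probably be achieved by slicing the generated ids at the prompt length before decoding, so that the prompt tokens are never turned back into text.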