### Part 2: Utilizing the Vector Database with an Open Source LLM Model via LlamaCPP
**Introduction:**  
In this part, we will use the vector DB created in Part 1 to answer questions about the documents it contains.  

In [1]:

# Installing LlamaCPP can be a bit tricky.
# See the full installation guide for GPU support at https://github.com/abetlen/llama-cpp-python

# Basic Windows install
# !pip install llama-cpp-python

# macOS: use the pinned version below
# !pip install llama_cpp_python==0.2.12

In [2]:
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.storage_context import StorageContext
import chromadb


In [3]:
import torch 
# Detect hardware acceleration device
if torch.cuda.is_available():
    device = 'cuda'
    gpu_layers = 50
elif torch.backends.mps.is_available():  # Apple Metal (MPS) backend; requires a PyTorch build with MPS support
    device = 'mps'
    gpu_layers = 1
else:
    device = 'cpu'
    gpu_layers = 0

print(f'Using device: {device}')

Using device: mps


### 1. Load the Foundational LLM via LlamaCPP and ask a question
Load the foundation model from Hugging Face.  
* The first run downloads the model file, which can take up to 10 minutes
* Currently using the GGUF version of [Mistral-11B-OmniMix](https://huggingface.co/TheBloke/Mistral-11B-OmniMix-GGUF) with 4-bit quantization
* Hyperparameters are set in the LlamaCPP constructor below

In [None]:
from llama_index.llms import LlamaCPP

model_url = 'https://huggingface.co/TheBloke/Mistral-11B-OmniMix-GGUF/resolve/main/mistral-11b-omnimix-bf16.Q4_K_M.gguf'


llm = LlamaCPP(
    # We can pass the URL to a GGUF model to download it 
    model_url=model_url,
    model_path=None,
    temperature=0.0,
    max_new_tokens=256,
    context_window=3900,
    generate_kwargs={},
    model_kwargs={'n_gpu_layers': gpu_layers},
    verbose=False,
)


### Default Prompt:
* The default prompt is the template into which the user's {question} is injected

In [5]:
default_prompt = """
    You are an AI assistant who is always happy and helpful.
    Your answers must be appropriate for a 1st grade classroom, so no controversial topics or answers.
    Please answer the following user question:
    {question}

    Please answer that question thinking step by step
    Answer:
    """

#### Sample Logic Question
No RAG Used

In [6]:
user_question = 'There are 3 birds in a nest, 2 fly away and then 3 eggs hatch, how many birds are there now?'

full_question = default_prompt.format(question=user_question)
print(f'Final Prompt: {full_question}\n')
print('Model Answer:')
streaming_response = llm.stream_complete(full_question)
for token in streaming_response:
    print(token.delta, end='', flush=True)

Final Prompt: 
    You are an AI assistant who is always happy and helpful.
    Your answers must be appropriate for a 1st grade classroom, so no controversial topics or answers.
    Please answer the following user question:
    There are 3 birds in a nest, 2 fly away and then 3 eggs hatch, how many birds are there now?

    Please answer that question thinking step by step
    Answer:
    

Model Answer:
1. First, we have to count the number of birds that were originally in the nest. We know that there were 3 birds in the nest.
    2. Then, two of those birds flew away. So, now we have 3 - 2 = 1 bird left in the nest.
    3. Finally, three eggs hatched and became baby birds. So, we add these new birds to the one that was already there: 1 + 3 = 4.
    4. Therefore, now there are 4 birds in the nest.

### 2. Use the LLM with RAG from VectorDB
RAG needs two models:
* An LLM (loaded above)
* An embedding model, to embed the user question into a vector for the vector database (DB) search

Since we used the BGE small model to create the DB, we **must** load that same embedding model here.
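
To illustrate why the query must be embedded and normalized the same way as the stored vectors, here is a minimal sketch. The 3-dimensional vectors are made-up toy values, not real BGE embeddings, and the helper names (`l2_normalize`, `dot`, `cosine`) are hypothetical. It shows that the raw dot product depends on vector length, while for unit-length vectors the dot product equals the cosine similarity.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (toy stand-in for an embedding model's normalize step)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from scratch."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up "embeddings" for illustration only
doc_vec = [3.0, 4.0, 0.0]
query_vec = [6.0, 8.0, 0.0]  # same direction as doc_vec, but twice as long

raw_score = dot(doc_vec, query_vec)  # changes if either vector is rescaled
unit_score = dot(l2_normalize(doc_vec), l2_normalize(query_vec))  # length-independent

print(raw_score)
print(unit_score, cosine(doc_vec, query_vec))
```

If only one side were normalized, the scores would sit on a different scale from the one the DB was built with, which is why the notebook keeps the normalization setting consistent.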

In [7]:
from llama_index.indices.postprocessor import SentenceEmbeddingOptimizer
from llama_index.prompts  import PromptTemplate
from llama_index.llms import ChatMessage, MessageRole
from llama_index.chat_engine.condense_question import CondenseQuestionChatEngine

In [8]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# Use the same embedding model that was used to create the vector DB
embed_model_name = 'BAAI/bge-small-en-v1.5'
embed_model = HuggingFaceEmbedding(
    model_name=embed_model_name,
    device=device,
    normalize=True,  # vectors were normalized when the DB was created, so normalize here too
)


In [9]:
# Load the RAG_VectorDB created in Part 1 from disk
db = chromadb.PersistentClient(path='./RAG_VectorDB')

chroma_collection = db.get_collection('arxiv_PDF_DB')

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

In [10]:
# We can retrieve our metadata
chroma_collection.metadata

{'Included Papers': 'Language Models are Few-Shot Learners',
 'embedding_used': 'BAAI/bge-small-en-v1.5'}

In [11]:
print(chroma_collection.metadata['embedding_used'])
if embed_model_name != chroma_collection.metadata['embedding_used']:
    raise ValueError('Not using the same embedding model as the vector DB!')

BAAI/bge-small-en-v1.5


In [12]:
service_context = ServiceContext.from_defaults(embed_model=embed_model,
                                               llm=llm,
                                               )

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_vector_store(
    vector_store,
    service_context=service_context,
    storage_context = storage_context
)

In [13]:
from llama_index.prompts import Prompt

In [14]:
default_prompt_with_context = (
    """
    You are "PaperBot", an AI assistant for answering questions about an arXiv paper. Assume all questions you receive are about this paper.
    Please limit your answers to the information provided in the "Context:"

    Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
    Context: {context_str}

    Use that context to answer the following question about the paper.
    Keep your answer concise.
    Question: {query_str}
    Answer: """)
    
qa_template = Prompt(default_prompt_with_context)

##### Query with RAG
Now we ask a question, and the following steps happen:
1. The user question is turned into a vector
2. That question vector is compared to the vectors in our vector DB
3. The text of the best k matches (here similarity_top_k = 2) is returned as the context
4. The context and the original (non-vectorized) user question are passed into default_prompt_with_context
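
The four steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not what llama_index or Chroma do internally: `toy_embed` is a hypothetical word-count "embedding" over a tiny fixed vocabulary, the chunks are invented sentences, and `top_k` is a brute-force search.

```python
import math

def toy_embed(text, vocab):
    """Hypothetical stand-in for an embedding model: normalized word counts over a fixed vocabulary."""
    words = text.lower().replace('?', '').replace('.', '').split()
    vec = [float(words.count(term)) for term in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid dividing by zero
    return [x / norm for x in vec]

def top_k(query_vec, stored, k):
    """Return the k stored chunks with the highest dot-product similarity to the query."""
    scored = [(sum(q * d for q, d in zip(query_vec, vec)), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

vocab = ['zero-shot', 'few-shot', 'demonstrations', 'model', 'weather']
chunks = [
    'Zero-shot means the model gets no demonstrations.',
    'Few-shot means the model gets a few demonstrations.',
    'The weather was sunny.',
]
# Toy "vector DB": each chunk stored alongside its vector
stored = [(text, toy_embed(text, vocab)) for text in chunks]

prompt_template = 'Context: {context_str}\nQuestion: {query_str}\nAnswer: '
question = 'How does the model handle zero-shot?'

query_vec = toy_embed(question, vocab)    # step 1: embed the question
matches = top_k(query_vec, stored, k=2)   # steps 2 and 3: compare and keep the best k
prompt = prompt_template.format(          # step 4: fill the prompt template
    context_str='\n'.join(matches), query_str=question
)
print(prompt)
```

Note that the question text itself, not its vector, is what ends up in the prompt; the vector is only used for the search.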


In [15]:
# percentile_cutoff: keep only the top fraction of the most relevant sentences (0.2 keeps roughly the top 20%)
query_engine = index.as_query_engine(streaming=True, similarity_top_k = 2, text_qa_template=qa_template,
    node_postprocessors=[SentenceEmbeddingOptimizer(percentile_cutoff=0.2, embed_model=embed_model)]
)
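
The idea behind a percentile cutoff can be sketched in a few lines. This is a simplified illustration, not the library's actual implementation: the sentence labels and similarity scores are made up, `keep_top_fraction` is a hypothetical helper, and keeping at least one sentence in original order is an assumption of this sketch.

```python
def keep_top_fraction(sentences, scores, fraction):
    """Keep the highest-scoring fraction of sentences (at least one), in their original order."""
    keep = max(1, int(len(sentences) * fraction))
    # Indices of the best-scoring sentences
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    # Restore document order so the shortened text still reads naturally
    return [sentences[i] for i in sorted(ranked)]

# Made-up sentences and query-similarity scores for illustration only
sentences = ['s0', 's1', 's2', 's3', 's4', 's5', 's6', 's7', 's8', 's9']
scores = [0.1, 0.9, 0.3, 0.2, 0.8, 0.05, 0.4, 0.15, 0.25, 0.35]

kept = keep_top_fraction(sentences, scores, fraction=0.2)
print(kept)
```

Trimming each retrieved chunk this way shortens the context passed to the LLM, which matters with a limited context window.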

In [16]:
streaming_response = query_engine.query('What did the paper prove?')
streaming_response.print_response_stream()

print('\nSource:')
for source in streaming_response.metadata.values():
    print(f' {source["source"]}, page: {source["page"]}')


    The paper proved that language models can learn new tasks with only a few examples, without needing large amounts of labeled data.
Source:
 Language Models are Few-Shot Learners, page: 51
 Language Models are Few-Shot Learners, page: 56


Now we can answer questions about our PDF.  
However, the model has no memory of the conversation, as seen in the example below:

In [17]:
# Lacks Conversational Memory
streaming_response = query_engine.query('What did I just ask you?')
streaming_response.print_response_stream()  # The engine keeps no chat history, so the model cannot answer this


    Please provide a specific question related to the given context.

### 3. Conversational Memory with RAG and Sources
The order of operations depends on when the question is asked.
* The first time the user asks a question, their exact question is put into the default prompt

* For every question after the first, the procedure is as follows:
    1. Use the condense_question_prompt with the chat history and the user's follow-up question to generate a standalone question
        * This standalone question rephrases the user's question in the context of the chat history
    2. Pass the standalone question into the default prompt along with the RAG data
    
#### Key Takeaway: For follow-up questions the LLM is used twice
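
The call pattern can be sketched with a stub in place of the real model. Everything here is hypothetical and for illustration only: `CountingLLM` just counts calls, `answer` mirrors the workflow described in this notebook (not the internals of CondenseQuestionChatEngine), and retrieval is replaced by a placeholder string.

```python
class CountingLLM:
    """Hypothetical stub that counts calls instead of running a model."""
    def __init__(self):
        self.calls = 0

    def complete(self, prompt):
        self.calls += 1
        return f'<output {self.calls}>'

def answer(llm, user_question, chat_history):
    """First question: one LLM call. Follow-up: condense, then answer (two calls)."""
    standalone = user_question
    if chat_history:
        history_text = '\n'.join(f'{role}: {text}' for role, text in chat_history)
        # Extra LLM call: rewrite the follow-up as a standalone question
        standalone = llm.complete(
            f'History:\n{history_text}\nFollow-up: {user_question}\nStandalone question:'
        )
    # Answering call: the standalone question plus (placeholder) retrieved context
    reply = llm.complete(f'Context: <retrieved text>\nQuestion: {standalone}\nAnswer:')
    chat_history.append(('user', user_question))
    chat_history.append(('assistant', reply))
    return reply

llm = CountingLLM()
history = []

answer(llm, 'How did they describe zero-shot?', history)
calls_after_first = llm.calls

answer(llm, 'How does that compare to few-shot?', history)
calls_for_follow_up = llm.calls - calls_after_first

print(calls_after_first, calls_for_follow_up)
```

The second call per follow-up is the cost of conversational memory in this design: latency roughly doubles for every question after the first.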

In [18]:
custom_prompt = PromptTemplate("""\
Your objective is to take in the USER QUESTION and add additional context (especially Nouns) from the CHAT HISTORY
rephrase the user question to be a Standalone Question by combining it with the relevant CHAT HISTORY.
The question is always about the arXiv paper, do not modify acronyms.

<CHAT HISTORY>
{chat_history}
                               
<USER QUESTION>
{question}


<Standalone question>
""")

# custom_chat_history: list of ChatMessage objects
custom_chat_history = []

chat_engine = CondenseQuestionChatEngine.from_defaults(
    query_engine=query_engine,
    embed_model=embed_model,
    service_context = service_context,
    condense_question_prompt=custom_prompt,
    chat_history=custom_chat_history,
    verbose=True,
)


In [19]:
# First question: ask the query engine directly
chat_engine.reset()
question = 'How did they describe zero-shot?'

streaming_response = query_engine.query(question)
streaming_response.print_response_stream()

print('\nSource:')
for v in streaming_response.metadata.values():
    print(f' {v["source"]}, page: {v["page"]}')


# The first question went through query_engine rather than chat_engine,
# so append the question and its answer to the chat history manually
chat_engine.chat_history.append(
    ChatMessage(role=MessageRole.USER, content=question)
)
chat_engine.chat_history.append(
    ChatMessage(role=MessageRole.ASSISTANT, content=streaming_response.response_txt)
)

 They described zero-shot as a setting where no demonstrations are provided, and the model only receives a natural language instruction describing the task.
Source:
 Language Models are Few-Shot Learners, page: 7
 Language Models are Few-Shot Learners, page: 4


In [20]:
print(chat_engine.chat_history)

[ChatMessage(role=<MessageRole.USER: 'user'>, content='How did they describe zero-shot?', additional_kwargs={}), ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, content=' They described zero-shot as a setting where no demonstrations are provided, and the model only receives a natural language instruction describing the task.', additional_kwargs={})]


In [21]:
streaming_response = chat_engine.stream_chat('How does that compare to Few-Shot?')
streaming_response.print_response_stream()

print('\nSource:')
for node in streaming_response.sources[0].raw_output.source_nodes:
    print(f' {node.metadata["source"]}, page: {node.metadata["page"]}')
    #print(node.score) # similarity score


Querying with: Can you please elaborate on how zero-shot learning differs from few-shot learning in the context of the given paper?
 In the context of the paper, zero-shot learning and few-shot learning differ primarily in the amount of demonstration data provided to the model during inference.
    In zero-shot learning, no demonstrations are given, and the model only has access to a natural language instruction describing the task. This method is challenging but potentially more robust and generalizable since it avoids spurious correlations that may arise from specific examples.
    On the other hand, in few-shot learning, the model is provided with a small set of demonstrations (usually around 10-20) to learn from. This method strikes a balance between performance and sample efficiency, allowing the model to learn new tasks without requiring large amounts of data.
Source:
 Language Models are Few-Shot Learners, page: 7
 Language Models are Few-Shot Learners, page: 4


In [22]:
print(chat_engine.chat_history)
chat_engine.reset() # clears chat history

[ChatMessage(role=<MessageRole.USER: 'user'>, content='How did they describe zero-shot?', additional_kwargs={}), ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, content=' They described zero-shot as a setting where no demonstrations are provided, and the model only receives a natural language instruction describing the task.', additional_kwargs={}), ChatMessage(role=<MessageRole.USER: 'user'>, content='How does that compare to Few-Shot?', additional_kwargs={}), ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, content=' In the context of the paper, zero-shot learning and few-shot learning differ primarily in the amount of demonstration data provided to the model during inference.\n    In zero-shot learning, no demonstrations are given, and the model only has access to a natural language instruction describing the task. This method is challenging but potentially more robust and generalizable since it avoids spurious correlations that may arise from specific examples.\n    On