### Part 2: Utilizing the Vector Database with an Open Source LLM Model
**Introduction:**  
In this part, we will use the vector DB we created in Part 1 to answer questions based on the documents inside.

In [1]:
# Install cTransformers CPU:
# !pip install ctransformers

# Install cTransformers for MacOS:
# !CT_METAL=1 pip install ctransformers --no-binary ctransformers

In [2]:
from langchain.llms import CTransformers
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

In [3]:
import torch
# Detect hardware acceleration device
if torch.cuda.is_available():
    device = 'cuda'
    gpu_layers = 50
elif torch.backends.mps.is_available():
    device = 'mps'
    gpu_layers = 1
else:
    device = 'cpu'
    gpu_layers = 0

print(f'Using device: {device}')

Using device: mps


### 1. Load the Foundational LLM and ask a question
Import the foundation model from Hugging Face  
* The first run can take up to 10 minutes while the model downloads
* Currently using the GGUF version of [Mistral-11B-OmniMix](https://huggingface.co/TheBloke/Mistral-11B-OmniMix-GGUF) with 4-bit quantization
* Hyperparameters are set in the config
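
As a rough illustration of what `temperature` and `top_p` control, the sketch below applies both to a toy set of token scores. It is a simplified stand-in written in plain Python, not the sampling code that ctransformers actually runs; the function names and the toy scores are assumptions made for this example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p, then renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens
sharp = softmax_with_temperature(toy_logits, temperature=0.1)
flat = softmax_with_temperature(toy_logits, temperature=1.0)

print('temperature=0.1:', top_p_filter(sharp, top_p=0.9))
print('temperature=1.0:', top_p_filter(flat, top_p=0.9))
```

With a low temperature such as the 0.1 used in the config, almost all of the probability mass lands on the top-scoring token, so the answers are close to deterministic.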

In [4]:
config = {
    'gpu_layers': gpu_layers,  
    'temperature': 0.1,
    'top_p': 0.9,
    'context_length': 8000,
    'max_new_tokens': 256,
    'repetition_penalty': 1.2,
    'reset': True
}

llm = CTransformers(model='TheBloke/Mistral-11B-OmniMix-GGUF', model_file='mistral-11b-omnimix-bf16.Q4_K_M.gguf', callbacks=[StreamingStdOutCallbackHandler()], config=config)


Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]


### Default Prompt:
* The default prompt is the template that the user's {question} is injected into

In [5]:
default_prompt = """
    You are an AI assistant who is always happy and helpful.
    Your answers must be appropriate for a 1st grade classroom, so no controversial topics or answers.
    Please answer the following user question:
    {question}

    Please answer that question thinking step by step
    Answer:
"""
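
To see what the model actually receives, the placeholder substitution can be sketched with plain Python string formatting. This is only an illustration of the idea; `fill_prompt`, the shortened template, and the sample question are hypothetical and are not part of LangChain.

```python
# Shortened version of the default prompt, kept small for the sketch
short_template = (
    "Please answer the following user question:\n"
    "{question}\n"
    "Answer:"
)

def fill_prompt(template: str, **values: str) -> str:
    """Hypothetical helper: replace each {placeholder} with the matching value."""
    return template.format(**values)

full_text = fill_prompt(short_template, question='How many legs does a spider have?')
print(full_text)
```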

#### Sample Logic Question
No RAG Used

In [6]:
# The full prompt is produced when the user's question is combined with the default prompt
full_prompt = PromptTemplate(template=default_prompt, input_variables=['question'])

llm_chain = LLMChain(prompt=full_prompt, llm=llm) 

# This is the user's question; when you type into ChatGPT, this is the part you are filling in
user_question = 'There are 3 birds in a nest, 2 fly away and then 3 eggs hatch, how many birds are there now?'

response = llm_chain.run(user_question)

    1) First we have to count the number of birds initially. We know that there were three birds in the nest. So, we can write this as:
      - 3 birds (initial)

    2) Then two of these birds fly away. This means that we need to subtract two from our initial number of birds.
      - 3 birds - 2 = 1 bird

    3) Finally, three eggs hatch and become new birds. We can add this to the remaining bird in step 2:
      - 1 bird + 3 = 4 birds (final)

So, after all these events, there are now four birds!

### 2. Use the LLM with RAG from LC_VectorDB
For RAG you need two models:
* An LLM (loaded above)
* An embedding model, to embed the user question into a vector for the vector database (DB) search

Since we used the BGE small model in the creation of the DB, we **must** import that same embedding model.

In [7]:
# Chroma is an open source vector DB
from langchain.vectorstores import Chroma

# Choose the same embedding model that was used in the creation of the vector DB
# - I used the BGE small model, so we must import that embedding class
from langchain.embeddings import HuggingFaceBgeEmbeddings

In [8]:
# Choose the same embedding model that was used in the creation of the vector DB
model_name = 'BAAI/bge-small-en-v1.5'  # Using open source embedding model

embedding_function = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs={'device': device},
    encode_kwargs={'normalize_embeddings': True} #normalizes the vectors
)

print(f'Embedding Model loaded: {model_name}')

Embedding Model loaded: BAAI/bge-small-en-v1.5
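
A small sketch of why `normalize_embeddings` is useful: once vectors are scaled to unit length, a plain dot product gives the same value as cosine similarity. The two vectors below are made-up 2-D examples, not real BGE embeddings, and the helper names are assumptions for this illustration.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [1.0, 2.0]  # hypothetical toy embeddings
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

print('dot of normalized vectors:', dot(a_unit, b_unit))
print('cosine of original vectors:', cosine(a, b))
```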


In [9]:
vector_db_name = 'LC_VectorDB'

vectorDB = Chroma(persist_directory=vector_db_name, embedding_function=embedding_function)

# k is the number of documents to use: aka use the top 2 most relevant docs
retriever = vectorDB.as_retriever(search_kwargs={'k': 2})

print(f'Vector Database loaded: {vector_db_name}')

Vector Database loaded: LC_VectorDB


### Prompting for RAG
Order of operations:
1. The user's question is turned into a vector by the Embedding Model
2. That question vector is used to find similar vectors in the Vector Database
3. The best "k" matches are returned and stuffed into the default prompt where it says {summaries}
4. The full prompt with the summaries and user question is passed to the LLM
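
The four steps above can be sketched end to end in plain Python. Note the assumptions: `toy_embed` is a bag-of-words stand-in (this is not how the BGE model works), the chunks are invented sentences, and the list search replaces the Chroma lookup.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for the embedding model: word counts instead of a learned vector."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, chunks, k=2):
    """Steps 1-3: embed the question, score every chunk, return the best k."""
    q_vec = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(q_vec, toy_embed(c)), reverse=True)
    return ranked[:k]

chunks = [  # hypothetical document chunks
    'zero-shot means the model gets only an instruction',
    'few-shot means the model gets a few examples',
    'the training data was filtered from common crawl',
]
question = 'what does zero-shot mean for the model'
top_chunks = retrieve(question, chunks, k=2)

# Step 4: stuff the retrieved text and the question into the prompt
rag_prompt = 'Context: {summaries}\nQuestion: {question}\nAnswer:'.format(
    summaries='\n'.join(top_chunks), question=question
)
print(rag_prompt)
```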


In [10]:
from langchain.chains import RetrievalQAWithSourcesChain 

# Need a new default prompt that includes the summaries (the data retrieved by RAG)
default_prompt_with_context = (
    """
    You are "PaperBot", an AI assistant for answering questions about an arXiv paper. Assume all questions you receive are about this paper.
    Please limit your answers to the information provided in the "Context:"

    Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
    Context: {summaries}

    Use that context to answer the following question about the paper.
    Keep your answer short and concise. Do not ramble!
    Question: {question}
    Answer: """)


chain_type_kwargs={
        'prompt': PromptTemplate(
            template=default_prompt_with_context,
            input_variables=['summaries', 'question'],
        ),
    }

chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type='stuff', # stuff means that the retrieved context is "stuffed" into the prompt
    retriever=retriever,
    return_source_documents=True, # This returns the sources used by RAG
    chain_type_kwargs=chain_type_kwargs
)


##### Query with RAG
Now we will ask a question and the following steps will happen:
1. The user question is turned into a vector
2. That question vector is compared to the vectors in our VectorDB
3. The page_content of the best "k" matches is returned as "summaries"
4. The summaries and the non-vectorized user question are passed into default_prompt_with_context


In [11]:
# Now the user question will first be passed into RAG to find relevant info 
user_question = 'How did they describe zero-shot?'

llm_response = chain({'question': user_question})

print('\n\nSources:')
for document in llm_response['source_documents']:  
    print(' ', document.metadata['source'])



    In this paper, zero-shot is described as a setting where "the model is only given a natural language instruction describing the task". This method provides maximum convenience, potential for robustness, and avoidance of spurious correlations. However, it also poses challenges due to ambiguity in instructions and lack of prior examples.

Sources:
  Language Models are Few-Shot Learners, page 7
  Language Models are Few-Shot Learners, page 60


In [12]:
print(llm_response['source_documents']) #this prints the entire retrieved data
# Note: only page_content is seen by the LLM

[Document(page_content='Figure 2.1: Zero-shot, one-shot and few-shot, contrasted with traditional ﬁne-tuning . The panels above show\nfour methods for performing a task with a language model – ﬁne-tuning is the traditional method, whereas zero-, one-,\nand few-shot, which we study in this work, require the model to perform the task with only forward passes at test\ntime. We typically present the model with a few dozen examples in the few shot setting. Exact phrasings for all task\ndescriptions, examples and prompts can be found in Appendix G.\n•Zero-Shot (0S) is the same as one-shot except that no demonstrations are allowed, and the model is only given\na natural language instruction describing the task. This method provides maximum convenience, potential for\nrobustness, and avoidance of spurious correlations (unless they occur very broadly across the large corpus of\npre-training data), but is also the most challenging setting. In some cases it may even be difﬁcult for humans\nto und

Now we can answer questions from our pdf.  
However, the model has no memory of the conversation, as seen in the example below:

In [13]:
# The model has no memory; it can only predict the next token
llm_response = chain({'question': 'What did I just ask you?'}) # Since chat history is not included, it won't know


    Please provide a summary of the passage or question asked in the given context.

### 3. Conversational Memory without RAG
Next we will implement conversational memory without RAG  
* This is done by passing the chat history where we previously passed the retrieved data 
* The history of the conversation is included in the full prompt sent to the model
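
The idea can be sketched without LangChain: keep the past turns in a list and paste them into the prompt before each new question. `SimpleBufferMemory` and the sample turn are hypothetical names and data for this illustration, not the memory class used in this notebook.

```python
class SimpleBufferMemory:
    """Hypothetical minimal memory: stores (question, answer) turns verbatim."""

    def __init__(self):
        self.turns = []

    def save(self, question, answer):
        self.turns.append((question, answer))

    def as_text(self):
        # Render the turns the way they will appear inside the prompt
        return '\n'.join(f'Human: {q}\nAI: {a}' for q, a in self.turns)

    def clear(self):
        self.turns = []

history_template = 'Current conversation:\n{history}\n\nNew Question: {input}\n\nAnswer:'

memory_sketch = SimpleBufferMemory()
memory_sketch.save('What is 2 + 2?', '4')

# The earlier turn is now part of the text the model sees
prompt_text = history_template.format(
    history=memory_sketch.as_text(), input='What did I just ask you?'
)
print(prompt_text)
```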

In [14]:
from langchain.chains import ConversationChain
from langchain.memory import ConversationSummaryBufferMemory

In [15]:
# Below is the new prompt. It uses a rather silly One-Shot prompt
default_prompt = """
Your name is "Sandwich AI"
You must start and end your answers with the "bread".

Example Start:
Question: "What is your name?"
Answer: "Bread | My name is Sandwich AI | Bread"
Example End

The history of the current conversation is provided below:
Current conversation:
{history}

New Question: {input}

Answer: 
"""

full_prompt = PromptTemplate(input_variables=['history', 'input'], template=default_prompt)

# There are many different memory types; this one keeps the most recent conversation and summarizes the preceding conversation
memory = ConversationSummaryBufferMemory(
        llm=llm, 
        return_messages=True
    )


conversation = ConversationChain(
    prompt=full_prompt,
    llm=llm,
    verbose=False, # Set to True to see what is happening in the background
    memory=memory,
)
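
A rough sketch of the "keep recent turns, summarize older ones" behavior, assuming a simple token budget. The word-count tokenizer and the truncating `summarize` function are crude stand-ins for the real tokenizer and the LLM-written summary, and the turns are invented.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def summarize(old_summary, dropped_turns):
    """Stand-in for the LLM summarizer: keeps only the first five words of each dropped turn."""
    clipped = [' '.join(turn.split()[:5]) for turn in dropped_turns]
    return ' / '.join(([old_summary] if old_summary else []) + clipped)

def prune(summary, turns, max_tokens):
    """Move the oldest turns into the summary until the remaining buffer fits the budget."""
    turns = list(turns)
    dropped = []
    while turns and sum(count_tokens(t) for t in turns) > max_tokens:
        dropped.append(turns.pop(0))
    if dropped:
        summary = summarize(summary, dropped)
    return summary, turns

turns = [  # hypothetical conversation
    'Human: question one about zero-shot prompting',
    'AI: answer one',
    'Human: question two',
    'AI: answer two',
]
summary, recent = prune('', turns, max_tokens=8)
print('Summary:', summary)
print('Recent turns:', recent)
```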


In [16]:
memory.clear()
answer = conversation.predict(input='Why do the english say Maths with a "s"?')


Bread | The use of an 's' in words like maths and physics can be traced back to Old English. In Middle English, this 's' was often dropped, but it has since been revived in modern times as part of a movement to preserve traditional spellings. This is why you might see both "math" and "maths" used interchangeably today. | Bread

In [17]:
print(memory.load_memory_variables({})) # this is the history of the chat

{'history': [HumanMessage(content='Why do the english say Maths with a "s"?'), AIMessage(content='Bread | The use of an \'s\' in words like maths and physics can be traced back to Old English. In Middle English, this \'s\' was often dropped, but it has since been revived in modern times as part of a movement to preserve traditional spellings. This is why you might see both "math" and "maths" used interchangeably today. | Bread')]}


In [18]:
answer = conversation.predict(input='What did I just ask you?')


Bread | You asked me, "Why do the english say Maths with a \'s\'?" | Bread

In [19]:
answer = conversation.predict(input='What is your name?')


"Bread | My name is Sandwich AI | Bread"

In [20]:
memory.clear()

### 4. Conversational Memory with RAG and Sources
Order of operations depends on when the question is asked.
* If it is the first time the user asks a question, then their exact question is put into the default prompt

* For every prompt after that first question the procedure is as follows:
    1. Use the CONDENSE_QUESTION_PROMPT to input the chat history and the user's follow-up question to generate a Standalone question
        * This Standalone question rephrases the user's question in the context of the chat history
    2. Use the Standalone question to retrieve the RAG data, then pass that data into the default prompt together with the question (the Standalone question by default; the user's original question when `rephrase_question=False`)
    
#### Key Takeaway: For follow up questions the LLM is used twice
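
The control flow can be sketched with a stand-in model that only counts its calls. `CountingLLM`, `answer_with_history`, and the lambda retriever are hypothetical; the sketch passes the original question to the answer prompt, which is an assumption matching `rephrase_question=False`.

```python
CONDENSE_TEMPLATE = (
    'Chat History:\n{chat_history}\nFollow Up Input: {question}\nStandalone question:'
)

class CountingLLM:
    """Stand-in for the real LLM: records how many times it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, prompt):
        self.calls += 1
        return f'<model output {self.calls}>'

def answer_with_history(question, chat_history, llm, retrieve):
    if chat_history:
        # Follow-up: first LLM call rewrites the question using the history
        history_text = '\n'.join(f'Human: {q}\nAssistant: {a}' for q, a in chat_history)
        search_query = llm(CONDENSE_TEMPLATE.format(chat_history=history_text, question=question))
    else:
        # First question: used as-is, no extra LLM call
        search_query = question
    context = retrieve(search_query)
    # Final LLM call answers from the retrieved context
    reply = llm(f'Context: {context}\nQuestion: {question}\nAnswer:')
    chat_history.append((question, reply))
    return reply

demo_llm = CountingLLM()
demo_history = []
answer_with_history('What is One-Shot prompting?', demo_history, demo_llm, lambda q: 'retrieved text')
calls_after_first = demo_llm.calls
answer_with_history('How does that compare to Few-Shot?', demo_history, demo_llm, lambda q: 'retrieved text')
calls_for_follow_up = demo_llm.calls - calls_after_first
print('LLM calls for first question:', calls_after_first)
print('LLM calls for follow-up:', calls_for_follow_up)
```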

In [21]:
from langchain.chains import ConversationalRetrievalChain

from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT

CONDENSE_QUESTION_PROMPT.template

'Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.\n\nChat History:\n{chat_history}\nFollow Up Input: {question}\nStandalone question:'

In [22]:
default_prompt = (
    """
    You are "PaperBot", an AI assistant for answering questions about an arXiv paper. Assume all questions you receive are about this paper.
    Please limit your answers to the information provided in the "Context:"

    Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
    Context: {summaries}

    Use that context to answer the following question about the paper.
    Keep your answer short and concise. Do not ramble!
    Question: {question}
    Answer:
    """)


PROMPT = PromptTemplate(input_variables=['summaries', 'question'], template=default_prompt)

In [23]:
from langchain.chains.qa_with_sources.loading import load_qa_with_sources_chain

# This will summarize the chat history when it gets too long
memory = ConversationSummaryBufferMemory(
    llm=llm,
    input_key='question',
    output_key='answer',
    memory_key='chat_history',
    return_messages=True,
)

question_generator = LLMChain(
    llm=llm,
    prompt=CONDENSE_QUESTION_PROMPT,
    verbose=True,
)

answer_chain = load_qa_with_sources_chain(
    llm=llm,
    chain_type='stuff',
    verbose=False,
    prompt=PROMPT
)

# Set up the ConversationalRetrievalChain to return source documents
chain = ConversationalRetrievalChain(
    retriever=retriever,
    question_generator=question_generator,
    combine_docs_chain=answer_chain,
    verbose=False,
    memory=memory,
    rephrase_question=False, # Standalone question is used for retrieval only; the answer prompt gets the original question
    return_source_documents=True,
)

In [24]:
memory.clear()

users_first_question = 'What is One-Shot prompting?'

result = chain({'question': users_first_question})

print('\n\nSources:')
for document in result['source_documents']:  
    print(' ', document.metadata['source'])

1) In the context of language models, one-shot prompting refers to a method where the model is given only one example or demonstration for each task it needs to perform. This contrasts with traditional fine-tuning methods that require multiple examples and iterations to learn from.
2) The goal of this approach is to test how well the language models can generalize their knowledge based on limited input, simulating human learning processes where we often encounter new situations or tasks only once or a few times before having to apply our understanding.

Sources:
  Language Models are Few-Shot Learners, page 7
  Language Models are Few-Shot Learners, page 24


In [25]:
print(result['source_documents']) # prints the data returned via RAG

[Document(page_content='Figure 2.1: Zero-shot, one-shot and few-shot, contrasted with traditional ﬁne-tuning . The panels above show\nfour methods for performing a task with a language model – ﬁne-tuning is the traditional method, whereas zero-, one-,\nand few-shot, which we study in this work, require the model to perform the task with only forward passes at test\ntime. We typically present the model with a few dozen examples in the few shot setting. Exact phrasings for all task\ndescriptions, examples and prompts can be found in Appendix G.\n•Zero-Shot (0S) is the same as one-shot except that no demonstrations are allowed, and the model is only given\na natural language instruction describing the task. This method provides maximum convenience, potential for\nrobustness, and avoidance of spurious correlations (unless they occur very broadly across the large corpus of\npre-training data), but is also the most challenging setting. In some cases it may even be difﬁcult for humans\nto und

In [26]:
# Follow up question
users_follow_up_question = 'How does that compare to Few-Shot?'
result = chain({'question': users_follow_up_question})

print('\n\nSources:')
for document in result['source_documents']:  
    print(document.metadata['source'])

# First, it will print out the condense_question_prompt with the chat history filled in (because verbose=True on question_generator).
# Then, the question_generator model generates a Standalone question based on that condense_question_prompt
# Finally, the Standalone question is used to retrieve documents, and the answer_chain answers from them
# (with rephrase_question=False, the answer_chain receives the user's original question)



> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:

Human: What is One-Shot prompting?
Assistant: 1) In the context of language models, one-shot prompting refers to a method where the model is given only one example or demonstration for each task it needs to perform. This contrasts with traditional fine-tuning methods that require multiple examples and iterations to learn from.
2) The goal of this approach is to test how well the language models can generalize their knowledge based on limited input, simulating human learning processes where we often encounter new situations or tasks only once or a few times before having to apply our understanding.
Follow Up Input: How does that compare to Few-Shot?
Standalone question:[0m
 Can you explain how One-Shot prompting differs from Few-Shot prompting in 

In [27]:
print(result['answer']) # this is the final answer
print('\n\nSources:')
for document in result['source_documents']:  
    print(document.metadata['source'])

1) In terms of performance on specific tasks, GPT-3's few-shot results are generally slightly behind state-of-the-art fine-tuned models but still impressive. For example, in the SAT analogy task, it achieves 65.2% accuracy in the few-shot setting compared to human performance of around 57%.
    2) In terms of sample efficiency, GPT-3's ability to learn from a small number of examples is remarkable and highlights its potential for transfer learning across different tasks without extensive fine-tuning.


Sources:
Language Models are Few-Shot Learners, page 7
Language Models are Few-Shot Learners, page 24
