# Rag chatbot
Let's take the next step in our journey! We've explored LLMs, chatbots, and RAG. Now, it's time to put them all together to create a powerful tool: a RAG chain with memory.

---
## 1.&nbsp; Installations and Settings üõ†Ô∏è
Except one item, this is the same as the last notebook. The only new item is a line to download a saved verion of the vector database created from Alice's Adventures in Wonderland. This saves us loading, splitting, and vectorising the book all over again.

In [None]:
!pip3 install -qqq langchain --progress-bar off
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip3 install -qqq llama-cpp-python --progress-bar off
!pip3 install -qqq sentence_transformers --progress-bar off
!pip3 install -qqq faiss-gpu --progress-bar off

!huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GGUF mistral-7b-instruct-v0.1.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Installing backend dependencies ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Building wheel for llama-cpp-python (pyproject.toml) ... [?25l[?25hdone
Consider using `hf_transfer` for faster downloads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
downloading https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf to /root/.cache/huggingface/hub/tmpzvgvot1y
mistral-7b-instruct-v0.1.Q4_K_M.gguf: 100% 4.37G/4.37G [00:44<00:00, 98.2MB/s]
./mistral-7b-instruct-v0.1.Q4_K_M.gguf


In [None]:
# in case you get the error 'NoneType' object has no attribute 'groups'
!pip install --upgrade gdown

# download saved vector database for Alice's Adventures in Wonderland
!gdown --folder 1A8A9lhcUXUKRrtCe7rckMlQtgmfLZRQH

Collecting gdown
  Downloading gdown-5.1.0-py3-none-any.whl (17 kB)
Installing collected packages: gdown
  Attempting uninstall: gdown
    Found existing installation: gdown 4.7.3
    Uninstalling gdown-4.7.3:
      Successfully uninstalled gdown-4.7.3
Successfully installed gdown-5.1.0
Retrieving folder contents
Processing file 1h_lk4wTr12FAEaCS3eIJ4xsdcmnuIGmt index.faiss
Processing file 1O0Jz2Lx5cZdpQM7S5uw6Kx9_OLm5DuSQ index.pkl
Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=1h_lk4wTr12FAEaCS3eIJ4xsdcmnuIGmt
To: /content/faiss_index/index.faiss
100% 421k/421k [00:00<00:00, 2.96MB/s]
Downloading...
From: https://drive.google.com/uc?id=1O0Jz2Lx5cZdpQM7S5uw6Kx9_OLm5DuSQ
To: /content/faiss_index/index.pkl
100% 216k/216k [00:00<00:00, 2.14MB/s]
Download completed


---
## 2.&nbsp; Setting up the chain üîó
There are 2 new items in this code that you haven't seen before:
* the `output_key` parameter in [ConversationBufferMemory](https://api.python.langchain.com/en/latest/memory/langchain.memory.buffer.ConversationBufferMemory.html)
* [ConversationalRetrievalChain](https://api.python.langchain.com/en/latest/chains/langchain.chains.conversational_retrieval.base.ConversationalRetrievalChain.html#)

The `ConversationalRetrievalChain` is the LangChain chain for RAG with memory.

The `output_key` parameter is necessary if you want to include both `memory` and `return_source_documents` with `ConversationalRetrievalChain`. Without this parameter in `memory`, you'll get an error in `ConversationalRetrievalChain` when using both the aformentioned parameters as it gets two parameters when it's expecting one. Adding this parameter to `memory` means it knows which output to accept. If you're not using both `memory` and `return_source_documents` with `ConversationalRetrievalChain`, this isn't necessary.

In [1]:
from langchain.llms import LlamaCpp
from langchain.vectorstores import FAISS
from langchain import PromptTemplate
from langchain.chains import ConversationalRetrievalChain
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.memory import ConversationBufferMemory

# llm
llm = LlamaCpp(model_path = "/Users/sadiakhanrupa/Bootcamp Main Phase/Chapter_8_generative_Ai/Chapter_8_generative_Ai/Model/faiss_index/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
               max_tokens = 2000,
               temperature = 0.1,
               top_p = 1,
               n_gpu_layers = -1,
               n_ctx = 1024)

# embeddings
import os

embedding_model = "sentence-transformers/all-MiniLM-l6-v2"
current_folder = os.getcwd()
embeddings_folder = current_folder  # Using the current directory

embeddings = HuggingFaceEmbeddings(model_name=embedding_model, cache_folder=embeddings_folder)

# load vector Database
# allow_dangerous_deserialization is needed. Pickle files can be modified to deliver a malicious payload that results in execution of arbitrary code on your machine
vector_db = FAISS.load_local("/Users/sadiakhanrupa/Bootcamp Main Phase/Chapter_8_generative_Ai/Chapter_8_generative_Ai/Model/vector", embeddings, allow_dangerous_deserialization=True)

# retriever
retriever = vector_db.as_retriever(search_kwargs={"k": 2})

# memory
memory = ConversationBufferMemory(memory_key='chat_history',
                                  return_messages=True,
                                  output_key='answer')

# prompt
template = """
<s> [INST]
You are polite and professional question-answering AI assistant. You must provide a helpful response to the user.

In your response, PLEASE ALWAYS:
  (0) Be a detail-oriented reader: read the question and context and understand both before answering
  (1) Start your answer with a friendly tone, and reiterate the question so the user is sure you understood it
  (2) If the context enables you to answer the question, write a detailed, helpful, and easily understandable answer. If you can't find the answer, respond with an explanation, starting with: "I couldn't find the answer in the information I have access to".
  (3) Ensure your answer answers the question, is helpful, professional, and formatted to be easily readable.
  (4) Give funniest answers
[/INST]
[INST]
Answer the following question using the context provided.
The question is surrounded by the tags <q> </q>.
The context is surrounded by the tags <c> </c>.
<q>
{question}
</q>
<c>
{context}
</c>
[/INST]
</s>
[INST]
Helpful Answer:
[INST]
"""

prompt = PromptTemplate(template=template,
                        input_variables=["context", "question"])

# chain
chain = ConversationalRetrievalChain.from_llm(llm,
                                              retriever=retriever,
                                              memory=memory,
                                              return_source_documents=True,
                                              combine_docs_chain_kwargs={"prompt": prompt})

llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /Users/sadiakhanrupa/Bootcamp Main Phase/Chapter_8_generative_Ai/Chapter_8_generative_Ai/Model/faiss_index/mistral-7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                

In [2]:
chain.invoke("Who is the queen?")


llama_print_timings:        load time =    3255.06 ms
llama_print_timings:      sample time =      35.26 ms /    87 runs   (    0.41 ms per token,  2467.11 tokens per second)
llama_print_timings: prompt eval time =   18584.95 ms /   531 tokens (   35.00 ms per token,    28.57 tokens per second)
llama_print_timings:        eval time =    5359.66 ms /    86 runs   (   62.32 ms per token,    16.05 tokens per second)
llama_print_timings:       total time =   24505.94 ms /   617 tokens


{'question': 'Who is the queen?',
 'chat_history': [HumanMessage(content='Who is the queen?'),
  AIMessage(content='The Queen in the story is the Queen of Hearts. She is a character in Lewis Carroll\'s "Alice\'s Adventures in Wonderland". In the story, she is portrayed as a tyrannical ruler who is quick to anger and has a penchant for beheadings. The Queen is also known for her love of playing croquet and her dislike of hedgehogs.')],
 'answer': 'The Queen in the story is the Queen of Hearts. She is a character in Lewis Carroll\'s "Alice\'s Adventures in Wonderland". In the story, she is portrayed as a tyrannical ruler who is quick to anger and has a penchant for beheadings. The Queen is also known for her love of playing croquet and her dislike of hedgehogs.',
 'source_documents': [Document(page_content='‚ÄúWould you tell me,‚Äù said Alice, a little timidly, ‚Äúwhy you are\npainting those roses?‚Äù\n\nFive and Seven said nothing, but looked at Two. Two began in a low\nvoice, ‚ÄúWhy th

In [3]:
print(chain.invoke("What does she enjoy doing?")["answer"])

Llama.generate: prefix-match hit

llama_print_timings:        load time =    3255.06 ms
llama_print_timings:      sample time =       7.52 ms /    20 runs   (    0.38 ms per token,  2661.34 tokens per second)
llama_print_timings: prompt eval time =    4343.00 ms /   149 tokens (   29.15 ms per token,    34.31 tokens per second)
llama_print_timings:        eval time =    1021.27 ms /    19 runs   (   53.75 ms per token,    18.60 tokens per second)
llama_print_timings:       total time =    5521.28 ms /   168 tokens
Llama.generate: prefix-match hit

llama_print_timings:        load time =    3255.06 ms
llama_print_timings:      sample time =      16.06 ms /    59 runs   (    0.27 ms per token,  3673.27 tokens per second)
llama_print_timings: prompt eval time =   20449.66 ms /   715 tokens (   28.60 ms per token,    34.96 tokens per second)
llama_print_timings:        eval time =    3351.92 ms /    58 runs   (   57.79 ms per token,    17.30 tokens per second)
llama_print_timings:       to

The Queen in "Alice's Adventures in Wonderland" enjoys playing croquet with hedgehogs. She also likes to reminisce about her childhood and make her own children's eyes bright with strange tales, including the dream of Wonderland from long ago.


In [4]:
print(chain.invoke("Whose head does she chop off?")["answer"])

Llama.generate: prefix-match hit

llama_print_timings:        load time =    3255.06 ms
llama_print_timings:      sample time =       8.19 ms /    22 runs   (    0.37 ms per token,  2686.53 tokens per second)
llama_print_timings: prompt eval time =    6354.37 ms /   223 tokens (   28.49 ms per token,    35.09 tokens per second)
llama_print_timings:        eval time =    1145.78 ms /    21 runs   (   54.56 ms per token,    18.33 tokens per second)
llama_print_timings:       total time =    7702.24 ms /   244 tokens
Llama.generate: prefix-match hit

llama_print_timings:        load time =    3255.06 ms
llama_print_timings:      sample time =      10.10 ms /   124 runs   (    0.08 ms per token, 12279.66 tokens per second)
llama_print_timings: prompt eval time =   17664.93 ms /   618 tokens (   28.58 ms per token,    34.98 tokens per second)
llama_print_timings:        eval time =    6861.93 ms /   123 runs   (   55.79 ms per token,    17.92 tokens per second)
llama_print_timings:       to

The Queen in "Alice's Adventures in Wonderland" chops off the head of the White Rabbit.

In the story, the Queen is angry and frustrated with the players for not following her rules. She becomes even more enraged when she sees Alice playing with the hedgehogs instead of taking her turn. The Queen then orders the executioner to behead the players who have missed their turns, including the White Rabbit.

It's important to note that this is a fictional story and the actions of the characters are not meant to be taken seriously.
