# Introduction
In this notebook, we pick up where we left off in the [preparation notebook](https://github.com/shaaagri/iat481-nlp-proj/blob/main/LLama2_vanilla_bot.ipynb) and add a vector store with a retriever to the pipeline. This should be enough to lay the groundwork for our goal: a chatbot powered by RAG (Retrieval-Augmented Generation), which is, in essence, a special case of automated prompt engineering. As a reminder, the specialized knowledge we plan to inject into the chatbot concerns sleep hygiene and related science-backed tips.

![RAG overview diagram](images/RAG_overview_diagram.png)

# Workflow

1. Setting Up Llama-2 and LangChain
2. Text Embeddings and the Vector Store
3. Preparing a RAG Pipeline Using Sample Data
4. Completing the RAG Pipeline Using Real Data

# Setting Up Llama-2 and LangChain

This section mostly repeats the code from the preparation notebook. If that notebook has already been run in the same kernel, re-running this section may not be required, as the kernel keeps its state. However, this section may diverge from the previous notebook, so we recommend re-running all of the cells here.

### Prerequisites

In [12]:
# GPU llama-cpp-python
%set_env CMAKE_ARGS="-DLLAMA_CUBLAS=on"
%set_env FORCE_CMAKE=1
!pip install llama-cpp-python --upgrade --verbose
!pip install huggingface_hub
!pip install llama-cpp-python

env: CMAKE_ARGS="-DLLAMA_CUBLAS=on"
env: FORCE_CMAKE=1
Using pip 24.0 from C:\Program Files\Python312\Lib\site-packages\pip (python 3.12)
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


In [1]:
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

### Model

In [2]:
model_name_or_path = "TheBloke/Llama-2-7B-chat-GGUF"
model_basename = "llama-2-7b-chat.Q4_K_M.gguf"

Downloading the model can be time-consuming. `hf_hub_download` reuses a copy from the Hugging Face Hub cache folder if one was stored there during a previous notebook run, so the cell below should only download the file when it is missing.
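To see whether the model is already in the cache before running the download cell, a quick look at the cache folder is enough. The sketch below is a minimal, standard-library-only check and not part of the original pipeline. It assumes the usual Hub cache layout (`models--<org>--<name>/snapshots/<revision>/<file>`, the same layout visible in the model path printed later in this notebook) and the `HF_HUB_CACHE` / `HF_HOME` environment overrides; the helper names `hub_cache_dir` and `cached_snapshots` are our own.

```python
import os
from pathlib import Path


def hub_cache_dir() -> Path:
    """Resolve the Hub cache folder, honouring the common environment overrides."""
    if "HF_HUB_CACHE" in os.environ:
        return Path(os.environ["HF_HUB_CACHE"])
    # Assumption: without overrides the cache lives under ~/.cache/huggingface/hub
    hf_home = os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
    return Path(hf_home) / "hub"


def cached_snapshots(repo_id: str, filename: str) -> list:
    """Return every cached copy of `filename` for a model repo (empty list if none)."""
    repo_folder = hub_cache_dir() / ("models--" + repo_id.replace("/", "--"))
    return sorted(repo_folder.glob(f"snapshots/*/{filename}"))


hits = cached_snapshots("TheBloke/Llama-2-7B-chat-GGUF", "llama-2-7b-chat.Q4_K_M.gguf")
print(hits if hits else "Not cached yet; the download cell will fetch the file.")
```

If the list is non-empty, the download cell should return almost immediately.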

In [3]:
model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)

### LangChain

In [307]:
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler
from langchain_core.prompts import PromptTemplate

Here we write the system prompt inside a basic template used to initialize the chatbot. During our experiments we noticed that it exerts a lot of influence on the bot's behavior, being no less important than the Llama-2 parameters.

In [322]:
# Plain (non-f) string: {question} must remain a placeholder for PromptTemplate to fill in later
prompt_template = '''[INST]
<<SYS>>
You are helpful, respectful, caring and honest assistant. You do not have expressions or emotions. You are objective and provide everything that is helpful to know given the question, but you are not chatty. Answer as helpfully as you possibly can.
<</SYS>>

USER: {question}

ASSISTANT: 
[/INST]
'''

In [323]:
prompt = PromptTemplate(
    input_variables=["question"],
    template=prompt_template,
)

The model then can be easily initialized thanks to LangChain's built-in `llama.cpp` wrapper ([documentation](https://python.langchain.com/docs/integrations/llms/llamacpp/)).

In [324]:
# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

In [330]:
llm = LlamaCpp(
    # Reuse the path returned by hf_hub_download above instead of hardcoding a machine-specific one
    model_path=model_path,
    
    temperature=0.6,
    n_gpu_layers=-1,  # -1 stands for offloading all layers of the model to GPU => better performance (we've got enough VRAM)
    n_ctx=4096,  # IMPORTANT for RAG: the LlamaCpp wrapper defaults to a context of only 512 tokens
    max_tokens=1024,
    repeat_penalty=1.02,
    top_p=0.8, # nucleus sampling
    top_k=150,  # sample from k top tokens 
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /Users/Narratic-DEV002/.cache/huggingface/hub/models--TheBloke--Llama-2-7B-chat-GGUF/snapshots/191239b3e26b2882fb562ffccdd1cf0f65402adb/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.

In [11]:
from langchain.globals import set_debug

# debugging on demand
set_debug(True) 

In [327]:
question='Describe the main campus of the Simon Fraser University'

In [331]:
llm.invoke(prompt.format(question=question))

[32;1m[1;3m[llm/start][0m [1m[1:llm:LlamaCpp] Entering LLM run with input:
[0m{
  "prompts": [
    "[INST]\n<<SYS>>\nYou are helpful, respectful, caring and honest assistant. You do not have expressions or emotions. You are objective and provide everything that is helpful to know given the question, but you are not chatty. Answer as helpfully as you possibly can.\n<</SYS>>\n\nUSER: Describe the main campus of the Simon Fraser University\n\nASSISTANT: \n[/INST]"
  ]
}
Thank you for asking! The main campus of Simon Fraser University is located in Burnaby, British Columbia, Canada. It spans over 300 acres and is home to many of the university's academic and research buildings, including the SFU Library, the Faculty of Arts and Social Sciences, the Faculty of Science, and the Faculty of Applied Sciences. The campus is also home to a variety of student services, such as the Student Union Building, the SFU Bookstore, and the SFU Health and Counselling Centre. Additionally, the campus fe


llama_print_timings:        load time =     820.17 ms
llama_print_timings:      sample time =     129.32 ms /   536 runs   (    0.24 ms per token,  4144.60 tokens per second)
llama_print_timings: prompt eval time =    7502.30 ms /    98 tokens (   76.55 ms per token,    13.06 tokens per second)
llama_print_timings:        eval time =   57520.61 ms /   535 runs   (  107.52 ms per token,     9.30 tokens per second)
llama_print_timings:       total time =   67575.09 ms /   633 tokens


[36;1m[1;3m[llm/end][0m [1m[1:llm:LlamaCpp] [67.58s] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": "Thank you for asking! The main campus of Simon Fraser University is located in Burnaby, British Columbia, Canada. It spans over 300 acres and is home to many of the university's academic and research buildings, including the SFU Library, the Faculty of Arts and Social Sciences, the Faculty of Science, and the Faculty of Applied Sciences. The campus is also home to a variety of student services, such as the Student Union Building, the SFU Bookstore, and the SFU Health and Counselling Centre. Additionally, the campus features a variety of green spaces, including gardens, parks, and walking trails.\nHere are some of the key buildings and facilities located on the main campus:\n* SFU Library: The SFU Library is a hub of academic activity, providing access to a vast collection of books, journals, and other resources. It also offers study spaces, compu

"Thank you for asking! The main campus of Simon Fraser University is located in Burnaby, British Columbia, Canada. It spans over 300 acres and is home to many of the university's academic and research buildings, including the SFU Library, the Faculty of Arts and Social Sciences, the Faculty of Science, and the Faculty of Applied Sciences. The campus is also home to a variety of student services, such as the Student Union Building, the SFU Bookstore, and the SFU Health and Counselling Centre. Additionally, the campus features a variety of green spaces, including gardens, parks, and walking trails.\nHere are some of the key buildings and facilities located on the main campus:\n* SFU Library: The SFU Library is a hub of academic activity, providing access to a vast collection of books, journals, and other resources. It also offers study spaces, computer labs, and other services to support student learning.\n* Faculty of Arts and Social Sciences: This faculty is home to a wide range of pro

# Text Embeddings and the Vector Store

As our RAG bot relies on extra knowledge that we manually package into the project (in the form of Q&A data), we now come to a crucial part: choosing a text embedding model and a vector store. The former converts our textual Q&A data into vector representations, which are required for the semantic similarity comparison later; in other words, for matching the user's question to the most appropriate piece of information within the extra knowledge. The latter stores these representations and provides access to them as needed. These two components are cornerstones of any RAG project, and the use cases and range of choices for both are well documented.
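To make "semantic similarity comparison" concrete, here is a toy sketch using cosine similarity, one common way to compare embedding vectors. The three-dimensional vectors and sentences are made up for illustration; a real embedding model produces vectors with hundreds of dimensions, and the vector store may use a different distance measure internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means 'pointing the same way'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical hand-written "embeddings"; a real model would compute these from the text
toy_store = {
    "Keep a fixed bedtime, even on weekends.": [0.9, 0.1, 0.0],
    "Avoid caffeine late in the afternoon.": [0.2, 0.9, 0.1],
    "Keep the bedroom cool and dark.": [0.1, 0.2, 0.9],
}
toy_query = [0.8, 0.3, 0.1]  # pretend embedding of "When should I go to bed?"

# The stored text whose vector is closest to the query vector wins
best_match = max(toy_store, key=lambda text: cosine_similarity(toy_query, toy_store[text]))
print(best_match)
```

The retriever in the rest of this notebook follows the same idea: embed the question, then return the stored chunks whose vectors are nearest to it.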

### Choosing the Text Embedding Model

For a long time, there was little choice of embedding model besides OpenAI's `ada-002`, which is provided through an API that requires a small fee. However, by April 2024 (the time of writing this notebook) the range has increased considerably: there are now not only other players in the market (e.g. [Cohere](https://cohere.com/embeddings), [Jina](https://jina.ai/embeddings/), both of which offer a free tier) but also open-source text embedding models, such as `SentenceTransformers`, available at Hugging Face ([link](https://huggingface.co/sentence-transformers)).

As students, we are delighted to be able to use a model free of charge; our only question is whether it performs comparably to ada-002. The good news is that our brief research suggests we should be fine with the open-source Sentence Transformers (which come as a [family of models](https://www.sbert.net/docs/pretrained_models.html), each trading off speed for quality in various ways). Here are the resources we are referring to: [(1)](https://iamnotarobot.substack.com/p/should-you-use-openais-embeddings), [(2)](https://www.reddit.com/r/MachineLearning/comments/11okrni/discussion_compare_openai_and_sentencetransformer/), [(3)](https://supabase.com/blog/fewer-dimensions-are-better-pgvector), [(4)](https://weaviate.io/blog/how-to-choose-a-sentence-transformer-from-hugging-face).

The consensus seems to be that it's not necessary to use ada-002 at all as the open-source models match it and sometimes even exceed it in performance. One particular text embedding model that seems to have an ideal balance between size, speed, and accuracy is `all-MiniLM-L6-v2`. It also has an "older brother", a slightly larger model `all-MiniLM-L12-v2`, and according to [this table](https://www.sbert.net/docs/pretrained_models.html), it's only marginally better than all-MiniLM-L6-v2, while being significantly slower. All in all, we think the all-MiniLM-L6-v2 model is an excellent start, given our use case is mostly concerned with general purpose English words. It is also supported by LangChain out of the box. 

In [100]:
!pip install sentence-transformers

Defaulting to user installation because normal site-packages is not writeable
Collecting sentence-transformers
  Downloading sentence_transformers-2.6.1-py3-none-any.whl.metadata (11 kB)
Collecting transformers<5.0.0,>=4.32.0 (from sentence-transformers)
  Downloading transformers-4.39.3-py3-none-any.whl.metadata (134 kB)
     ---------------------------------------- 0.0/134.8 kB ? eta -:--:--
     ----- ------------------------------- 20.5/134.8 kB 682.7 kB/s eta 0:00:01
     -------------------------------------- 134.8/134.8 kB 2.0 MB/s eta 0:00:00
Collecting torch>=1.11.0 (from sentence-transformers)
  Downloading torch-2.2.2-cp312-cp312-win_amd64.whl.metadata (26 kB)
Collecting scikit-learn (from sentence-transformers)
  Downloading scikit_learn-1.4.2-cp312-cp312-win_amd64.whl.metadata (11 kB)
Collecting scipy (from sentence-transformers)
  Downloading scipy-1.13.0-cp312-cp312-win_amd64.whl.metadata (60 kB)
     ---------------------------------------- 0.0/60.6 kB ? eta -:--:--
   



In [103]:
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

### Choosing the Vector Store

As for the vector database that is going to store the embeddings, the decision is considerably easier. It comes down to two well-known alternatives: `Pinecone` (managed) and `ChromaDB` (self-hosted). To remind the reader, our guiding design principle is to rely on open-source and/or free-tier components for 100% of the pipeline, hence ChromaDB is the obvious choice. To double-check, we consulted, for instance, [this article](https://medium.com/@sakhamurijaikar/which-vector-database-is-right-for-your-generative-ai-application-pinecone-vs-chromadb-1d849dd5e9df), and it confirmed our assumption that ChromaDB should be more than enough for what is just a student prototype.

In [137]:
!pip install chromadb

Defaulting to user installation because normal site-packages is not writeable
Collecting chromadb
  Downloading chromadb-0.4.24-py3-none-any.whl.metadata (7.3 kB)
Collecting build>=1.0.3 (from chromadb)
  Downloading build-1.2.1-py3-none-any.whl.metadata (4.3 kB)
Collecting chroma-hnswlib==0.7.3 (from chromadb)
  Downloading chroma-hnswlib-0.7.3.tar.gz (31 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.110.1-py3-none-any.whl.metadata (24 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Downloading uvicorn-0.29.0-py3-none-any.whl.metadata (6.3 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.



In [138]:
from langchain_community.vectorstores import Chroma

# Preparing a RAG Pipeline Using Sample Data

In the spirit of moving steadily but in small steps, we first set up a RAG pipeline with a sample text: our class syllabus from Canvas :)

In [108]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

In [109]:
import os 

# checking which directory is our root
print(os.getcwd())

C:\Users\Narratic-DEV002\Desktop\iat481-nlp-proj


Connecting ChromaDB is a trivial, [well-documented](https://python.langchain.com/docs/integrations/vectorstores/chroma/) task. We pick an appropriate file loader from LangChain's toolkit and split our text file into chunks. Splitting is an important step when working with external knowledge data since if we don't do it, we risk not fitting our augmented prompt into Llama-2's context window (4096 tokens max).
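To illustrate what separator-based splitting does, here is a simplified sketch. It is not LangChain's implementation: it assumes pieces are cut at blank lines (`"\n\n"`) and greedily merged up to `chunk_size` characters, and it leaves out chunk overlap entirely. The helper name `split_on_separator` is our own. Note that a single piece longer than `chunk_size` cannot be broken up this way and is kept whole, which is the kind of situation a "Created a chunk of size ..., which is longer than the specified ..." warning points at.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits into the chunk being built
        else:
            if current:
                chunks.append(current)
            current = piece  # start a new chunk; it may exceed chunk_size on its own
    if current:
        chunks.append(current)
    return chunks


sample = "\n\n".join(["short paragraph", "x" * 50, "another short one", "tail"])
chunk_lengths = [len(chunk) for chunk in split_on_separator(sample, chunk_size=40)]
print(chunk_lengths)
```

Also keep in mind that, in this sketch, `chunk_size` counts characters rather than tokens, so the figure is only a rough proxy for how much of the context window a chunk will occupy.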

In [246]:
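def clear_chroma_db(db):
    """Drop the collection behind `db` so that stale chunks do not leak into the next experiment."""
    if db is None:
        return

    try:
        db.delete_collection()
    except Exception:
        # Best effort: the collection may already be gone
        pass


# No vector store exists yet; this lets the first clear_chroma_db(db) call below run without a NameError
db = None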

In [253]:
# Code based on examples from the LangChain documentation: 
# https://python.langchain.com/docs/integrations/vectorstores/chroma/

# load the sample document 
loader = TextLoader("./samples/481_syllabus.txt")
documents = loader.load()

# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
docs = text_splitter.split_documents(documents)

# load it into Chroma
clear_chroma_db(db)
db = Chroma.from_documents(docs, embedding_function)

# query it
query = "What the IAT481 course is about?"
docs = db.similarity_search(query)

# print results
print(docs[0].page_content)

Created a chunk of size 1394, which is longer than the specified 1024


Course Name: 
Exploring Artificial Intelligence: Its Use, Concepts, and Impact

Course Description:

This course is cross-listed as IAT481 (undergraduate) / IAT885 (graduate).

This course is designed to provide a comprehensive and accessible introduction to the world of artificial intelligence that will empower the students to navigate the AI-driven future. Students will explore fundamental AI concepts, including machine learning, neural networks, natural language processing, and computer vision; discover real-world applications, ethical considerations, and the societal impact of AI. 

 
Course Info:

Course will be held between Jan 8 – Apr 12, 2024: Thu, 12:30–2:20 p.m. @ SRYC3170 . Tutorial sessions will be held weekly after the course @ SRYC3050

Instructor: Dr. O. Nilay Yalcin oyalcin@sfu.ca , Office Hours: Wednesdays 12:30 – 2:30pm @SRYC 2282 (by email appointment only, contact at least 1 day before)

TA: Maryiam Zahoor maryiam_zahoor@sfu.ca, Office Hours: Wednesdays SRY

Very nice! Querying the vector store indeed gave us a relevant chunk. Let's try another query, though, to confirm this result is not a fluke:

In [254]:
query = "What is the e-mail policy in IAT 481?"
docs = db.similarity_search(query)
print(docs[0].page_content)

Consider e-mail etiquette,  http://www.albion.com/netiquette/corerules.html when sending an email to us. 
To promote understanding with your reader:

    Write a clear subject line that shows your section number and the purpose of the email. Include course number in email subject: "IAT 418/885: .... ". Thanks!

    Identify your audience by name (i.e Hi Nilay or Hello Maryiam)

    Compose a direct, concise message with a clear purpose (i.e I have a question about today’s activity or I will not be in class next week.)

    Proofread and use appropriate language for the context of your message--friendly and professional. 

    Close with your name and student number (i.e Regards, Brenda Sans (301001010))

Email Protocols

    Your Instructor and TA will reply to e-mails within 24 hours during weekdays.

    We do not answer emails after 5pm, or on weekends and holidays.

    Requests for grade changes and extensions must be sent directly to the course Instructor.


Looks good. The chunk returned to us matches our query. We can now plug this into Llama-2, replace the sample data with our Q&A pairs, and, hopefully, end up with a working RAG-powered application.

# Completing the RAG Pipeline Using Real Data

To recap, our project goal is to get our chatbot to provide advice on better sleep. At first, we were not sure what data format to use for this purpose. However, some production-ready datasets such as [MedQuad-MedicalQnADataset by keivalya](https://huggingface.co/datasets/keivalya/MedQuad-MedicalQnADataset) consist of **simple Q&A pairs** listed in a plain **CSV file**. We chose the same route (thankfully, LangChain ships a file loader for that: [here is an example of its use](https://betterprogramming.pub/build-a-chatbot-on-your-csv-data-with-langchain-and-openai-ed121f85f0cd)).

Basically, let's try loading and chunking our CSV data first:

In [155]:
from langchain_community.document_loaders.csv_loader import CSVLoader

In [173]:
loader = CSVLoader(file_path="./data/BetterSleep_QnADataset.csv")

csv_data = loader.load()
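As a rough mental model of what the loader hands us, each CSV row becomes one document whose text lists the columns as `name: value` lines. The standard-library sketch below mimics that shape; it is an approximation, not the loader's actual code. The column names `Question` and `Answer` are inferred from the search output in this notebook, and the sample rows are invented for illustration.

```python
import csv
import io

# Made-up rows in the assumed two-column layout of the dataset
sample_csv = io.StringIO(
    "Question,Answer\n"
    "What sleep schedule is the most optimal?,Keep fixed bed and wake-up times.\n"
    'Does caffeine affect sleep?,"Yes, especially late in the day."\n'
)


def rows_to_documents(csv_file):
    """Turn every CSV row into one text block with a 'column: value' line per column."""
    return [
        "\n".join(f"{column}: {value}" for column, value in row.items())
        for row in csv.DictReader(csv_file)
    ]


documents = rows_to_documents(sample_csv)
print(documents[0])
```

Because every Q&A pair ends up as its own short document, the retriever can later return a single pair as the context for a question.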

In [260]:
# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
data = text_splitter.split_documents(csv_data)

# load it into Chroma
clear_chroma_db(db)
db = Chroma.from_documents(data, embedding_function)

# query it
query = "What is the best sleep schedule?"
docs = db.similarity_search(query)

# print results
print(docs[0].page_content)

Question: What sleep schedule is the most optimal?
Answer: Research has shown it is better to set fixed times for going to bed and waking up and then adhere to them even on weekends. Our bodies heavily rely on circadian rhythms to regulate sleep patterns. If you go to bed at the same time every day it helps your body to acquire a steady habit of production of melatonin prior to your bedtime. Melatonin is a very important hormone that is related to circadian rhythms, day and night cycles. Increased level of melatonin is shown to help falling asleep faster and having a more quality sleep.


Great, clearly our CSV data has been ingested into the vector DB successfully, and querying it behaves as expected. Our dataset is small, therefore a simple similarity search should suffice. It must be noted, however, that had we needed to pick top `k` chunks with even more precision and relevance (e.g. from a very large database with hundreds of thousands of records), this technique could be made more advanced by combining it with a **Reranker** - basically, a separate DNN model that reorders the results taking a deeper look into the nuances of semantics (an example description of that can be found [here](https://superlinked.com/vectorhub/articles/optimizing-rag-with-hybrid-search-reranking)).
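The two-stage idea (cheap vector search first, more careful reordering second) can be sketched as follows. Everything here is a toy: the vectors are hand-written, and `word_overlap` is a hypothetical stand-in for a real reranker model, used only so that the example runs without extra dependencies.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(query_vector, store, k):
    """First stage: take the k entries whose vectors are most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine(query_vector, item["vector"]), reverse=True)
    return ranked[:k]


def rerank(query, candidates, score_fn):
    """Second stage: reorder the shortlisted candidates with a (usually costlier) scorer."""
    return sorted(candidates, key=lambda item: score_fn(query, item["text"]), reverse=True)


def word_overlap(query, text):
    """Hypothetical scorer: the share of query words that also appear in the text."""
    query_words = set(query.lower().split())
    return len(query_words & set(text.lower().split())) / len(query_words)


toy_store = [
    {"text": "melatonin supplements and dosage", "vector": [0.9, 0.2]},
    {"text": "best sleep schedule for adults", "vector": [0.8, 0.4]},
    {"text": "caffeine and alertness", "vector": [0.1, 0.9]},
]

first_stage = retrieve([1.0, 0.1], toy_store, k=2)
reranked = rerank("best sleep schedule", first_stage, word_overlap)
print([item["text"] for item in first_stage])
print([item["text"] for item in reranked])
```

In this toy run the vector search ranks the melatonin entry first, while the second stage promotes the entry that actually talks about sleep schedules.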

The last missing puzzle piece is to modify our previous Llama-2 setup so that our prompt is augmented behind the scenes with our custom Q&A data. The beauty of LangChain is that there is a ready-made component for such a chain: `RetrievalQA`. We refer to the [official documentation](https://js.langchain.com/docs/modules/chains/popular/vector_db_qa) and, especially, [this article](https://www.mlexpert.io/blog/langchain-quickstart-with-llama-2#simple-retrieval-augmented-generation-rag); both sources greatly helped us in setting it up.

In [265]:
from langchain.chains import RetrievalQA

We must also rewrite our prompt a bit (partially based on [this template](https://smith.langchain.com/hub/rlm/rag-prompt)), because the prompt now has to carry not only the question but also the supplementary context (the core mechanism of RAG):

In [332]:
prompt_template="""[INST]
<<SYS>>
You are helpful, respectful, caring and honest assistant for question-answering tasks. You do not have expressions or emotions. You are objective and provide everything that is helpful to know given the question, but you are not chatty, be concise and do not use more than three sentences. Use the following pieces of retrieved context to answer the question to the best of your ability. If you don't know the answer, just say that you don't know.
<</SYS>>

USER: {question}

CONTEXT: {context}

ASSISTANT: 
[/INST]
"""

prompt = PromptTemplate(template=prompt_template, input_variables=["question", "context"])

Now we can prepare our RAG-enabling chain:

In [333]:
qa_chain = RetrievalQA.from_chain_type(
    llm,

    # We use k=1 to always pick only the most relevant Q&A pair. Our dataset is small so that should suffice and we won't bloat the prompt
    retriever=db.as_retriever(search_kwargs={"k": 1}),
    chain_type_kwargs={"prompt": prompt}
)
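Conceptually, a chain like this does three things: fetch the relevant chunks, paste ("stuff") them into the prompt, and hand the result to the model. The sketch below imitates that flow with stand-ins so it runs on its own. `fake_retrieve` and `fake_generate` are hypothetical placeholders for the retriever and the LLM, the template is a shortened version of ours, and joining documents with a blank line is an assumption.

```python
TEMPLATE = "USER: {question}\n\nCONTEXT: {context}\n\nASSISTANT: "


def answer_with_context(question, retrieve, generate, template=TEMPLATE):
    """Retrieve documents, stuff them into the prompt, then ask the model."""
    docs = retrieve(question)
    context = "\n\n".join(docs)  # all retrieved chunks go into a single context block
    final_prompt = template.format(question=question, context=context)
    return generate(final_prompt)


def fake_retrieve(question):
    """Placeholder retriever: always returns one canned Q&A pair."""
    return ["Question: What sleep schedule is the most optimal?\nAnswer: Keep fixed bed and wake-up times."]


def fake_generate(final_prompt):
    """Placeholder LLM: echoes the prompt so we can inspect what the model would receive."""
    return final_prompt


augmented = answer_with_context("Tell me some tips for the best sleep schedule", fake_retrieve, fake_generate)
print(augmented)
```

The user only ever types the question; the context block is added without them seeing it, which is why the debug output is useful for verifying what was actually sent to the model.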

Oof, let's finally test it!

In [335]:
question = "Tell me some tips for the best sleep schedule"

In [339]:
result = qa_chain.invoke(question)

[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "Tell me some tips for the best sleep schedule"
}
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain > 4:chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "Tell me some tips for the best sleep schedule",
  "context": "Question: What sleep schedule is the most optimal?\nAnswer: Research has shown it is better to set fixed times for going to bed and waking up and then adhere to them even on weekends. Our bodies heavily rely on circadian rhythms to regulate sleep patterns. If you go to bed at the same time every day it helps your body to acquire a steady habit of production of melatonin prior to your bedtime. Melatonin is a very important hormone that is related to circadian rhythms, day and night cycles. 

Llama.generate: prefix-match hit


To get the best sleep schedule, it's recommended to establish a fixed bedtime and wake-up time every day, including weekends. This helps regulate your body's internal clock, or circadian rhythm, which relies on melatonin production to control sleep patterns. By going to bed at the same time each day, your body can adjust to producing melatonin at the right time, leading to faster falling asleep and better quality sleep.


llama_print_timings:        load time =     820.17 ms
llama_print_timings:      sample time =      21.39 ms /    98 runs   (    0.22 ms per token,  4580.94 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =   10382.56 ms /    98 runs   (  105.94 ms per token,     9.44 tokens per second)
llama_print_timings:       total time =   10779.77 ms /    99 tokens


[36;1m[1;3m[llm/end][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain > 4:chain:LLMChain > 5:llm:LlamaCpp] [10.79s] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": "To get the best sleep schedule, it's recommended to establish a fixed bedtime and wake-up time every day, including weekends. This helps regulate your body's internal clock, or circadian rhythm, which relies on melatonin production to control sleep patterns. By going to bed at the same time each day, your body can adjust to producing melatonin at the right time, leading to faster falling asleep and better quality sleep.",
        "generation_info": null,
        "type": "Generation"
      }
    ]
  ],
  "llm_output": null,
  "run": null
}
[36;1m[1;3m[chain/end][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain > 4:chain:LLMChain] [10.79s] Exiting Chain run with output:
[0m{
  "text": "To get the best sleep schedule, it's recommended to establish a fixed bedtime and

Simply awesome! The response we got is sensible, concise, and factually correct, and its generation did not take long (we get roughly 10 tokens per second, which is just fine for streaming). We can also see in the debugging information that the relevant Q&A pair is used as part of the hidden augmented prompt. Everything is in its right place, and now we only need to package this code into a Python project and set it up so that it can be served to users through a simple web UI (any open-source chatbot UI would suffice, for example, [Gradio](https://www.gradio.app/) mentioned previously).
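The throughput figure quoted above can be re-derived from the `eval time` lines of the `llama_print_timings` output. The small calculation below uses the numbers printed in this notebook; the estimate for a full-length answer assumes the generation rate stays constant, which is a simplification.

```python
def tokens_per_second(generated_tokens, eval_time_ms):
    """Generation throughput from the 'eval time' line of llama_print_timings."""
    return generated_tokens / (eval_time_ms / 1000)


rag_rate = tokens_per_second(98, 10382.56)       # the RAG answer above
vanilla_rate = tokens_per_second(535, 57520.61)  # the earlier answer without retrieval

# Rough worst case: time to generate a response that uses all max_tokens=1024
worst_case_seconds = 1024 / rag_rate

print(f"RAG run: {rag_rate:.2f} tokens/s, vanilla run: {vanilla_rate:.2f} tokens/s")
print(f"A 1024-token answer would take roughly {worst_case_seconds:.0f} s")
```

Both runs land at a similar rate, so the retrieval step does not appear to slow down generation itself; the three-sentence limit in the prompt is what keeps the response time short.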