# Walkthrough 3 - Chatting with your own data

During the 30 days, a number of people have indicated that LLMs do not understand their context very well. LLMs are trained on very large amounts of text, but that training data is broad rather than deep, which makes the models potentially less useful in specialist domains. We can overcome this to some extent using Prompt Engineering, where we supply the relevant information as context in the prompt.

However, finding and including the relevant context each time you create a prompt quickly becomes tedious. We also have to consider **Data Privacy**: what information we include in our prompts, who might have access to that data and how it might be used. Any regulations that apply to our work also need to be taken into account.

There are 2 other challenges that we can often encounter:
1. **Hallucination** - This is where a model generates output that is fictitious. This is understandable if we view LLMs as a **Probability Machine** rather than a **Truth Machine**. Because the output can be very convincing, depending on the model, fictitious content is hard to spot, and this reduces our confidence in what the model produces.
2. **Information Cut-Off** - Models *learn* a compressed world-view through their training, but the information they learn from has a cut-off date. The older the model, the further back in time that cut-off is. This means that a model can generate output based on outdated information, which again reduces its usefulness in some contexts.






We can reduce the impact of these challenges using an approach called Retrieval Augmented Generation (or RAG for short).

>A fairly non-technical description of RAG can be found at https://research.ibm.com/blog/retrieval-augmented-generation-RAG. It is worth reading through the post and/or watching the short video.

With RAG, we store our context-specific documentation in a database (called a Vector Database). In practice, each document is broken down into overlapping chunks of text, and each chunk undergoes a process called Embedding, which converts it into a numerical representation. This representation has a special property: document chunks with similar context and semantic meaning have similar representations.
The mathematics behind this is complex, but if you want to read more on Embeddings, see this post: https://towardsdatascience.com/how-i-explained-word-embeddings-to-my-non-technical-colleagues-52ced76cf3bb
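
To make the "similar meaning, similar representation" idea concrete, here is a minimal sketch. It does not use a real embedding model: the hypothetical `toy_embed` function simply counts words from a tiny hand-picked vocabulary, whereas a real model produces vectors with hundreds of dimensions that capture meaning rather than exact words. The sketch only illustrates how the similarity between two vectors can be measured, using cosine similarity as one common choice.

```python
import math

# Hypothetical mini vocabulary, chosen only for this illustration.
VOCAB = ["login", "password", "user", "invoice", "payment", "refund"]

def toy_embed(text: str) -> list[float]:
    """Turn text into a vector of word counts over VOCAB (a stand-in for a real embedding)."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

chunk_a = toy_embed("user enters password on the login page")
chunk_b = toy_embed("login fails when the password is wrong")
chunk_c = toy_embed("refund the payment shown on the invoice")

print(cosine_similarity(chunk_a, chunk_b))  # higher: both chunks are about logging in
print(cosine_similarity(chunk_a, chunk_c))  # 0.0: no vocabulary words in common
```

The two login-related chunks score much closer to 1.0 than the unrelated pair, which is the property a Vector Database relies on when it searches for related chunks.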

When we create a prompt, we check the Vector Database for document chunks that are closely related to the prompt and add these as context to the prompt and send these to the LLM.

The LLM then uses the context when generating output in response to the prompt. In this way we can:
* Overcome the LLM's lack of knowledge about your context
* Reduce Hallucination
* Introduce new data that overcomes the information cut-off problem
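
The retrieve-then-augment flow described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not how a real Vector Database works: the hypothetical `score` function counts shared words instead of comparing embeddings, and the prompt wording and example chunks are made up.

```python
def score(query: str, chunk: str) -> int:
    """Count distinct words shared by query and chunk (a crude stand-in for vector similarity)."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks that score highest against the query."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

def build_rag_prompt(query: str, chunks: list[str]) -> str:
    """Place the retrieved chunks ahead of the question so the LLM can use them."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks))
    return (
        "Use the following context to answer the question.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
        "Answer:"
    )

# Made-up document chunks, standing in for the contents of a vector database.
stored_chunks = [
    "Accounts are locked after five failed login attempts",
    "Invoices are generated on the first day of each month",
    "Locked accounts are released by the support team",
]

print(build_rag_prompt("how are locked accounts released", stored_chunks))
```

Only the two account-related chunks end up in the prompt; the invoice chunk is left out because it shares almost nothing with the question.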


The ideas behind RAG are fairly new and evolving and platforms such as Azure AI or Amazon SageMaker already have fantastic support for this type of approach.

The challenge of Data Privacy can still exist depending upon how RAG is implemented - to keep your data private, you would need to create a private Vector Database and LLM instance (either using a cloud provider or hosting these yourself).

In this walkthrough we will create a very simple RAG application that allows you to chat with your own documents.
* The Vector Database will run in memory for this walkthrough so no information is persisted
* The LLM is a small Open-Source LLM so that it can be run in a Colab notebook.
* We will store the documents you are interested in "chatting" with in Google Drive.

If you adopt this approach within your own organisation, you would likely want to use a larger model for better results and have a persistent Vector Store for your documents. Setting up a RAG based solution can require a fair amount of code but the code is generally re-usable across applications.

# Let's Get Started
At a high-level we will
1. Install the required dependencies
2. Download an LLM Model (from HuggingFace)
3. Upload the documents you want to "chat" with
4. Initialise our Vector Database (Chroma DB) with the documents
5. Load and configure our LLM
6. Start Chatting

For this walkthrough you can ignore the code. It is included to give an idea of what is required, but you do not need to understand the code to understand how RAG works. If code isn't your thing, just focus on the descriptions and run the code cells.

**IMPORTANT**

The LLM model we are running needs a GPU so, before we begin running code cells we need to switch to using a GPU.

> You can do this by selecting "Runtime -> Change Runtime Type" from the Colab menu.
>
> Then select the "T4 GPU" option and click on "Save"

This will give you access to run on a small GPU processor for a period of time.
Do this now before continuing with the walkthrough.

Notes:
* Occasionally Colab will not have any available GPUs for you to use; if this happens you will need to try again later.
* The name of the Runtime may be different in different regions so just pick one that has "GPU" in the label.
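
If you want to confirm that the runtime switch worked, a small check like the one below can help. It is an optional sketch rather than part of the original walkthrough: `describe_device` is a hypothetical helper name, and it relies only on PyTorch's `torch.cuda.is_available()` and `torch.cuda.get_device_name()`.

```python
def describe_device() -> str:
    """Report whether PyTorch can see a GPU in the current runtime."""
    try:
        import torch  # pre-installed on Colab; may be missing elsewhere
    except ImportError:
        return "PyTorch is not installed in this environment"
    if torch.cuda.is_available():
        return f"GPU available: {torch.cuda.get_device_name(0)}"
    return "No GPU detected: check Runtime -> Change Runtime Type"

print(describe_device())
```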

## 1. Installing Dependencies
Run the following code cells to load the dependencies into Colab and import them into memory. This can take a few minutes to complete.

We are using various libraries and the key ones are
* HuggingFace (https://huggingface.co/)
* LangChain (https://www.langchain.com/)
* ChromaDB (vector database) https://github.com/chroma-core/chroma

In [None]:
!pip install -q transformers peft accelerate bitsandbytes safetensors sentencepiece streamlit chromadb langchain sentence-transformers gradio pypdf

In [None]:
# import dependencies
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

import gradio as gr
from google.colab import drive

import chromadb
# NOTE: these import paths follow older LangChain releases; newer releases
# may expect the same classes to be imported from langchain_community instead
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import PyPDFDirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
## 2. Downloading a Model from HuggingFace
HuggingFace is a Machine Learning community and platform where pre-trained models and datasets are built and shared.
We are going to use an open source LLM called `Zephyr-7b`, a 7 Billion parameter model. This might seem large, but it is small next to the models behind commercial chat services: GPT-3.5, for example, has been reported (though never officially confirmed) to have around 20 Billion parameters.
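
A rough back-of-the-envelope calculation shows why the code below loads the model in 4-bit form (`load_in_4bit=True`). The figures are approximations that count only the model weights and ignore every other use of GPU memory, and the 16 GB value is the nominal memory of a T4 GPU.

```python
PARAMETERS = 7_000_000_000  # approximate parameter count of a 7B model
T4_MEMORY_GB = 16           # nominal memory of a T4 GPU

def weights_size_gb(parameters: int, bits_per_parameter: int) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 10**9 bytes)."""
    return parameters * bits_per_parameter / 8 / 1e9

for bits in (32, 16, 4):
    size = weights_size_gb(PARAMETERS, bits)
    share = 100 * size / T4_MEMORY_GB
    print(f"{bits:>2}-bit weights: ~{size:.1f} GB (~{share:.0f}% of a {T4_MEMORY_GB} GB GPU)")
```

At 32 bits per parameter the weights alone exceed the GPU's memory, at 16 bits they leave very little room for anything else, and at 4 bits they take up only a fraction of it.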

The following code will download this model for us to use within Colab - this can take a while to download.

In [None]:
# specify the HuggingFace model name and the 4-bit quantization settings
model_name = "anakin87/zephyr-7b-alpha-sharded"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
    )

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config
    )


When we provide a prompt to an LLM it is converted from the characters and words we see into a numerical representation. This process is called Tokenization. So we need a component called a Tokenizer - the following code cell creates this for us to use.

If you are interested in finding out more about this process I would recommend taking the free HuggingFace NLP course (https://huggingface.co/learn/nlp-course)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.bos_token_id = 1  # Set beginning of sentence token id
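
To make the idea of Tokenization concrete, here is a toy word-level tokenizer. It is a deliberately simplified sketch: real tokenizers, including the one loaded above, work on sub-word pieces and have vocabularies of tens of thousands of entries, and the function names here are made up for the illustration.

```python
def build_vocab(texts: list[str]) -> dict[str, int]:
    """Assign an integer id to every distinct word; id 0 is reserved for unknown words."""
    vocab = {"<unk>": 0}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Convert text into a list of token ids."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def decode(ids: list[int], vocab: dict[str, int]) -> str:
    """Convert token ids back into text."""
    id_to_word = {i: w for w, i in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)

toy_vocab = build_vocab(["the test passed", "the build failed"])
toy_ids = encode("the test failed", toy_vocab)
print(toy_ids)                     # [1, 2, 5]
print(decode(toy_ids, toy_vocab))  # the test failed
```

The model only ever sees the list of numbers; the mapping back to words happens when its output is decoded.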

## 3. Upload the documents you want to "chat" with
We are now going to upload the documents you want to use as your context into your Google Drive.

**REMEMBER** to check that your Local IT Policy allows the documents you upload to be stored in Google Drive.

**NOTE** To explore the use of RAG the content of the PDF files doesn't really matter, but if you are building a RAG based application for a specific purpose you would use documents that fit that purpose.

1. Log in to your Google Drive
2. At the root of your Drive (`My Drive`), create a folder called `mot-day-28-docs`
3. Upload a set of example PDF files into this folder.

For this walkthrough we are restricting our context documents to be PDFs, but additional code could be added to accept other types of documents.



Once the documents are uploaded, you need to mount your Google Drive so that it is accessible to Colab.
Running the following cell will ask you to authorise the mounting of your Google Drive.


In [None]:
# mount google drive and specify folder path
drive.mount('/content/drive')
folder_path = '/content/drive/MyDrive/mot-day-28-docs/'

In [None]:
import os

print("The following files were found in the 'mot-day-28-docs' folder")
for file in os.listdir(folder_path):
  print(f"\t{file}")
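
Because only PDFs are read in this walkthrough, it can be useful to see which files in the folder will actually be picked up. The sketch below is an optional addition: `split_by_type` is a hypothetical helper, and the demonstration uses a temporary folder with empty files so that it runs anywhere.

```python
import tempfile
from pathlib import Path

def split_by_type(folder: str) -> tuple[list[str], list[str]]:
    """Return (pdf_names, other_names) for the files directly inside folder."""
    pdfs, others = [], []
    for path in Path(folder).iterdir():
        if path.is_file():
            # compare the extension case-insensitively so ".PDF" also counts
            (pdfs if path.suffix.lower() == ".pdf" else others).append(path.name)
    return sorted(pdfs), sorted(others)

# Demonstration with a temporary folder containing empty placeholder files.
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ("plan.pdf", "notes.txt", "report.PDF"):
        (Path(demo_folder) / name).touch()
    demo_pdfs, demo_others = split_by_type(demo_folder)
    print(demo_pdfs)    # ['plan.pdf', 'report.PDF']
    print(demo_others)  # ['notes.txt']
```

In the notebook, calling `split_by_type(folder_path)` after mounting the drive would show which uploaded files are PDFs and which would be ignored.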

## 4. Embed your documents into the Vector Database
We will use a PDF Reader to read in the documents that you stored in your `mot-day-28-docs` folder on Google Drive and then a Text Splitter to create a set of overlapping document chunks.

We will then use a Sentence Embedder to create embeddings of these chunks and store them in the Vector Database (Chroma DB).

Depending on the size and number of PDFs you added to the folder, this can take some time to complete.
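
The effect of the chunk size and overlap settings is easier to see on a tiny example. The sketch below is a simplification: the hypothetical `split_with_overlap` function cuts fixed-size windows of characters, whereas the `RecursiveCharacterTextSplitter` used in the next cell prefers to break on separators such as paragraphs and spaces. The sizes are also much smaller than the 1000 and 100 used in this walkthrough.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in split_with_overlap(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a chunk boundary is less likely to be cut off from its context.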

In [None]:
# load pdf files
loader = PyPDFDirectoryLoader(folder_path)
documents = loader.load()

# split the documents into small, overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
all_splits = text_splitter.split_documents(documents)

# specify embedding model (using huggingface sentence transformer)
embedding_model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}
embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name, model_kwargs=model_kwargs)

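# embed the document chunks and store them in Chroma
# (the "chroma_db" folder lives on the Colab runtime's temporary disk, so it is discarded with the runtime)
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db")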

# specify the retriever
retriever = vectordb.as_retriever()

## 5. Create a RAG Pipeline
To interact with our model and automate the process of finding the relevant context from our Vector Database, we will create a RAG pipeline using the HuggingFace and LangChain libraries.

A Pipeline is a sequence of steps that are chained together and run as a single unit. HuggingFace makes this very easy for us.

Run the following code cell to create the pipeline.

In [None]:
# build huggingface pipeline for using zephyr-7b-alpha
# (stored as text_pipeline so the imported pipeline() function is not overwritten,
# which means this cell can be re-run without errors)
text_pipeline = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        use_cache=True,
        device_map="auto",
        max_length=2048,
        do_sample=True,
        top_k=5,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
)

# specify the llm
llm = HuggingFacePipeline(pipeline=text_pipeline)
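
The pipeline above sets `do_sample=True` and `top_k=5`, which means each new token is drawn at random from the five most probable candidates rather than always taking the single most likely one. The sketch below illustrates the idea with made-up probabilities and a smaller `k`; it is not the library's implementation, and `sample_top_k` is a hypothetical name.

```python
import random

def sample_top_k(probabilities: dict[str, float], k: int, rng: random.Random) -> str:
    """Pick the next token at random from the k most probable candidates."""
    top = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [weight for _, weight in top]
    # random.choices normalises the weights, so they do not need to sum to 1
    return rng.choices(tokens, weights=weights, k=1)[0]

# Made-up probabilities for the token that follows "The test ..."
next_token_probabilities = {
    "passed": 0.40, "failed": 0.30, "ran": 0.15,
    "was": 0.10, "banana": 0.04, "xylophone": 0.01,
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
print([sample_top_k(next_token_probabilities, k=3, rng=rng) for _ in range(5)])
```

With `k=3` the unlikely tokens can never be chosen, while the output still varies between the three most probable ones. This is also why re-running the same prompt can give different answers.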



In [None]:
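# build conversational retrieval chain with memory (rag) using langchain
def create_conversation(query: str, chat_history: list) -> tuple:
    try:
        # a fresh memory object is created on every call
        memory = ConversationBufferMemory(
            memory_key='chat_history',
            return_messages=False
        )
        qa_chain = ConversationalRetrievalChain.from_llm(
            llm=llm,
            retriever=retriever,
            memory=memory,
            get_chat_history=lambda h: h,
        )

        result = qa_chain({'question': query, 'chat_history': chat_history})
        chat_history.append((query, result['answer']))
        return '', chat_history

    except Exception as e:
        # store the error as text so it can be displayed like any other answer
        chat_history.append((query, str(e)))
        return '', chat_history

def ask_question(query: str) -> tuple:
    # run a single query with an empty chat history
    response = create_conversation(query, [])
    gen_out = response[1][0][1]
    # the generated text is expected to echo the RAG prompt, which ends with this marker
    response_start_token = "Helpful Answer:"
    idx = gen_out.find(response_start_token)
    if idx == -1:
        # marker not found (for example an error message): treat everything as the response
        return '', gen_out
    rag_prompt = gen_out[:idx]
    response_text = gen_out[idx:]

    return rag_prompt, response_text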


## 6. Chat with your documents
We are now ready to chat with your documents.

### 6.1 Using the pipeline
When we use RAG, the pipeline automatically finds the relevant context from our uploaded documents and adds it to our prompt using a template that includes some specific instructions.

In the following cell you can replace the default prompt with a prompt of your own. Ask a question where the answer should be within the documents you uploaded. Make sure your prompt is between the quotation marks.

Then run the code cell and observe the content of the RAG PROMPT compared to the prompt you provided to understand what was added.

**NOTE:** You can change the prompt as many times as you want and re-run the cell to experiment with your local LLM and RAG.



In [None]:
my_prompt = "Default prompt - replace this with your own prompt to ask a question that can be answered using your uploaded documents"

rag_prompt, response_text = ask_question(my_prompt)

print(f"RAG PROMPT\n{rag_prompt}")
print(f"\nRAG RESPONSE\n{response_text}")

### 6.2 Chat Interface

If you want a ChatGPT style interface we can run a demo app for your model using a platform called Gradio.

**NOTE:** This is a temporary application that is publicly available via the unique link generated, so only run the Gradio code below if the documents you used do not contain proprietary information.

If you run the code below and open the Gradio application, you can chat with your documents using normal prompts and observe the document chunks that the RAG pipeline extracted for your query and sent to the LLM to generate a response from.

In [None]:
# build gradio ui
with gr.Blocks() as demo:

    chatbot = gr.Chatbot(label='My Chatbot')
    msg = gr.Textbox()
    clear = gr.ClearButton([msg, chatbot])

    msg.submit(create_conversation, [msg, chatbot], [msg, chatbot])

demo.launch()

# Questions for Reflection
We have now implemented a basic RAG based approach to enhance the LLM Generation. It's only a basic implementation running with a small LLM and so the output generated may not be perfect but it should provide some insight into how we can inject context into an LLM.

Before closing this workbook, reflect on the following questions:

1. Did the RAG based LLM use context from your uploaded documents?
2. Do you think that the RAG based LLM included relevant content into the prompt?
3. Assuming your team wants to incorporate an LLM into the Testing toolbelt, what are the use cases for using this approach in your team? What documents might you load into the Vector Database so that prompts have access to them?

