### Developer RAG Chatbot

In this notebook, we build a basic developer chatbot that serves as an example RAG workflow for developers. The example uses RAPIDS cuDF source code and API documentation as a representative dataset of a developer's codebase. We use this dataset to create a code assistant that can answer questions about cuDF and provide examples of using the API. The goal is to help developers explore and get up to speed with a codebase, not necessarily to generate complete code for them.

To build this application, we use Llama3 70B hosted on NVIDIA AI Foundation as the LLM and E5-Large (e5-large-v2) as the embedding model. We store the embeddings in a FAISS vector database and use LangChain to tie the pieces together. Finally, we use Gradio as the interface for the chatbot.

![title](diagram.png)

### Prerequisites

1. Set up your NVIDIA NGC account and generate an API key: https://python.langchain.com/docs/integrations/chat/nvidia_ai_endpoints/#setup
2. An NVIDIA GPU with at least 4 GB of memory, which is required to run the embedding model and create the vector stores

### Step 1: Pull cuDF Dataset
First, we pull the cuDF 24.04 release from GitHub.

In [None]:
!wget https://github.com/rapidsai/cudf/archive/refs/tags/v24.04.00.tar.gz
!tar -xzf v24.04.00.tar.gz

### Step 2: Parse Source Code and Documentation
Next, we parse the relevant Python source code and related documentation.

In [None]:
from langchain.document_loaders import DirectoryLoader, TextLoader, PythonLoader
import os

CLONE_DIR = "cudf-24.04.00"
SOURCE_DIR = os.path.join(CLONE_DIR, "python", "cudf", "cudf")
SOURCE_DOC_DIR = os.path.join(CLONE_DIR, "docs", "cudf", "source", "user_guide")

text_loader_kwargs={'autodetect_encoding': True}

code_loader = DirectoryLoader(SOURCE_DIR, glob="**/*.py", use_multithreading=True, loader_cls=PythonLoader)
code_data = code_loader.load()
print("Code files found: " + str(len(code_data)))

#delete index files to avoid irrelevant results
doc_index = os.path.join(SOURCE_DOC_DIR, "index.md")
api_doc_index = os.path.join(SOURCE_DOC_DIR, "api_docs", "index.rst")
if os.path.isfile(doc_index):
    os.remove(doc_index)
if os.path.isfile(api_doc_index):
    os.remove(api_doc_index)

#pass text_loader_kwargs so TextLoader can auto-detect file encodings
doc_loader = DirectoryLoader(SOURCE_DOC_DIR, glob="**/*.md", use_multithreading=True, loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
doc_data = doc_loader.load()
doc_loader = DirectoryLoader(SOURCE_DOC_DIR, glob="**/*.ipynb", use_multithreading=True, loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
doc_data = doc_data + doc_loader.load()
print("Documentation files found: " + str(len(doc_data)))

api_loader = DirectoryLoader(SOURCE_DOC_DIR, glob="**/*.rst", use_multithreading=True, loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
api_data = api_loader.load()
print("API files found: " + str(len(api_data)))

### Step 3: Split Data to Prepare for Embedding
In this step, we split our data into smaller chunks for the embedding process.

**Note: It may take several minutes for the e5-large-v2 model to download.**

In [None]:
import time
from langchain.text_splitter import (Language, SentenceTransformersTokenTextSplitter, RecursiveCharacterTextSplitter)

TEXT_SPLITTER_MODEL = "intfloat/e5-large-v2"
TEXT_SPLITTER_CHUNK_SIZE = 512
TEXT_SPLITTER_CHUNK_OVERLAP = 256

text_splitter = SentenceTransformersTokenTextSplitter(
    model_name=TEXT_SPLITTER_MODEL,
    chunk_size=TEXT_SPLITTER_CHUNK_SIZE,
    chunk_overlap=TEXT_SPLITTER_CHUNK_OVERLAP,
)

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=512, chunk_overlap=256)

start_time = time.time()

code_docs = python_splitter.split_documents(code_data)

documents = text_splitter.split_documents(doc_data)

api_docs= text_splitter.split_documents(api_data)

print(f"--- {time.time() - start_time} seconds ---")
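
To make the effect of the chunk size and overlap settings concrete, here is a minimal sketch of overlapping windows over a list of tokens. It is an illustration only: `split_with_overlap` is a hypothetical helper, and the real splitters above work on model tokens and language-aware separators, not on this simple sliding window.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks

# Toy input: ten integer "tokens" stand in for real model tokens.
tokens = list(range(10))
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=2)
print(chunks)
```

With an overlap of half the chunk size, as in the settings above (512 and 256), most tokens appear in two consecutive chunks, which helps keep context that would otherwise be cut at a chunk boundary.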

### Step 4: Generate Embeddings and Store Embeddings in the Vector Store 
Next, we generate our embeddings from our dataset, and store them in the appropriate vector stores.
This process will generally take several minutes, depending on your hardware.
A cached version of each vector store will be saved locally for use in future notebook runs.

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
import time
import os

start_time = time.time()

#load embeddings model
model_name = "intfloat/e5-large-v2"
model_kwargs = {"device": "cuda"}
encode_kwargs = {"normalize_embeddings": False}
embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
    show_progress=True
)

vectorstore_docs_path = "doc_index"
vectorstore_code_path = "code_index"
vectorstore_api_path = "api_index"

vectorstore_doc = None
vectorstore_code = None
vectorstore_api = None



#load or create individual vectorstores as appropriate

if os.path.exists(vectorstore_docs_path):
    #load doc vectorstores
    vectorstore_doc = FAISS.load_local(vectorstore_docs_path, embeddings, allow_dangerous_deserialization=True)
else:
    #run to create doc vectorstore
    vectorstore_doc = FAISS.from_documents(documents, embeddings)
    vectorstore_doc.save_local(vectorstore_docs_path)

if os.path.exists(vectorstore_code_path):
    #load code vectorstore
    vectorstore_code = FAISS.load_local(vectorstore_code_path, embeddings, allow_dangerous_deserialization=True)
else:
    #create code vectorstore
    vectorstore_code = FAISS.from_documents(code_docs, embeddings)
    vectorstore_code.save_local(vectorstore_code_path)

if os.path.exists(vectorstore_api_path):
    #load api vectorstore
    vectorstore_api = FAISS.load_local(vectorstore_api_path, embeddings,allow_dangerous_deserialization=True)
else:
    #create api vectorstore
    vectorstore_api = FAISS.from_documents(api_docs, embeddings)
    vectorstore_api.save_local(vectorstore_api_path)
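
Conceptually, a vector store keeps one vector per chunk and returns the chunks whose vectors lie closest to the query vector. The sketch below is a brute-force stand-in written for illustration, not how FAISS is implemented: `ToyVectorStore` is a hypothetical class, the two-dimensional vectors are made up rather than real E5 embeddings, and the use of Euclidean distance is an assumption.

```python
import math

class ToyVectorStore:
    """Brute-force stand-in for a vector index: keeps (vector, text) pairs in a list."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        self._items.append((list(vector), text))

    def search(self, query, k=3):
        """Return the k texts whose vectors are closest to query (Euclidean distance)."""
        scored = [(math.dist(query, vec), text) for vec, text in self._items]
        scored.sort(key=lambda pair: pair[0])  # smallest distance first
        return [text for _, text in scored[:k]]

# Made-up vectors: the labels only mark which chunk each vector belongs to.
store = ToyVectorStore()
store.add([1.0, 0.0], "DataFrame.size")
store.add([0.0, 1.0], "read_csv")
store.add([0.9, 0.1], "DataFrame.shape")
top = store.search([1.0, 0.0], k=2)
print(top)
```

A real index avoids comparing the query against every stored vector, which is what makes it practical for the number of chunks produced in Step 3.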


### Step 5: Test Embeddings
Here we pass a simple test query to confirm that we are pulling relevant chunks from our code vector store. The retrieved context should include the 'size' definition from the frame.py script.

In [None]:
retriever_docs = vectorstore_code.as_retriever(search_kwargs= {"k":3})

test_docs = retriever_docs.get_relevant_documents("How can I check the size of my dataframe?")

for doc in test_docs:
    print(doc, end="\n")
    print("\n")


### Step 6: Connect to LLM
Here we create the connection to the Llama3-70b model via the NVIDIA AI Foundation Endpoint.

In [None]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA
import getpass

#if you haven't already passed in your NVIDIA API KEY in the docker file, you can enter it manually here
if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvapi_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key

#Try using the llama3-8b-instruct model to see how results can differ!
llm = ChatNVIDIA(
    temperature=0.01,
    max_tokens=1024,
    model="meta/llama3-70b-instruct",
    stream= True
)

### Step 7: Create Prompt Pipeline
Next we create the prompt for our chatbot. We've broken it into several pieces to make it easier to understand the individual portions of the prompt. We bring the individual pieces of the prompt together using a pipeline prompt at the end of the section.

In [None]:
from langchain.prompts.pipeline import PipelinePromptTemplate
from langchain.prompts import PromptTemplate

#Prompt template (uses Llama 2-style [INST] and <<SYS>> tags)
full_template = """ <s>[INST] <<SYS>>
{introduction}
{example}
<</SYS>>
|
{start} [/INST]"""

full_prompt = PromptTemplate.from_template(full_template)

introduction_template = """ You are an expert on the RAPIDS cuDF framework. Only provide answers around cuDF functionality. Don't return answers for topics that aren't related to cuDF. If you don't know the answer, just say that you don't know, don't try to make up an answer."""
introduction_prompt = PromptTemplate.from_template(introduction_template)

example_template = """Here's an example of an interaction:

Question: {example_q}
Answer: {example_a}

Use the following context to answer the user's question. Context: {context} Chat History: {history}  Only return the helpful answer below and nothing else. Don't make up functions, variables, or properties. Only include functions, variables, or properties for which you have a source. Provide only a single, best example when answering the question."""
example_prompt = PromptTemplate.from_template(example_template)

start_template = """  Assume the user is asking about the Python implementation. Don't provide an answer if it's unrelated to cuDF. Question: {question} Answer:"""
start_prompt = PromptTemplate.from_template(start_template)

input_prompts = [
    ("introduction", introduction_prompt),
    ("example", example_prompt),
    ("start", start_prompt),
]
pipeline_prompt = PipelinePromptTemplate(
    final_prompt=full_prompt, pipeline_prompts=input_prompts
)
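
The idea behind a pipeline prompt can be sketched with plain string formatting: each sub-template is filled in first, and the results are then substituted into the final template. This is a simplified illustration, not LangChain's implementation; `compose_prompt` and the shortened templates are hypothetical.

```python
# Shortened, hypothetical templates that mirror the structure used above.
final_template = "SYSTEM:\n{introduction}\n{example}\nUSER:\n{start}"
introduction_template = "You are an expert on {topic}."
example_template = "Context: {context}"
start_template = "Question: {question} Answer:"

def compose_prompt(topic, context, question):
    """Format each sub-template, then substitute the results into the final template."""
    parts = {
        "introduction": introduction_template.format(topic=topic),
        "example": example_template.format(context=context),
        "start": start_template.format(question=question),
    }
    return final_template.format(**parts)

prompt = compose_prompt(
    topic="cuDF",
    context="retrieved chunk text",
    question="How do I get the size?",
)
print(prompt)
```

Keeping the pieces separate makes it easy to change one part, such as the example interaction, without touching the rest of the prompt.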

### Step 8: Create Retrievers
In this section, we create retrievers to access the data in our vector stores. We add additional parameters and filtering to ensure only the most relevant documents are returned.

In [None]:
from langchain.retrievers.merger_retriever import MergerRetriever
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain.retrievers.document_compressors import EmbeddingsFilter

#create merger retriever to combine results from multiple vectorstores
merger_retriever = MergerRetriever(retrievers=[])
retriever_code = vectorstore_code.as_retriever(search_type = "similarity_score_threshold", search_kwargs= {"k":4, "score_threshold": 0.75})
retriever_docs = vectorstore_doc.as_retriever(search_type = "similarity_score_threshold", search_kwargs= {"k":4, "score_threshold": 0.7})
retriever_api = vectorstore_api.as_retriever(search_type = "similarity_score_threshold", search_kwargs= {"k":4, "score_threshold": 0.7})

filter_ordered_by_retriever =  EmbeddingsFilter(embeddings=embeddings, k = 5, sorted = True)

pipeline = DocumentCompressorPipeline(transformers=[filter_ordered_by_retriever])
compression_retriever = ContextualCompressionRetriever(
    base_compressor=pipeline, base_retriever=merger_retriever)

#update merger_retriever based on selected vectorstores
def update_retriever(kb_code, kb_docs, kb_api, merger_retriever):
    retrievers = []

    if kb_code:
        retrievers.append(retriever_code)
    if kb_docs:
        retrievers.append(retriever_docs)
    if kb_api:
        retrievers.append(retriever_api)

    merger_retriever.retrievers = retrievers
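
The two ideas in this step, dropping low-scoring hits and combining results from several stores, can be sketched in plain Python. This is a toy illustration: `filter_by_score` and `interleave` are hypothetical helpers, the file names and scores are made up, and the round-robin merge order is an assumption about how results are combined rather than a statement of the library's behavior.

```python
from itertools import zip_longest

def filter_by_score(hits, score_threshold, k):
    """Keep at most k (text, score) hits whose score meets the threshold, best first."""
    kept = [hit for hit in hits if hit[1] >= score_threshold]
    kept.sort(key=lambda hit: hit[1], reverse=True)
    return kept[:k]

def interleave(*result_lists):
    """Merge several result lists round-robin, skipping lists that have run out."""
    merged = []
    for group in zip_longest(*result_lists):
        merged.extend(hit for hit in group if hit is not None)
    return merged

# Made-up hits: (source name, similarity score).
code_hits = filter_by_score([("frame.py", 0.81), ("utils.py", 0.60)], 0.75, k=4)
doc_hits = filter_by_score([("10min.ipynb", 0.78), ("io.md", 0.72)], 0.70, k=4)
merged = interleave(code_hits, doc_hits)
print(merged)
```

A stricter threshold for one store, like the 0.75 used for code above, means that store contributes fewer but more closely matching chunks.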

### Step 9: Implement Chatbot Logic
In this section, we implement the main logic for our chatbot. This includes the chatbot response function, managing the size of the chat history, and adding sources to the response.

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel, RunnableLambda
import gradio as gr
import re

welcome_message = [(None, "Hello! I'm your cuDF Assistant! How can I help you?")]

example_questions = [["How do I check the size of a data frame?"],
                     ["What are the main differences between the cuDF and pandas APIs?"],
                     ["What is the default data type value returned inside a dataframe when calling get_dummies?"],
                     ["Is output order guaranteed when using the join function?"]                     ]

def choose_chat_response(message, history, knowledge_base):
    if not message or message.isspace():
        yield "Please enter a question."
    else:
        kb_docs = 0 in knowledge_base
        kb_api = 1 in knowledge_base
        kb_code = 2 in knowledge_base

        use_kb = False
        #if any knowledge bases selected, use RAG pipeline
        if kb_code or kb_docs or kb_api:
            update_retriever(kb_code, kb_docs, kb_api, merger_retriever)
            use_kb = True
        yield from chat_response(message, history, use_kb)

#reset chat history
def reset(z):
    return welcome_message, []

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

def limit_chat_history(chat_history, char_limit):
    #parenthesize each conditional so both the user and bot lengths are always counted
    total_chars = sum((0 if user is None else len(user)) + (0 if bot is None else len(bot)) for user, bot in chat_history)
    while chat_history and total_chars > char_limit:
        # Remove the oldest message pair
        removed_user, removed_bot = chat_history.pop(0)
        total_chars -= (0 if removed_user is None else len(removed_user)) + (0 if removed_bot is None else len(removed_bot))

    history_text = ""
    #go through history and add each q+a pair
    for qa_pair in chat_history:
        q = qa_pair[0]
        a = qa_pair[1]

        #initial user input is None due to initialization value
        if q is not None:
            history_text+= ("user: " + q + "\n")
        #only pass along the response without sources so the LLM doesn't learn to include them automatically
        history_text+=("response: " + a.split("Sources:")[0] + "\n")

    return history_text

#get sources from doc metadata and format appropriately
def get_sources(docs):
    try:
        sources=[]
        for doc in docs:
            metadata = doc.metadata
            source = metadata['source']

            if source.endswith('.md') or source.endswith('.ipynb') or source.endswith('.rst'):
                url_start = "https://docs.rapids.ai/api/cudf/stable"
                source = source.replace('cudf-24.04.00/docs/cudf/source', url_start)
                source = source.replace('.md','')
                source = source.replace('.ipynb','')
                source = source.replace('.rst','')

            elif source.endswith('.py'):
                url_start = "https://github.com/rapidsai/cudf/blob/branch-24.04/python/cudf"
                source = source.replace('cudf-24.04.00/python/cudf',url_start)

            if source not in sources:
                sources.append(source)
        if(len(sources) == 0):
            sources.append("No relevant sources found within the selected knowledge bases.")
        return sources
    except Exception as e:
        #return an empty list so callers can still iterate over the result
        print(f"source parse error: {e}")
        return []

def chat_response(message, history, use_kb):

    #prompt is currently around ~1500 chars
    #context is ~500x5 = ~2500 chars
    #Llama3 context limit  8k
    #limiting history to 4k chars for now
    history_text = limit_chat_history(history,4000)

    formatted_context = None
    if use_kb:
        context = compression_retriever.get_relevant_documents(message)
        formatted_context = format_docs(context)

    build_prompt = pipeline_prompt.format_prompt( context= formatted_context, example_q = "How can I check what's in the first row of my dataframe?", example_a = """You can check what's in the first row of your dataframe by using the head() function. For example:

import cudf

# create a sample dataframe
df = cudf.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# print the first row of the dataframe
print(df.head(1))

This will output the first row of the dataframe, which in this case is:

   A  B
0  1  4 """, question = message, history= history_text)

    llm_chain = (
        llm
        | StrOutputParser()
    )

    result = ""
    for txt in llm_chain.stream(build_prompt):
        result += txt
        yield result

    if use_kb:
        sources = get_sources(context)
        ranked_list = [f"{index+1}: {value}" for index, value in enumerate(sources)]
        result = result +"\n\nSources:\n" + "\n".join(ranked_list)

    yield result
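
The history-size management described in this step can be sketched in isolation. The helper below, `trim_history`, is a hypothetical standalone variant written for illustration: it drops the oldest question-and-answer pairs until the remaining text fits a character budget, and it works on a copy rather than modifying the list passed in.

```python
def trim_history(pairs, char_limit):
    """Drop the oldest (user, bot) pairs until the combined length fits char_limit."""
    def pair_len(pair):
        user, bot = pair
        return len(user or "") + len(bot or "")  # None counts as zero characters

    kept = list(pairs)  # copy so the caller's list is left untouched
    total = sum(pair_len(pair) for pair in kept)
    while kept and total > char_limit:
        total -= pair_len(kept.pop(0))  # remove the oldest pair first
    return kept

# Made-up conversation; the first entry has no user message, like the welcome message.
history = [
    (None, "Hello!"),
    ("What is cuDF?", "A GPU DataFrame library."),
    ("Thanks", "You're welcome."),
]
trimmed = trim_history(history, char_limit=40)
print(trimmed)
```

Counting characters rather than tokens is a rough proxy for the model's context limit, which is why the budget used above (4000 characters) leaves room for the prompt and the retrieved context.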

### Step 10: Start Chatbot
We're finally ready to start our chatbot. Run the cell below to create the Gradio interface and begin interacting with your chatbot!
Note the differences in the responses when enabling the various knowledge bases, and compare that to the same response without using any knowledge bases.

In [None]:
#You may need to explicitly hit the STOP button and try to relaunch your gradio interface when re-running this cell. This is a known Jupyter Notebook environment issue.
chatbot = gr.Chatbot(value = welcome_message)
with gr.Blocks() as demo:
    knowledge_base = gr.CheckboxGroup(label = "Knowledge Base Sources", info= "Choose which sources to use",choices=["Docs", "API Docs", 'Source Code',], type='index', value=['Docs', 'API Docs','Source Code'],  render=False)
    input_box = gr.Textbox(value = "How do I check the size of a data frame?", scale=4, render = False)
    chat = gr.ChatInterface(choose_chat_response,
                    additional_inputs_accordion = gr.Accordion(open=True, label = "Options", render=False),
                    additional_inputs=[knowledge_base],
                    examples = example_questions,
                    textbox = input_box,
                    title = "cuDF RAG Chatbot",
                    chatbot=chatbot,
                    concurrency_limit=1)
    #need to reset chat history when checkbox clicked so that the chatbot doesn't have potential answers from previous time question was asked
    knowledge_base.input(fn=reset, inputs=knowledge_base, outputs=[chatbot, chat.chatbot_state])

try:
    demo.launch(server_name="0.0.0.0", debug=True,  show_api=False)
    demo.close()
except Exception as e:
    demo.close()
    print(e)
    raise e