## Notebook 2: Filling RAG Outputs for Evaluation

In this notebook, we will use the example RAG pipeline to populate the RAG outputs: contexts (retrieved relevant documents) and answer (generated by RAG pipeline).

The example RAG pipeline provided as part of this repository uses LangChain to build a chatbot that references a custom knowledge base.

If you want to learn more about how the example RAG works, please see [03_llama_index_simple.ipynb](../notebooks/03_llama_index_simple.ipynb).

- **Steps 1-5**: Build the RAG pipeline.
- **Step 6**: Build the Query Engine, exposing the Retriever and Generator outputs
- **Step 7**: Fill the RAG outputs 

#### Steps 1-5: Build the RAG pipeline

#### Define the LLM
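This step configures the LLM. The original setup targets a local LLM served by Triton, identified by the server address and gRPC port that Triton listens on. In the cells below that Triton client is commented out and the NVIDIA AI Playground endpoints are used instead.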

***If you are using AI Playground (no local GPU), replace the code in the cell below with the following:***
```
import os
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings
os.environ['NVAPI_KEY'] = "REPLACE_WITH_YOUR_API_KEY"

nv_embedder = NVIDIAEmbeddings(model="nvolveqa_40k")
nv_document_embedder = NVIDIAEmbeddings(model="nvolveqa_40k", model_type="passage")
nv_query_embedder = NVIDIAEmbeddings(model="nvolveqa_40k", model_type="query")
```
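
Hard-coding the key in the notebook makes it easy to leak. As a minimal sketch, assuming the key has already been exported in your shell, a small helper can read it from the environment and fail early with a clear message. `require_api_key` is a hypothetical helper written for this illustration, not part of `langchain_nvidia_ai_endpoints`.

```python
import os

def require_api_key(var_name="NVIDIA_API_KEY"):
    """Return the key stored in the given environment variable, or raise a clear error."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable before running the notebook.")
    return key

# Placeholder value for the demo only; a real key would come from your shell or a secrets manager.
os.environ["NVIDIA_API_KEY"] = "nvapi-placeholder"
api_key = require_api_key()
print("key loaded, length:", len(api_key))
```

Calling the helper at the top of the notebook surfaces a missing key before any embedding or LLM request is made.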

In [None]:
! pip install langchain --upgrade

In [None]:
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader
from langchain.document_loaders import TextLoader
from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.document_loaders import PyPDFLoader
from langchain.document_loaders import UnstructuredFileLoader, Docx2txtLoader
from langchain.vectorstores import FAISS
import pickle
from langchain.embeddings import HuggingFaceEmbeddings
import re
import pandas as pd

In [None]:
import sys
# from triton_trt_llm import TensorRTLLM
# trtllm =TensorRTLLM(server_url="llm:8001", model_name="ensemble", tokens=300)

DOCS_DIR = "../ORAN_kb/"
from transformers import AutoTokenizer, AutoModel
############################################
# Component #2 - Embedding Model and LLM
############################################
os.environ['NVIDIA_API_KEY'] = ""
os.environ['NVAPI_KEY'] = ""
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings

# make sure to export your NVIDIA AI Playground key as NVIDIA_API_KEY!
llm = ChatNVIDIA(model="mixtral_8x7b")
nv_embedder = NVIDIAEmbeddings(model="nvolveqa_40k")
nv_document_embedder = NVIDIAEmbeddings(model="nvolveqa_40k", model_type="passage")
nv_query_embedder = NVIDIAEmbeddings(model="nvolveqa_40k", model_type="query")

model_name = "intfloat/e5-large-v2"
model_kwargs = {"device": "cuda"}
encode_kwargs = {"normalize_embeddings": True}
e5_embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

from langchain.embeddings import HuggingFaceBgeEmbeddings
model_name = "facebook/dragon-plus-context-encoder"
model_kwargs = {'device': 'cuda'}
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
dragon_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
    query_instruction="Represent this sentence for searching relevant passages: "
)
dragon_embeddings.query_instruction = "Represent this sentence for searching relevant passages: "

In [None]:
print(list(llm.available_models))

In [None]:
tokenizer = AutoTokenizer.from_pretrained('facebook/dragon-plus-query-encoder')
query_encoder = AutoModel.from_pretrained('facebook/dragon-plus-query-encoder')
context_encoder = AutoModel.from_pretrained('facebook/dragon-plus-context-encoder')

#### Create a Prompt Template

A [**prompt template**](https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/prompts.html) is a common paradigm in LLM development.

Prompt templates are predefined sets of instructions that are passed to the LLM and guide the output the model produces. They can contain few-shot examples and guidance, and they are a quick way to shape the LLM's responses. Llama 2 accepts the [prompt format](https://huggingface.co/blog/llama2#how-to-prompt-llama-2) shown in `LLAMA_PROMPT_TEMPLATE`, which we construct from:
- The system prompt
- The context
- The user's question
  
LangChain provides the `PromptTemplate` abstraction used below to create prompts; LlamaIndex offers similar abstractions.
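
To make the three slots concrete, here is a minimal sketch that fills a Llama 2 style template with plain `str.format`. The names `TEMPLATE` and `build_prompt` are illustrative only, and the sample strings are placeholders rather than text from the knowledge base.

```python
# Llama 2 chat layout: the system prompt sits inside <<SYS>> tags,
# and the user turn is closed by a single [/INST] marker.
TEMPLATE = (
    "<s>[INST] <<SYS>>\n"
    "{system}\n"
    "<</SYS>>\n\n"
    "Context: {context}\n"
    "Question: {question} [/INST]"
)

def build_prompt(system, context, question):
    """Fill the system prompt, context, and question slots of the template."""
    return TEMPLATE.format(system=system, context=context, question=question)

prompt = build_prompt(
    system="Use the following context to answer the user's question.",
    context="(retrieved passage goes here)",
    question="(user question goes here)",
)
print(prompt)
```

A template object such as LangChain's `PromptTemplate` performs the same substitution while also tracking which variables the template expects.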

In [None]:
# import the relevant libraries
from langchain.prompts import PromptTemplate

LLAMA_PROMPT_TEMPLATE = (
 "<s>[INST] <<SYS>>"
 "Use the following context to answer the user's question. If you don't know the answer, just say that you don't know, don't try to make up an answer."
 "<</SYS>>"
 # the instruction block is opened once above, so it is not reopened here
 "Context: {context} Question: {question} Only return the helpful answer below and nothing else. Helpful answer:[/INST]"
)

LLAMA_PROMPT = PromptTemplate.from_template(LLAMA_PROMPT_TEMPLATE)
llm_llama = ChatNVIDIA(
    model="playground_llama2_70b",
    temperature=0.2,
    max_tokens=300
)

llm_mixtral = ChatNVIDIA(model="mixtral_8x7b")

### Load Documents
Follow step 1 [defined here](../notebooks/05_dataloader.ipynb) to upload the PDFs to the Milvus server.


In the rest of this section, we will load the ORAN PDFs and split them into chunks with the `RecursiveCharacterTextSplitter`.
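
To show what chunk size and chunk overlap mean, here is a simplified sketch that cuts text into fixed-size character windows. It is not the `RecursiveCharacterTextSplitter` algorithm, which additionally tries to split on separators such as paragraph and line breaks. The defaults of 500 and 100 mirror the values passed to the splitter later in this notebook, and `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=100):
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "".join(str(i % 10) for i in range(1200))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
```

The overlap means that a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks, at the cost of storing some text twice.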

In [None]:
# Path to the vector store file
vector_store_path = "vectorstore.pkl"
DOCS_DIR = "../ORAN_kb"
# Load raw documents from the directory
text_loader_kwargs={'autodetect_encoding': True} #loader_kwargs=text_loader_kwargs
raw_txts = DirectoryLoader(DOCS_DIR, glob="**/*.txt", show_progress=True, loader_cls=TextLoader).load()
raw_htmls = DirectoryLoader(DOCS_DIR, glob="**/*.html", show_progress=True, loader_cls=UnstructuredHTMLLoader).load()
raw_pdfs = DirectoryLoader(DOCS_DIR, glob="**/*.pdf", show_progress=True, loader_cls=UnstructuredPDFLoader).load()
raw_docs = DirectoryLoader(DOCS_DIR, glob="**/*.docx", show_progress=True, loader_cls=Docx2txtLoader).load()

#### Generate and Store Embeddings
##### a) Generate Embeddings 
[Embeddings](https://python.langchain.com/docs/modules/data_connection/text_embedding/) for documents are created by vectorizing the document text; this vectorization captures the semantic meaning of the text. 

We will use [intfloat/e5-large-v2](https://huggingface.co/intfloat/e5-large-v2), the DRAGON+ encoder, or the NVIDIA retriever embedding model (`nvolveqa_40k`) for the embeddings.
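
As a toy illustration of why the embedding models above are configured with `normalize_embeddings` set to `True`: once vectors have unit length, their dot product equals their cosine similarity. The three-dimensional vectors below are made up for the example; real embedding models return vectors with hundreds of dimensions.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Made-up "embeddings": `close` points almost the same way as `query`, `far` does not.
query = normalize([1.0, 2.0, 0.0])
close = normalize([2.0, 4.0, 0.1])
far = normalize([0.0, 0.1, 3.0])

print(round(dot(query, close), 3), round(dot(query, far), 3))
```

Retrieval ranks chunks by this kind of score, so a chunk whose vector points in nearly the same direction as the query vector is returned first.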

In [None]:
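from nltk.tokenize import RegexpTokenizer  # used by word_count and truncate below

def remove_line_break(text):
    text = text.replace("\n", " ").strip()
    text = re.sub(r"\.\.+", "", text)  # drop runs of two or more dots
    text = re.sub(" +", " ", text)  # collapse repeated spaces
    return text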
    
def remove_two_points(text):
    text = text.replace("..","")
    return text
    
def remove_two_slashes(text):
    # despite the name, this strips double underscores ("__")
    text = text.replace("__","")
    return text

def just_letters(text):
    return re.sub(r"[^a-z]+", "", text).strip()

def remove_non_english_letters(text):
    return re.sub(r"[^\x00-\x7F]+", "", text)

def langchain_length_function(text):
    return len(just_letters(remove_line_break(text)))

def word_count(text):
    text = remove_line_break(text)
    # tokenize directly: just_letters would remove the spaces and merge all words into one token
    tokenizer = RegexpTokenizer(r"[A-Za-z]\w+")
    tokenized_text = tokenizer.tokenize(text)
    return len(tokenized_text)

def truncate(text, max_word_count=1530):
    tokenizer = RegexpTokenizer(r"[A-Za-z]\w+")
    tokenized_text = tokenizer.tokenize(text)
    return " ".join(tokenized_text[:max_word_count])

def strip_non_ascii(string):
    ''' Returns the string without non ASCII characters'''
    stripped = (c for c in string if 0 < ord(c) < 127)
    return ''.join(stripped)

def generate_questions(chunk, triton_client):
    triton_client.predict_streaming(chunk)
    chat_history = ""
    response_streaming_list = []
    while True:
        try:
            response_streaming = triton_client.user_data._completed_requests.get(block=True)
            
        except Exception:
            triton_client.close_streaming()
            break
    
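        # InferenceServerException is provided by tritonclient.utils and must be imported before this function is called
        if isinstance(response_streaming, InferenceServerException):
            print("err")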
            triton_client.close_streaming()
            break
    
        if response_streaming is None:
            triton_client.close_streaming()
            break
    
        else:
            response_streaming_list.append(response_streaming)
            chat_history = triton_client.prepare_outputs(response_streaming_list)
            yield chat_history
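
The cleaning helpers in the cell above are easiest to understand on a small input. This sketch copies three of them and runs them on a made-up line that resembles a table-of-contents entry. Note that `just_letters` keeps only lowercase ASCII letters, so `langchain_length_function` counts lowercase letters rather than raw characters.

```python
import re

def remove_line_break(text):
    text = text.replace("\n", " ").strip()
    text = re.sub(r"\.\.+", "", text)  # drop runs of two or more dots
    text = re.sub(" +", " ", text)     # collapse repeated spaces
    return text

def just_letters(text):
    # keeps only the characters a-z; uppercase letters, digits, and spaces are removed
    return re.sub(r"[^a-z]+", "", text).strip()

def langchain_length_function(text):
    return len(just_letters(remove_line_break(text)))

line = "Section 1 ..... 5\nO-RAN   spec"
cleaned = remove_line_break(line)
length = langchain_length_function(line)
print(repr(cleaned), length)
```

Because this function is passed to the splitter as `length_function`, a `chunk_size` of 500 corresponds to roughly 500 lowercase letters, so the resulting chunks are longer than 500 raw characters.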

In [None]:
# Check for existing vector store file
vector_store_exists = os.path.exists(vector_store_path)
vectorstore = None
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100, length_function=langchain_length_function, )

all_documents=raw_pdfs+raw_docs

# split all the docs
docs = text_splitter.split_documents(all_documents)
print(len(docs))
print(docs[0])

In [None]:
documents = docs
# remove short chunks (fewer than 200 characters)
documents = [item for item in documents if len(item.page_content) >= 200]
# list the source files that remain after filtering
print(pd.DataFrame([doc.metadata for doc in documents])['source'].unique())
# clean every chunk, including the last one: line breaks, double dots, double underscores
for doc in documents:
    doc.page_content = remove_line_break(doc.page_content)
    doc.page_content = remove_two_points(doc.page_content)
    doc.page_content = remove_two_slashes(doc.page_content)
    # second pass, in case removing underscores created new double dots
    doc.page_content = remove_two_points(doc.page_content)
[(len(item.page_content), item.page_content) for item in documents]

##### b) Store Embeddings 

We will store the embeddings in FAISS vector stores built with LangChain's `FAISS.from_documents`, one store per embedding model.

In this example, each vector store is also pickled to disk so that it can be reloaded later without embedding the documents again.
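
As a rough picture of what storing and reloading embeddings involves, here is a minimal sketch of a brute-force in-memory index that is saved and restored with `pickle`. It is a stand-in for illustration only and not how FAISS works internally; the helper names and the two-dimensional vectors are made up.

```python
import math
import os
import pickle
import tempfile

def unit(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def add(store, vector, text):
    store["vectors"].append(unit(vector))
    store["texts"].append(text)

def search(store, query, k=3):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    q = unit(query)
    scored = [
        (sum(a * b for a, b in zip(q, v)), t)
        for v, t in zip(store["vectors"], store["texts"])
    ]
    return sorted(scored, reverse=True)[:k]

store = {"vectors": [], "texts": []}
add(store, [1.0, 0.0], "chunk about fronthaul")
add(store, [0.0, 1.0], "chunk about licensing")

# Persist and reload with context managers so the files are always closed.
path = os.path.join(tempfile.mkdtemp(), "tiny_store.pkl")
with open(path, "wb") as f:
    pickle.dump(store, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

print(search(restored, [0.9, 0.1], k=1)[0][1])
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.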

In [None]:
docs = documents
print(len(docs))
vectorstore_e5 = FAISS.from_documents(docs, e5_embeddings)
print("Done e5")
vectorstore_dragon = FAISS.from_documents(docs, dragon_embeddings)
print("Done DRAGON")
# vectorstore_nvolve = FAISS.from_documents(docs, embedding=NVIDIAEmbeddings(model="nvolveqa_40k"))
# vectorstore_nvolve = FAISS.from_documents(docs, nv_embedder)
# print("Done Nvovlve")
vector_store_path = "vectorstore_e5.pkl"
with open(vector_store_path, "wb") as f:
    pickle.dump(vectorstore_e5, f)
vector_store_path = "vectorstore_dragon.pkl"
with open(vector_store_path, "wb") as f:
    pickle.dump(vectorstore_dragon, f)
    # pickle.dump(vectorstore_nvolve, f)

with open("./vectorstore_e5.pkl", 'rb') as file:
    vectorstore_e5 = pickle.load(file)
with open("./vectorstore_dragon.pkl", 'rb') as file:
    vectorstore_dragon = pickle.load(file)

print(len(vectorstore_e5.docstore._dict))
print(len(vectorstore_dragon.docstore._dict))

In [None]:
#nvidia embeddings
docs = documents
vector_store_path = "vectorstore_nv.pkl"
vectorstore_nvolve = FAISS.from_documents(docs, nv_embedder)
print(len(vectorstore_nvolve.docstore._dict))
with open(vector_store_path, "wb") as f:
    pickle.dump(vectorstore_nvolve, f)

# file = open("./vectorstore_nv.pkl",'rb')
# vectorstore_nvolve = pickle.load(file)
# file.close()

In [None]:
print(len(vectorstore_nvolve.docstore._dict))

### Step 6: Build the Query Engine, exposing the Retriever and Generator outputs


In [None]:
import sys
from langchain_core.runnables import RunnablePassthrough
import os
os.environ['NVIDIA_API_KEY'] = ""
os.environ['NVAPI_KEY'] = ""
os.environ['NGC_API_KEY'] = ""

llm = ChatNVIDIA(model="playground_llama2_70b")
# print(llm.get_model_details)
result = llm.invoke("Write a ballad about LangChain.")
print(result.content)

# retriever = vectorstore_nvolve.as_retriever()
# prompt = ChatPromptTemplate.from_messages(
#     [
#         (
#             "system",
#             "Answer solely based on the following context:\n<Documents>\n{context}\n</Documents>",
#         ),
#         ("user", "{question}"),
#     ]
# )

# chain = (
#     {"context": retriever, "question": RunnablePassthrough()}
#     | prompt
#     | llm
#     | StrOutputParser()
# )

# chain.invoke("where did harrison work?")


In [None]:
############################################
# Component #4 - LLM Response Generation and Chat
############################################
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

prompt_template = ChatPromptTemplate.from_messages(
    [("system", "You are a helpful AI assistant named Envie. You will reply to questions only based on the context that you are provided. If something is out of context, you will refrain from replying and politely decline to respond to the user."), ("user", "{input}")]
)
user_input = "What are the restrictions on using the material from the O-RAN ALLIANCE e.V. specification?"
chain = prompt_template | llm | StrOutputParser()
print(list(llm.available_models))
print(llm.model)
if user_input and vectorstore_e5 is not None:
    retriever = vectorstore_e5.as_retriever(search_kwargs={"k": 3, "score_threshold": 0.5})
    docs = retriever.get_relevant_documents(user_input)
    
    context = ""
    for doc in docs:
        context += doc.page_content + "\n\n"
    
    augmented_user_input = "Context: " + context + "\n\nQuestion: " + user_input + "\n"

    full_response = ""

    for response in chain.stream({"input": augmented_user_input}):
        full_response += response

print(full_response)

# from langchain.chains import RetrievalQA
# import time

# qa_chain = RetrievalQA.from_chain_type(
#     llm,
#     retriever=vectorstore_nvolve.as_retriever(search_kwargs={"k": 1, "score_threshold": 0.8}),
#     chain_type_kwargs={"prompt": LLAMA_PROMPT}
# )
# start_time = time.time()
# result = qa_chain({"query": user_input})
# print(result)
print("\n")
print(context)
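
The string handling in the cell above can be isolated into a small function, which makes it easy to see exactly what the LLM receives. This sketch mirrors the notebook's layout; `build_augmented_input` is a hypothetical name and the passages are placeholders.

```python
def build_augmented_input(passages, question):
    """Join retrieved passages and append the question, using the notebook's string layout."""
    context = "".join(p + "\n\n" for p in passages)
    return "Context: " + context + "\n\nQuestion: " + question + "\n"

passages = ["First retrieved chunk.", "Second retrieved chunk."]
text = build_augmented_input(passages, "What does the specification restrict?")
print(text)

# With no retrieved passages the model receives the question without any context.
no_context = build_augmented_input([], "What does the specification restrict?")
print(repr(no_context))
```

If the retriever returns no documents, the context section is empty, so it can be worth checking the number of retrieved passages before calling the LLM.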

### Step 7: Fill the RAG outputs 

Let's now query the RAG pipeline and fill in the `contexts` and `answer` outputs in the evaluation JSON file.

First, we need to load the previously generated dataset. So far, the RAG output fields are empty.


In [None]:
# import the relevant libraries
import json
from IPython.display import JSON
# load the evaluation data
with open('syn_data_oran.json') as f:
    syn_data = json.load(f)
print(len(syn_data))
# show a sample element (index 1, i.e. the second entry)
JSON(syn_data[1])

We can now query the RAG pipeline and populate the `contexts` and `answer` fields.
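
The cells that follow repeat the same loop for each retriever. As a sketch of the pattern, assuming each entry has a `question` key, the loop can be written once as a function that takes the retriever and the generator as callables. `fill_rag_outputs` and the two stubs are hypothetical; in the notebook the callables would wrap the vector store retriever and the LLM chain.

```python
import copy

def fill_rag_outputs(entries, retrieve, generate):
    """Return a filled copy of entries; retrieve and generate are caller-supplied callables."""
    filled = copy.deepcopy(entries)  # leave the input list untouched
    for entry in filled:
        passages = retrieve(entry["question"])
        context = "".join(p + "\n\n" for p in passages)
        prompt = "Context: " + context + "\n\nQuestion: " + entry["question"] + "\n"
        entry["answer"] = generate(prompt)
        entry["contexts"] = list(passages)
    return filled

# Stubs standing in for the vector store retriever and the LLM chain.
def fake_retrieve(question):
    return ["passage A", "passage B"]

def fake_generate(prompt):
    return "stub answer"

data = [{"question": "q1", "answer": "", "contexts": []}]
result = fill_rag_outputs(data, fake_retrieve, fake_generate)
print(result[0]["answer"], result[0]["contexts"], data[0]["contexts"])
```

Working on a deep copy keeps the loaded dataset unchanged, so the results for one retriever cannot leak into the run for the next one.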

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

prompt_template = ChatPromptTemplate.from_messages(
    [("system", "You are a helpful and friendly intelligent AI assistant bot named ORAN Chatbot, deployed by the Artificial Intelligence Solutions Architecture and Engineering team at NVIDIA. The context given below will provide some documentation as well as ORAN specifications. Based on this context, answer the following question related to ORAN standards and specifications. If the question is not related to this, please refrain from answering."), ("user", "{input}")]
)

chain = prompt_template | llm | StrOutputParser()
## Using e5 Retriever on Synthetic data
for count,entry in enumerate(syn_data):
    user_input = entry['question']
    print(count, user_input)
    retriever = vectorstore_e5.as_retriever(search_kwargs={"k": 5, "score_threshold": 0.8})
    docs = retriever.get_relevant_documents(user_input)
    context = ""
    cons = []
    for doc in docs:
        context += doc.page_content + "\n\n"
        cons.append(doc.page_content)
#     print(context)
    augmented_user_input = "Context: " + context + "\n\nQuestion: " + user_input + "\n"
    full_response = ""
    for response in chain.stream({"input": augmented_user_input}):
        full_response += response   
    # result = qa_chain({"query": user_input})
    # entry["answer"] = result['result']
    entry["answer"] = full_response
#     print(entry["answer"])
    entry["contexts"] = cons #[context]
    
print(syn_data[100])
import json
with open('eval_syn_QA_E5.json', 'w') as f:
    json.dump(syn_data, f)

In [None]:
## Using dragon Retriever on Synthetic data
for count,entry in enumerate(syn_data):
    user_input = entry['question']
    print(count, user_input)
    retriever = vectorstore_dragon.as_retriever(search_kwargs={"k": 5, "score_threshold": 0.8})
    docs = retriever.get_relevant_documents(user_input)
    context = ""
    cons = []
    for doc in docs:
        context += doc.page_content + "\n\n"
        cons.append(doc.page_content)
#     print(context)
    augmented_user_input = "Context: " + context + "\n\nQuestion: " + user_input + "\n"
    full_response = ""
    for response in chain.stream({"input": augmented_user_input}):
        full_response += response   
    # result = qa_chain({"query": user_input})
    # entry["answer"] = result['result']
    entry["answer"] = full_response
#     print(entry["answer"])
    entry["contexts"] = cons #[context]
    
print(syn_data[100])
import json
with open('eval_syn_QA_Dragon.json', 'w') as f:
    json.dump(syn_data, f)

In [None]:
## Using Nvolve Retriever on Synthetic data
for count,entry in enumerate(syn_data):
    user_input = entry['question']
    print(count, user_input)
    retriever = vectorstore_nvolve.as_retriever(search_kwargs={"k": 5, "score_threshold": 0.8})
    docs = retriever.get_relevant_documents(user_input)
    context = ""
    cons = []
    for doc in docs:
        context += doc.page_content + "\n\n"
        cons.append(doc.page_content)
    print("\n", context)
    augmented_user_input = "Context: " + context + "\n\nQuestion: " + user_input + "\n"
    full_response = ""
    for response in chain.stream({"input": augmented_user_input}):
        full_response += response   
    # result = qa_chain({"query": user_input})
    # entry["answer"] = result['result']
    entry["answer"] = full_response
#     print(entry["answer"])
    entry["contexts"] = cons #[context]
    
print(syn_data[100])
import json
with open('eval_syn_QA_Nvolve.json', 'w') as f:
    json.dump(syn_data, f)
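
Before passing the generated files to an evaluation step, it can help to check how many entries were actually filled. An empty `contexts` list can occur when the retriever returns no documents for a question. The sketch below uses a hypothetical helper, `summarize_outputs`, and runs it on a throwaway file with made-up entries; in practice you would point it at one of the `eval_syn_QA_*.json` files written above.

```python
import json
import os
import tempfile

def summarize_outputs(path):
    """Count entries whose answer or contexts are still empty in a filled evaluation file."""
    with open(path) as f:
        entries = json.load(f)
    return {
        "total": len(entries),
        "empty_answers": sum(1 for e in entries if not e.get("answer")),
        "empty_contexts": sum(1 for e in entries if not e.get("contexts")),
    }

# Demo on a throwaway file with made-up entries.
demo = [
    {"question": "q1", "answer": "a1", "contexts": ["c1"]},
    {"question": "q2", "answer": "a2", "contexts": []},
]
demo_path = os.path.join(tempfile.mkdtemp(), "demo_eval.json")
with open(demo_path, "w") as f:
    json.dump(demo, f)

summary = summarize_outputs(demo_path)
print(summary)
```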