## Notebook 2: Filling RAG Outputs for Evaluation

In this notebook, we will use the example RAG pipeline to populate the RAG outputs: `contexts` (the retrieved relevant documents) and `answer` (generated by the RAG pipeline).

The example RAG pipeline provided as part of this repository uses [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/) to build a chatbot that references a custom knowledge base. 

If you want to learn more about how the example RAG works, please see [03_llama_index_simple.ipynb](../notebooks/03_llama_index_simple.ipynb).

- **Steps 1-5**: Build the RAG pipeline.
- **Step 6**: Build the Query Engine, exposing the Retriever and Generator outputs.
- **Step 7**: Fill the RAG outputs.

### Steps 1-5: Build the RAG pipeline

#### Define the LLM
Here we use a local LLM served by Triton. The `server_url` argument is the address and gRPC port on which Triton is available.

***If you are using AI Playground (no local GPU), replace the code in the cell two cells below with the following:***

```
import os
from nv_aiplay import GeneralLLM
os.environ['NVAPI_KEY'] = "REPLACE_WITH_YOUR_API_KEY"

llm = GeneralLLM(
    model="llama2_70b",
    temperature=0.2,
    max_tokens=300
)
```

In [None]:
%%capture
!test -d dataset || unzip dataset.zip

In [None]:
from triton_trt_llm import TensorRTLLM
from llama_index.llms import LangChainLLM
trtllm = TensorRTLLM(server_url="llm:8001", model_name="ensemble", tokens=300)
llm = LangChainLLM(llm=trtllm)

#### Create a Prompt Template

A [**prompt template**](https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/prompts.html) is a common paradigm in LLM development.

Prompt templates are pre-defined sets of instructions that are provided to the LLM and guide the output the model produces. They can contain few-shot examples and guidance, and they are a quick way to engineer the responses from the LLM. Llama 2 accepts the [prompt format](https://huggingface.co/blog/llama2#how-to-prompt-llama-2) shown in `LLAMA_PROMPT_TEMPLATE`, which we construct from:
- The system prompt
- The context
- The user's question
  
Much like LangChain, LlamaIndex provides abstractions for creating prompts.

In [None]:
# import the relevant libraries
from llama_index import Prompt

LLAMA_PROMPT_TEMPLATE = (
 "<s>[INST] <<SYS>>"
 "Use the following context to answer the user's question. If you don't know the answer, just say that you don't know, don't try to make up an answer."
 "<</SYS>>"
 "<s>[INST] Context: {context_str} Question: {query_str} Only return the helpful answer below and nothing else. Helpful answer:[/INST]"
)

qa_template = Prompt(LLAMA_PROMPT_TEMPLATE)
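
To see what the model eventually receives, the two placeholders can be filled by hand. The sketch below uses plain `str.format` instead of LlamaIndex, and the context and question strings are made-up examples, so treat it as an illustration of the template mechanics rather than of how the query engine formats prompts internally.

```python
# Same template string as above, repeated so this sketch is self-contained.
LLAMA_PROMPT_TEMPLATE = (
    "<s>[INST] <<SYS>>"
    "Use the following context to answer the user's question. If you don't know the answer, just say that you don't know, don't try to make up an answer."
    "<</SYS>>"
    "<s>[INST] Context: {context_str} Question: {query_str} Only return the helpful answer below and nothing else. Helpful answer:[/INST]"
)

# Hypothetical values; in the pipeline these come from the retriever and the user.
example_context = "Milvus stores the document embeddings."
example_question = "Where are the embeddings stored?"

# Fill the two placeholders and inspect the final prompt string
filled_prompt = LLAMA_PROMPT_TEMPLATE.format(
    context_str=example_context,
    query_str=example_question,
)
print(filled_prompt)
```

Printing the filled prompt is a quick way to check that the system prompt, the context, and the question end up where you expect them.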

#### Load Documents
Follow step 1 [defined here](../notebooks/05_dataloader.ipynb) to upload the PDFs to the Milvus server.


In the rest of this section, we will load and split the PDFs of NVIDIA blogs using the `SentenceTransformersTokenTextSplitter`.
Additionally, we use a LlamaIndex [``PromptHelper``](https://gpt-index.readthedocs.io/en/latest/api_reference/service_context/prompt_helper.html) to help deal with LLM context window token limitations. 

In [None]:
# import the relevant libraries
from langchain.text_splitter import SentenceTransformersTokenTextSplitter
from llama_index.node_parser import LangchainNodeParser
from llama_index import PromptHelper

# setup the text splitter
TEXT_SPLITTER_MODEL = "intfloat/e5-large-v2"
TEXT_SPLITTER_TOKENS_PER_CHUNK = 510
TEXT_SPLITTER_CHUNK_OVERLAP = 200

text_splitter = SentenceTransformersTokenTextSplitter(
    model_name=TEXT_SPLITTER_MODEL,
    tokens_per_chunk=TEXT_SPLITTER_TOKENS_PER_CHUNK,
    chunk_overlap=TEXT_SPLITTER_CHUNK_OVERLAP,
)
)

node_parser = LangchainNodeParser(text_splitter)


# Use the PromptHelper

prompt_helper = PromptHelper(
  context_window=4096,
  num_output=256,
  chunk_overlap_ratio=0.1,
  chunk_size_limit=None
)
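
To build intuition for the `tokens_per_chunk=510` and `chunk_overlap=200` settings, here is a simplified sliding-window sketch in plain Python. It is not the library's implementation: `split_with_overlap` is a hypothetical helper, and integers stand in for the tokens that the real splitter obtains from the model's tokenizer.

```python
def split_with_overlap(tokens, tokens_per_chunk, chunk_overlap):
    """Split a token list into overlapping windows (simplified sketch)."""
    if chunk_overlap >= tokens_per_chunk:
        raise ValueError("chunk_overlap must be smaller than tokens_per_chunk")
    stride = tokens_per_chunk - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + tokens_per_chunk])
        if start + tokens_per_chunk >= len(tokens):
            break  # the last window already reaches the end of the text
        start += stride
    return chunks


# Integers stand in for tokens; the real splitter uses the model's tokenizer.
toy_tokens = list(range(1000))
toy_chunks = split_with_overlap(toy_tokens, tokens_per_chunk=510, chunk_overlap=200)

# Print the first token, last token, and length of each chunk
for chunk in toy_chunks:
    print(chunk[0], chunk[-1], len(chunk))
```

With these settings, each chunk starts 310 tokens after the previous one, so consecutive chunks share 200 tokens. The overlap reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary.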

#### Generate and Store Embeddings
##### a) Generate Embeddings 
[Embeddings](https://python.langchain.com/docs/modules/data_connection/text_embedding/) for documents are created by vectorizing the document text; this vectorization captures the semantic meaning of the text. 

We will use [intfloat/e5-large-v2](https://huggingface.co/intfloat/e5-large-v2) for the embeddings.

In [None]:
# import the relevant libraries
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index.embeddings import LangchainEmbedding

# The embedding model runs on the first GPU ("cuda:0") here.
# Set "device" to "cpu" instead if you want to conserve GPU memory.
# In the production deployment (the API server shown in the 5th notebook), the model runs on GPU.
model_name="intfloat/e5-large-v2"
model_kwargs = {"device": "cuda:0"}
encode_kwargs = {"normalize_embeddings": False}
hf_embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)
# Load in a specific embedding model
embed_model = LangchainEmbedding(hf_embeddings)
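
As a rough illustration of what "capturing semantic meaning" buys us, the sketch below compares toy vectors with cosine similarity. The three-dimensional vectors and their names are invented for this example (the real model produces 1024-dimensional embeddings), and cosine similarity is only one common choice of metric, assumed here for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings"; e5-large-v2 produces 1024 dimensions.
gpu_doc = [0.9, 0.1, 0.0]
gpu_query = [0.8, 0.2, 0.1]
cooking_doc = [0.0, 0.2, 0.9]

# The query should score higher against the related document
print(cosine_similarity(gpu_query, gpu_doc))
print(cosine_similarity(gpu_query, cooking_doc))
```

Texts with related meaning map to vectors that point in similar directions, which is what lets the retriever find relevant chunks without exact keyword matches.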

##### b) Store Embeddings 

We will use the LlamaIndex module [`ServiceContext`](https://gpt-index.readthedocs.io/en/latest/core_modules/supporting_modules/service_context.html) to bundle commonly used resources during the indexing and querying stage. 

In this example, we bundle the build resources: the LLM, the embedding model, the node parser, and the prompt helper.   

In [None]:
# import the relevant libraries
from llama_index import ServiceContext

# bundle the build resources
service_context = ServiceContext.from_defaults(
  llm=llm,
  embed_model=embed_model,
  node_parser=node_parser,
  prompt_helper=prompt_helper
)

Set the service context globally to avoid passing it to every LLM call.

In [None]:
from llama_index import set_global_service_context
set_global_service_context(service_context)

Ingest the dataset using the /uploadDocument endpoint in the chain-server.

In [None]:
import os
import requests
import mimetypes

def upload_document(file_path, url):
    """POST a single file to the upload endpoint and return the response body."""
    headers = {
        'accept': 'application/json'
    }
    mime_type, _ = mimetypes.guess_type(file_path)
    # Open the file in a context manager so the handle is always closed
    with open(file_path, 'rb') as f:
        files = {
            'file': (file_path, f, mime_type)
        }
        response = requests.post(url, headers=headers, files=files)

    return response.text

def upload_pdf_files(folder_path, upload_url):
    """Upload every PDF found directly inside folder_path."""
    for file_name in os.listdir(folder_path):
        _, ext = os.path.splitext(file_name)
        # Ingest only pdf files
        if ext.lower() == ".pdf":
            file_path = os.path.join(folder_path, file_name)
            print(upload_document(file_path, upload_url))

In [None]:
import time

start_time = time.time()
upload_pdf_files("dataset", "http://query:8081/uploadDocument")
print(f"--- {time.time() - start_time} seconds ---")

<div class="alert alert-block alert-info">
    
⚠️ In the deployment of this workflow, [Milvus](https://milvus.io/) is running as a vector database microservice.
</div>

In [None]:
# import the relevant libraries
from llama_index import VectorStoreIndex
from llama_index.storage.storage_context import StorageContext
from llama_index.vector_stores import MilvusVectorStore

# store
vector_store = MilvusVectorStore(uri="http://milvus:19530",
    dim=1024,
    collection_name="document_store_ivfflat",
    index_config={"index_type": "IVF_FLAT", "nlist": 64},
    search_config={"nprobe": 16},
    overwrite=False
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_vector_store(vector_store)

### Step 6: Build the Query Engine, exposing the Retriever and Generator outputs

#### a) Limit the Retriever Total Output Length

First, we need to restrict the output of the Retriever to a reasonable length so that the prompt can fit the context length of the LLM.
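In this notebook, we will restrict it to 1000 tokens in total: retrieved nodes are kept in order until adding the next node would exceed 1000 tokens, and the remaining nodes are ignored.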


In [None]:
# import the relevant libraries
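from typing import List, Optional
from llama_index.postprocessor.types import BaseNodePostprocessor
from llama_index.schema import MetadataMode  # used in _postprocess_nodes below
from llama_index.utils import globals_helper

# maximum total number of tokens kept from the retrieved nodes
DEFAULT_MAX_CONTEXT = 1000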

# limit the Retriever total outputs length
class LimitRetrievedNodesLength(BaseNodePostprocessor):
    """Llama Index chain filter to limit token lengths."""

    def _postprocess_nodes(
        self, nodes: List["NodeWithScore"], query_bundle: Optional["QueryBundle"] = None
    ) -> List["NodeWithScore"]:
        """Filter function."""
        included_nodes = []
        current_length = 0
        limit = DEFAULT_MAX_CONTEXT

        for node in nodes:
            current_length += len(
                globals_helper.tokenizer(
                    node.node.get_content(metadata_mode=MetadataMode.LLM)
                )
            )
            if current_length > limit:
                break
            included_nodes.append(node)

        return included_nodes
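
The same cut-off logic can be tried without LlamaIndex. In the sketch below, `limit_total_length` is a hypothetical stand-in for the postprocessor: it works on plain strings and counts whitespace-separated words instead of tokenizer tokens, which is an assumption made only to keep the example self-contained.

```python
def limit_total_length(texts, max_tokens):
    """Keep leading texts until adding the next one would exceed max_tokens."""
    kept = []
    total = 0
    for text in texts:
        total += len(text.split())  # whitespace tokens: an assumption for this sketch
        if total > max_tokens:
            break
        kept.append(text)
    return kept


# Hypothetical retrieved chunks of 4, 3 and 5 whitespace tokens.
toy_texts = [
    "GPUs accelerate deep learning",
    "Triton serves models",
    "Milvus is a vector database",
]
print(limit_total_length(toy_texts, max_tokens=8))
```

Note one consequence of this design: if the first retrieved chunk alone exceeds the limit, nothing is kept, so the limit should be comfortably larger than a single chunk.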


#### b) Build the Query Engine

Now, let's build the query engine that takes a query and returns a response. Each vector index has a default corresponding query engine; for example, the default query engine for a vector index performs a standard top-k retrieval over the vector store.
We will use `RetrieverQueryEngine` to get the outputs of both the Retriever and the Generator. Learn more about the `RetrieverQueryEngine` in the [documentation](https://gpt-index.readthedocs.io/en/latest/examples/query_engine/CustomRetrievers.html).

 

In [None]:
# import the relevant libraries
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.schema import MetadataMode

# Expose the retriever
retriever = index.as_retriever(similarity_top_k=2)

query_engine = RetrieverQueryEngine.from_args(
    retriever,
    text_qa_template=qa_template,
    node_postprocessors=[LimitRetrievedNodesLength()]
)
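
Conceptually, `similarity_top_k=2` asks the retriever for the two stored chunks whose embeddings are closest to the query embedding. The brute-force sketch below illustrates that idea on invented data. The `top_k` helper, the document ids, and the use of the inner product as the score are all assumptions for illustration; the vector database may use a different metric and an approximate index rather than an exhaustive scan.

```python
import heapq


def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents with the highest inner product."""
    def score(item):
        _, vec = item
        return sum(q * d for q, d in zip(query_vec, vec))

    best = heapq.nlargest(k, doc_vecs.items(), key=score)
    return [doc_id for doc_id, _ in best]


# Hypothetical document ids and 2-dimensional vectors.
toy_index = {
    "blog_a": [1.0, 0.0],
    "blog_b": [0.7, 0.7],
    "blog_c": [0.0, 1.0],
}
print(top_k([1.0, 0.2], toy_index, k=2))
```

A larger `k` gives the Generator more context to work with, at the cost of a longer prompt, which is why the retrieved nodes are also passed through the length-limiting postprocessor.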

### Step 7: Fill the RAG outputs 

Let's now query the RAG pipeline and fill the `contexts` and `answer` outputs in the evaluation JSON file.

First, we need to load the previously generated dataset. So far, the RAG output fields are empty.


In [None]:
# import the relevant libraries
import json
from IPython.display import JSON

# load the evaluation data
with open('qa_generation.json') as f:
    data = json.load(f)

# show the first element
JSON(data[0])

Let's now query the RAG pipeline and populate the `contexts` and `answer` fields.

In [None]:
# Reuse a single postprocessor instance for every question
limited_retrieval_length = LimitRetrievedNodesLength()

for entry in data:
    # Generator output: the answer produced by the RAG pipeline
    response = query_engine.query(entry["question"])
    entry["answer"] = response.response
    print(entry["answer"])

    # Retriever output: the length-limited nodes, joined into one context string
    retrieved_text = ""
    nodes = retriever.retrieve(entry["question"])
    included_nodes = limited_retrieval_length.postprocess_nodes(nodes)
    for node in included_nodes:
        retrieved_text = retrieved_text + " " + node.text
    entry["contexts"] = [retrieved_text]

In [None]:
# json_list_string=json.dumps(data)

# show again the first element
JSON(data[0])

Let's now save the new evaluation dataset.

In [None]:
import json
with open('eval.json', 'w') as f:
    json.dump(data, f)
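
Before moving on to evaluation, it can be worth checking that every entry actually received its RAG outputs. The helper below is a minimal sketch: `find_unfilled` is a hypothetical function, and it assumes only the `question`, `answer`, and `contexts` field names used in this notebook.

```python
def find_unfilled(entries):
    """Return the indices of entries whose RAG outputs are missing or empty."""
    unfilled = []
    for i, entry in enumerate(entries):
        answer = entry.get("answer")
        contexts = entry.get("contexts")
        has_answer = isinstance(answer, str) and answer.strip() != ""
        has_contexts = isinstance(contexts, list) and any(
            isinstance(c, str) and c.strip() for c in contexts
        )
        if not (has_answer and has_contexts):
            unfilled.append(i)
    return unfilled


# Hypothetical entries using only the field names seen in this notebook.
toy_entries = [
    {"question": "q1", "answer": "a1", "contexts": ["some context"]},
    {"question": "q2", "answer": "", "contexts": [" "]},
]
print(find_unfilled(toy_entries))
```

On the real dataset this would be called as `find_unfilled(data)`; an empty list means every question has both an answer and at least one non-empty context.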

In the next notebook, we will evaluate the [Corp Comms Copilot](https://gitlab-master.nvidia.com/chat-labs/rag-demos/corp-comms-copilot) RAG pipeline.