# RAG-a-Thon Sample Application

A RAG-a-Thon is a hackathon focused on RAG (Retrieval-Augmented Generation) use cases. A RAG application is complex because it combines several components (data chunking, embedding, vector storage, retrieval, and text generation) with the infrastructure that runs them. Tuning each of these stages is essential for making the application effective in real-world use. Below is a typical flow diagram for a RAG application:


In [1]:
import torch
import time
from pathlib import Path

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Milvus
from langchain.document_loaders import UnstructuredFileLoader
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

## Document Processing and Chunking

This code snippet extracts text from PDF documents and splits it into chunks sized for the embedding model. It first names the embedding model, `intfloat/e5-large-v2`, and sets the chunking parameters: 100 tokens per chunk with an overlap of 20 tokens between consecutive chunks. It then collects every PDF file in the `./docs` directory.

For each document, an `UnstructuredFileLoader` reads the content, and a `SentenceTransformersTokenTextSplitter` configured with the embedding model and the chunking parameters splits the text into segments that respect the token limit and overlap. Finally, the script reports the total execution time for the whole operation.

In [2]:
embedding_model_name = "intfloat/e5-large-v2"
tokens_per_chunk = 100
chunk_overlap = 20

documents_path = "./docs"
documents = list(Path(documents_path).glob("*.pdf"))

document_chunks = []

start_time = time.time()
for document in documents:
    loader = UnstructuredFileLoader(document.as_posix())
    data = loader.load()

    text_splitter = SentenceTransformersTokenTextSplitter(
        model_name=embedding_model_name,
        tokens_per_chunk=tokens_per_chunk,
        chunk_overlap=chunk_overlap,
    )
    document_chunks += text_splitter.split_documents(data)

print(f"Extracting data from documents and chunking completed. Executed in {time.time() - start_time} seconds") 


Extracting data from documents and chunking completed. Executed in 343.1030306816101 seconds
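
To make the two chunking parameters concrete, the sketch below applies the same sliding-window idea to a plain list of tokens. It is an illustration under simplifying assumptions, not the splitter's actual code: `sliding_window_chunks` is a hypothetical helper, and the tokens are placeholder strings rather than the output of the embedding model's tokenizer. With 100 tokens per chunk and an overlap of 20, each new chunk starts 80 tokens after the previous one.

```python
from typing import List, Sequence


def sliding_window_chunks(
    tokens: Sequence[str],
    tokens_per_chunk: int = 100,
    chunk_overlap: int = 20,
) -> List[List[str]]:
    """Split tokens into windows that share `chunk_overlap` tokens (hypothetical helper)."""
    if tokens_per_chunk <= 0:
        raise ValueError("tokens_per_chunk must be positive")
    if not 0 <= chunk_overlap < tokens_per_chunk:
        raise ValueError("chunk_overlap must be in [0, tokens_per_chunk)")

    stride = tokens_per_chunk - chunk_overlap  # 80 with the notebook's settings
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        end = min(start + tokens_per_chunk, len(tokens))
        chunks.append(list(tokens[start:end]))
        if end == len(tokens):
            break  # the last window reached the end of the text
        start += stride
    return chunks


# Placeholder tokens standing in for a tokenized document.
tokens = [f"t{i}" for i in range(250)]
chunks = sliding_window_chunks(tokens)
print([len(chunk) for chunk in chunks])
```

A larger overlap keeps more shared text between neighboring chunks, at the cost of producing more chunks to embed and store.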


## Embedding Generation and Vector Store Integration

This code segment generates embeddings for the document chunks and loads them into a vector store for similarity search and retrieval. It starts by selecting the device for the embedding model through `model_kwargs` (set to `"cpu"` in the cell below; use `"cuda"` to run on a GPU for acceleration) and disabling embedding normalization. Using the `HuggingFaceEmbeddings` class, it initializes the embedding model with these settings and the model name defined earlier (`embedding_model_name`).

The process records the start time for performance measurement, then proceeds to generate embeddings for the previously created document chunks. It employs the Milvus class to directly load these embeddings into a vector store, creating a new collection named `rafay_ragathon_2024`. Connection parameters for the Milvus server are specified, indicating where the vector store is hosted.

Finally, the script reports the time taken to compute the embeddings and load them into the Milvus vector store. Once loaded, the chunks can be retrieved by their semantic similarity to a query.

In [4]:
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": False}
hf_embeddings = HuggingFaceEmbeddings(
    model_name=embedding_model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)
start_time = time.time()
vectorstore = Milvus.from_documents(documents=document_chunks, embedding=hf_embeddings,
                                    collection_name="rafay_ragathon_2024",
                                    connection_args={"host": "milvus", "port": "19530"})
print(f"Computing the embeddings for each chunk and loading it to Milvus Vector Store. Executed in {time.time() - start_time} seconds") 

Computing the embeddings for each chunk and loading it to Milvus Vector Store. Executed in 718.3033623695374 seconds
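
The `normalize_embeddings` setting matters because it changes what a similarity score measures. The sketch below uses two-dimensional toy vectors rather than real embeddings, and `rank` is a hypothetical helper. It shows that an inner product over unnormalized vectors rewards vector length, while the same inner product over unit-length vectors equals cosine similarity. Which behavior applies in practice depends on the metric configured for the Milvus collection, which this notebook leaves at its default.

```python
import math
from typing import Dict, List, Sequence


def l2_norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = l2_norm(v)
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def rank(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    normalized: bool,
) -> List[str]:
    """Rank vector names by inner product with the query, best match first."""
    if normalized:
        query = normalize(query)
        vectors = {name: normalize(v) for name, v in vectors.items()}
    scores = {name: inner_product(query, v) for name, v in vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: "aligned" points the same way as the query, "long" is merely large.
query = [1.0, 0.0]
toy_vectors = {"aligned": [1.0, 0.0], "long": [5.0, 5.0]}

print(rank(query, toy_vectors, normalized=False))  # length dominates the score
print(rank(query, toy_vectors, normalized=True))   # direction dominates the score
```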


## Large Language Model Integration with Triton Inference Server

This code snippet integrates a large language model (LLM) served by NVIDIA Triton Inference Server, using TensorRT-LLM for optimized inference. It begins by importing components from the `langchain`, `langchain_core`, and `langchain_nvidia_trt` libraries, which provide prompt templates, pipeline building blocks, and a Triton-backed LLM client, respectively.

A server URL and a payload (`pload`) are then defined. The payload holds the token limit (500), the Triton server URL, and the model name (`ensemble`). In Triton deployments of TensorRT-LLM, `ensemble` is typically the name of the model pipeline that chains preprocessing (tokenization), the TensorRT-LLM engine, and postprocessing (detokenization), rather than a collection of different LLMs.

The TritonTensorRTLLM object is initialized with the provided payload, establishing a connection to the Triton server for executing the language model. Furthermore, a detailed prompt template (LLAMA_PROMPT_TEMPLATE) is crafted. This template instructs the model to use provided context and questions to generate helpful answers, emphasizing that the model should refrain from making up answers if uncertain. The template encapsulates instructions, context, and questions within specific formatting tags to guide the model's response generation.

Lastly, the PromptTemplate.from_template method is utilized to convert the string-based prompt template into a PromptTemplate object, ready for use with the initialized LLM. This configuration allows for dynamic interaction with the language model, facilitating customized and controlled generation of responses based on the input context and questions.

In [5]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import PromptTemplate
from langchain_nvidia_trt.llms import TritonTensorRTLLM

In [6]:
triton_url = "llm:8001"
pload = {
            'tokens':500,
            'server_url': triton_url,
            'model_name': "ensemble"
}
llm = TritonTensorRTLLM(**pload)


LLAMA_PROMPT_TEMPLATE = (
 # Llama 2 chat format: a single <s>[INST] ... [/INST] turn, with the system prompt inside the <<SYS>> tags.
 "<s>[INST] <<SYS>>\n"
 "Use the following context to answer the user's question. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n"
 "<</SYS>>\n\n"
 "Context: {context} Question: {question} Only return the helpful answer below and nothing else. Helpful answer:[/INST]"
)

llama_prompt = PromptTemplate.from_template(LLAMA_PROMPT_TEMPLATE)
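
To show what the model receives once the placeholders are filled, the sketch below renders a Llama 2 style chat prompt with plain `str.format`, standing in for `PromptTemplate`. The template text, the `render_prompt` helper, and the sample context are illustrative assumptions, not the notebook's exact prompt.

```python
from typing import Sequence

# Assumed system prompt, paraphrasing the instruction used in this notebook.
SYSTEM_PROMPT = (
    "Use the following context to answer the user's question. "
    "If you don't know the answer, just say that you don't know."
)

# Llama 2 chat layout: one [INST] turn with the system prompt inside <<SYS>> tags.
LLAMA2_CHAT_TEMPLATE = (
    "<s>[INST] <<SYS>>\n"
    "{system}\n"
    "<</SYS>>\n\n"
    "Context: {context}\n"
    "Question: {question} [/INST]"
)


def render_prompt(context_chunks: Sequence[str], question: str) -> str:
    """Join the retrieved chunks and fill the template (hypothetical helper)."""
    context = "\n".join(context_chunks)
    return LLAMA2_CHAT_TEMPLATE.format(
        system=SYSTEM_PROMPT, context=context, question=question
    )


prompt = render_prompt(
    ["Chunk one of the source document.", "Chunk two of the source document."],
    "What does the document describe?",
)
print(prompt)
```

Joining the chunks into one string before formatting keeps the prompt free of Python list or object representations, which would otherwise consume tokens without adding information.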


## Question Answering Pipeline with LLM and Data Retrieval

This code segment constructs and executes a question answering pipeline that combines the LLM with the retriever. The pipeline's first step is a dictionary with two entries: the retriever (`retriever`), which fetches relevant context for the question, and `RunnablePassthrough()`, which forwards the question unchanged.

The pipeline then applies the prompt defined earlier (`llama_prompt`) to format the question and context for the LLM. The formatted input is passed to the LLM (`llm`), which generates a response based on the context and the question.

The execution timing starts just before the question processing begins. The pipeline utilizes a streaming approach to handle the question, progressively appending each token generated by the LLM to the output string.

In [10]:
retriever = vectorstore.as_retriever()
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | llama_prompt
    | llm
)
start_time = time.time()
output = ""
question = "What is the total revenue of Apple in 2023?"
for token in chain.stream(question):
    output += token
    
print(f"Question processed and answered in {time.time() - start_time} seconds")
print("Output:")
print(output)

Question processed and answered in 0.6793711185455322 seconds
Output:
  Based on the provided documents, the total revenue of Apple in 2023 is $166,777 million.
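
The data flow through the chain can be sketched in plain Python. This is a conceptual stand-in, not LangChain's implementation: `fake_retriever` and `fake_llm_stream` are hypothetical stubs, and the three-sentence corpus is made up for illustration. The point is the order of operations: the question fans out to the retriever and to a passthrough, both results fill the prompt, and the LLM streams tokens back.

```python
from typing import Dict, Iterator, List

CORPUS = [
    "Milvus stores embedding vectors.",
    "Triton serves the language model.",
    "Ragas scores RAG answers.",
]
PROMPT = "Context: {context} Question: {question} Helpful answer:"


def _words(text: str) -> set:
    return {word.strip(".,?!").lower() for word in text.split()}


def fake_retriever(question: str, k: int = 2) -> List[str]:
    """Stub retriever: rank corpus entries by word overlap with the question."""
    query_words = _words(question)
    ranked = sorted(CORPUS, key=lambda doc: len(query_words & _words(doc)), reverse=True)
    return ranked[:k]


def build_inputs(question: str) -> Dict[str, object]:
    """Mirror the dict step: retrieve context and pass the question through unchanged."""
    return {"context": fake_retriever(question), "question": question}


def build_prompt(inputs: Dict[str, object]) -> str:
    return PROMPT.format(context="\n".join(inputs["context"]), question=inputs["question"])


def fake_llm_stream(prompt: str) -> Iterator[str]:
    """Stub LLM: yields a canned reply token by token, ignoring the prompt."""
    for token in ["Stub", " answer", "."]:
        yield token


def run_chain(question: str) -> Iterator[str]:
    yield from fake_llm_stream(build_prompt(build_inputs(question)))


question = "Which component stores vectors?"
answer = "".join(run_chain(question))
print(answer)
```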


## Evaluating Question Answering Performance with RAGAS Metrics

This code segment is designed to evaluate the performance of a question answering system using a dataset of questions and answers. It employs the RAGAS framework to assess the system's ability to provide relevant, faithful, and precise answers based on the retrieved context. Initially, it imports necessary libraries and modules, including pandas for data manipulation and specific metrics from the RAGAS package for evaluation.

**The process starts by loading a CSV file containing question and answer pairs, filtering the dataset for entries where the source chunk type is text and the question type aligns with a "Single-Doc Multi-Chunk RAG" scenario.** It then initializes a retriever from a vector store to fetch relevant documents based on the questions.

For each question in the dataset, the script retrieves contextual information, runs the question through a predefined question answering pipeline (chain), and captures both the generated answer and the inference time. This information is aggregated into a new DataFrame, which is then converted into a Dataset object suitable for evaluation with the RAGAS framework.

The evaluation process employs metrics such as context precision, faithfulness, answer relevancy, and context recall, providing a comprehensive assessment of the system's performance across these dimensions. The results of these metrics are averaged to give a general idea of the system's overall effectiveness in generating accurate and relevant responses.

In [11]:
import pandas as pd
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
from ragas import evaluate
from datasets import Dataset


In [15]:
qna_data = pd.read_csv("./qna_data.csv")
qna_data = qna_data[qna_data["Source Chunk Type"] == "Text"]


# This sample notebook evaluates only the "Single-Doc Multi-Chunk RAG" questions. For the hackathon, please evaluate on both single-document and multi-document questions.
qna_data = qna_data[qna_data["Question Type"] == "Single-Doc Multi-Chunk RAG"]

In [13]:
retriever = vectorstore.as_retriever()
_data = []
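for idx, row in qna_data.iterrows():
    # Columns are read by position: column 0 holds the question, column 4 the reference answer.
    question = row.iloc[0]
    answer = row.iloc[4]

    # Retrieve the context chunks that Ragas will score alongside the answer.
    _docs = retriever.get_relevant_documents(question)
    context = [_doc.page_content for _doc in _docs]

    try:
        # Stream the answer and time the full generation.
        start_time = time.time()
        output = ""
        for token in chain.stream(question):
            output += token
        inference_time = time.time() - start_time

        _data.append([question, answer, context, output, inference_time])
    except Exception as ex:
        # Skip questions that fail, but report which one failed.
        print(f"Error while answering {question!r}: {ex}")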
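
The streaming-and-timing pattern in the loop above can be isolated into a small helper. The sketch below is one possible way to do that, with assumptions stated plainly: `consume_stream` is a hypothetical helper, and `fake_stream` is a stub generator standing in for `chain.stream(question)`. It uses `time.perf_counter`, which is intended for measuring short durations.

```python
import time
from typing import Iterable, Iterator, Tuple


def consume_stream(tokens: Iterable[str]) -> Tuple[str, float]:
    """Join streamed tokens into one string and return it with the elapsed seconds."""
    start = time.perf_counter()
    output = "".join(tokens)  # iterating the stream drives the generation
    return output, time.perf_counter() - start


def fake_stream() -> Iterator[str]:
    """Stub token stream with a small artificial delay per token."""
    for token in ["The ", "answer ", "is ", "here."]:
        time.sleep(0.01)
        yield token


streamed_answer, seconds = consume_stream(fake_stream())
print(f"{streamed_answer!r} generated in {seconds:.3f} seconds")
```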

In [14]:
df_eval = pd.DataFrame(_data, columns=["question", "ground_truth", "contexts", "answer", "inference_time"])
eval_dataset = Dataset.from_pandas(df_eval)
result = evaluate(eval_dataset, metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],)
res = result.to_pandas()[["context_precision","faithfulness","answer_relevancy","context_recall"]]
# Count metrics that could not be computed (NaN) as 0.0 so they lower the average.
res = res.fillna(0.0)
res.mean(axis=0)

Evaluating: 100%|██████████| 112/112 [00:30<00:00,  3.65it/s]


context_precision    0.754960
faithfulness         0.479379
answer_relevancy     0.735310
context_recall       0.633929
dtype: float64
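
Replacing missing scores with `0.0` before averaging is a design choice that affects the reported numbers. The sketch below uses made-up scores, not results from this notebook, to compare that choice with skipping missing values, which is what pandas `mean` does by default. Both helper functions are hypothetical.

```python
import math
from statistics import fmean
from typing import List


def mean_with_nan_as_zero(scores: List[float]) -> float:
    """Average where a missing score counts as a failure (0.0)."""
    return fmean(0.0 if math.isnan(score) else score for score in scores)


def mean_skipping_nan(scores: List[float]) -> float:
    """Average over the scores that were actually computed."""
    valid = [score for score in scores if not math.isnan(score)]
    return fmean(valid) if valid else math.nan


# Made-up per-question scores; NaN marks a question whose metric could not be computed.
sample_scores = [0.8, math.nan, 0.6, 1.0]

print(f"NaN counted as 0.0: {mean_with_nan_as_zero(sample_scores):.3f}")
print(f"NaN skipped:        {mean_skipping_nan(sample_scores):.3f}")
```

Counting missing scores as zero is the stricter option: a pipeline that frequently fails to produce a scorable answer is penalized rather than having those questions silently dropped from the average.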

## Judging Criteria
RAG-a-Thon judging will be performed by Rafay and NVIDIA and will be based on 1) the performance and 2) the production readiness of your RAG application. Performance depends on your underlying RAG strategy, which is typically a combination of data chunking, embedding, retrieval, and text generation models. It will be assessed using the [Ragas open-source library](https://github.com/explodinggradients/ragas) for evaluating RAG. Production readiness will be judged based on the top considerations described here, including:

## Submission

To submit your RAG application’s performance, complete the following two steps:
1. Take a screenshot of your metrics and post it on both LinkedIn and X using the hashtag #RafayGTCRagaThon.
2. Email the metrics, along with the Jupyter Notebook containing your implementation, to [ragathon@rafay.co](mailto:ragathon@rafay.co).