# Building a Production-Ready RAG Pipeline

In this notebook you will learn to build a production-ready RAG pipeline over the `Attention Is All You Need` paper. We will use a `Sentence Window Index` to build a basic RAG pipeline, then iterate over different parameters to make it production ready.

The steps involved are:

1. Download Data.
2. Load Data.
3. Build Evaluation Dataset.
4. Download `RagEvaluatorPack`.
5. Define LLM, Embedding Model.
6. Build RAG with `Sentence Window` approach.
7. Evaluate RAG Pipeline.
8. Create functions to build index, evaluate.
9. Tune different parameters to improve metrics and make it production ready.

In [None]:
# !pip install openai
# !pip install llama-index
# !pip install llama-index-embeddings-huggingface
# !pip install llama-index-packs-rag-evaluator
# !pip install python-dotenv nest_asyncio


In [None]:
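# Patch asyncio so event loops can be nested: the notebook already runs a loop,
# and some synchronous LlamaIndex helpers start another one internally.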
import nest_asyncio

nest_asyncio.apply()

## Set `OpenAI` keys.

In [None]:
## ------ NOTE: Use this block when running the code on your local machine ------
import os
from dotenv import load_dotenv, find_dotenv

# find_dotenv() searches the current and parent directories for a `.env` file;
# pass an explicit path to load_dotenv() instead if your file lives elsewhere.
load_dotenv(find_dotenv())
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']

# ## ------ NOTE: Use this block when running the code on Google Colab (add the API key in the Secrets tab on the left) ------
# from google.colab import userdata
# import openai
# OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
# openai.api_key = OPENAI_API_KEY

## Hugging face login token:
# hugging_face_token = userdata.get('HUGGING_FACE_TOKEN')
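
Whichever way the key is supplied, failing early with a clear message is usually easier to debug than an authentication error raised deep inside an API call. The following is a minimal sketch of that idea. The helper name `require_env` and the `DEMO_API_KEY` variable are hypothetical and are not part of any library used in this notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error.

    Hypothetical helper for illustration only.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set or is empty.")
    return value


# Demo with a placeholder variable so the sketch runs without real credentials.
os.environ["DEMO_API_KEY"] = "sk-placeholder"
print(require_env("DEMO_API_KEY"))
```

In the notebook itself the same check could be applied to `OPENAI_API_KEY` right after the `.env` file is loaded.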

## Download and Load Data
* Download the `Attention Is All You Need` paper.
* We will use only the first 5 pages and skip the rest of the paper, including the references.

In [None]:
!mkdir -p './data'
!wget --user-agent="Mozilla" "https://arxiv.org/pdf/1706.03762.pdf" -O "./data/attention_is_all_you_need.pdf"

In [None]:
from llama_index.core import SimpleDirectoryReader

# Load the PDF (the reader returns one Document per page) and keep only the first 5 pages:
data = SimpleDirectoryReader('./data/').load_data()
documents = data[:5]

## Generate Evaluation dataset using `RagDatasetGenerator` and `GPT-4`

In [None]:
from llama_index.llms.openai import OpenAI
from llama_index.core.llama_dataset.generator import RagDatasetGenerator

# Configuring LLM for the RAG dataset generator:
llm_gpt4 = OpenAI(model='gpt-4-0125-preview', temperature=0.1)

# Creating dataset using RAG dataset generator:
dataset_generator = RagDatasetGenerator.from_documents(documents,
                                                       llm = llm_gpt4,
                                                       num_questions_per_chunk = 2,
                                                       show_progress = True)

eval_dataset = dataset_generator.generate_dataset_from_nodes()

## Define LLM

In [None]:
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

## Define Embedding Model

In [None]:
# embed_model = "local:BAAI/bge-small-en-v1.5"

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

## Build RAG pipeline with `SentenceWindow`

In [None]:
from llama_index.core.node_parser import SentenceWindowNodeParser

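# Create the sentence window node parser: each node is a single sentence, and a window of
# 1 sentence on either side is stored under the "window" metadata key.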
node_parser = SentenceWindowNodeParser.from_defaults(window_size=1,
                                                     window_metadata_key="window",
                                                     original_text_metadata_key="original_text")
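
To make the parser's role concrete: it turns every sentence into its own node and stores the neighbouring sentences in that node's metadata, so retrieval can match on a single sentence while the LLM later sees the wider window. The sketch below imitates that behaviour in plain Python. It is an illustration under simplifying assumptions (a naive regex sentence splitter, plain dictionaries instead of LlamaIndex nodes) and not the library's implementation; `sentence_windows` is a hypothetical name.

```python
import re
from typing import Dict, List


def sentence_windows(text: str, window_size: int = 1) -> List[Dict]:
    """Split `text` into sentences and attach a window of neighbours to each one.

    Simplified stand-in: sentences are split after '.', '!' or '?' followed by whitespace.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        # Take up to `window_size` sentences on each side, clipped at the text boundaries.
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append({
            "text": sentence,
            "metadata": {
                "original_text": sentence,
                "window": " ".join(sentences[start:end]),
            },
        })
    return nodes


sample = ("Attention maps a query to an output. The output is a weighted sum. "
          "Weights come from a compatibility function.")
for node in sentence_windows(sample, window_size=1):
    print(node["text"], "->", node["metadata"]["window"])
```

A larger `window_size` leaves the retrieval unit unchanged (still one sentence) but increases how much surrounding text is handed to the LLM, which is the trade-off explored in the experiments later on.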

In [None]:
# Joining all the documents into a single document:
from llama_index.core import Document

document = Document(text="\n\n".join([doc.text for doc in documents]))

In [None]:
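# Creating vector store index. The node parser must be passed via `transformations`;
# otherwise the default splitter is used and no "window" metadata is stored:
from llama_index.core import VectorStoreIndex

sentence_index = VectorStoreIndex.from_documents([document], transformations = [node_parser],
                                                 embed_model = embed_model, llm = llm)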

In [None]:
# Configuring the sentence transformer reranker using the BAAI/bge-reranker-base model:
from llama_index.core.postprocessor import SentenceTransformerRerank

rerank = SentenceTransformerRerank(top_n=2, model="BAAI/bge-reranker-base")

In [None]:
# Configuring the metadata replacement post-processor, which swaps each retrieved
# sentence for the window stored in its metadata:
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

postproc = MetadataReplacementPostProcessor(target_metadata_key="window")

In [None]:
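# The query engine is created from the sentence index; post-processors run in list order
# (window replacement first, then reranking), and `llm` generates the final answer:
query_engine = sentence_index.as_query_engine(llm=llm, similarity_top_k=2, node_postprocessors=[postproc, rerank])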
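
The two post-processors run in the order they are listed: each retrieved sentence is first replaced by its stored window, and the reranker then rescores those windows against the query and keeps the best `top_n`. The sketch below walks through that flow in plain Python. It is only an illustration: a toy word-overlap score stands in for the `BAAI/bge-reranker-base` cross-encoder, and `replace_with_window`, `overlap_score` and `rerank` are hypothetical helpers, not LlamaIndex APIs.

```python
import re
from typing import Dict, List


def replace_with_window(nodes: List[Dict], key: str = "window") -> List[Dict]:
    """Swap each node's text for the value stored under metadata[key], if present."""
    return [{**n, "text": n["metadata"].get(key, n["text"])} for n in nodes]


def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of distinct query words that occur in the text."""
    query_words = set(re.findall(r"\w+", query.lower()))
    text_words = set(re.findall(r"\w+", text.lower()))
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def rerank(query: str, nodes: List[Dict], top_n: int = 2) -> List[Dict]:
    """Rescore nodes against the query and keep the top_n highest scoring ones."""
    ranked = sorted(nodes, key=lambda n: overlap_score(query, n["text"]), reverse=True)
    return ranked[:top_n]


retrieved = [
    {"text": "We thank reviewers.",
     "metadata": {"window": "We thank reviewers. Code is available."}},
    {"text": "Training took days.",
     "metadata": {"window": "Training took days. The Transformer trains fast."}},
    {"text": "It uses attention.",
     "metadata": {"window": "The Transformer is new. It uses attention."}},
]
query = "which attention does the transformer use"
for node in rerank(query, replace_with_window(retrieved), top_n=2):
    print(node["text"])
```

Note that when `similarity_top_k` equals the reranker's `top_n`, as in the cell above, the reranker can only reorder the retrieved nodes; it starts filtering once more nodes are retrieved than are kept.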

In [None]:
# Response generation using the above created query engine:
response = query_engine.query('is the paper from google research?')
print(response)

## Evaluate RAG pipeline

In [None]:
from llama_index.packs.rag_evaluator import RagEvaluatorPack

rag_evaluator_pack = RagEvaluatorPack(rag_dataset = eval_dataset,
                                      query_engine = query_engine,
                                      judge_llm = llm_gpt4)

base_benchmark = await rag_evaluator_pack.arun(batch_size=10,  # number of OpenAI API calls to make per batch
                                               sleep_time_in_seconds=1)  # seconds to pause between batches

In [None]:
base_benchmark

## Create Functions to Build and Evaluate the RAG Pipeline

Wrapping these steps in functions makes it easier to iterate over different parameter settings.

In [None]:
## Function to create index:
def build_index(documents, llm = OpenAI(model='gpt-3.5-turbo', temperature=0.1),
                embed_model = "local:BAAI/bge-small-en-v1.5", sentence_window_size = 3):

    # Create the sentence window node parser with the requested window size:
    node_parser = SentenceWindowNodeParser.from_defaults(window_size = sentence_window_size,
                                                         window_metadata_key = "window",
                                                         original_text_metadata_key = "original_text")

    # Create the index from the documents. The node parser must be passed via `transformations`;
    # otherwise the default splitter is used and no "window" metadata is stored:
    sentence_index = VectorStoreIndex.from_documents(documents, transformations = [node_parser],
                                                     embed_model = embed_model, llm = llm)

    return sentence_index

# --------------------------------------------------

## Function to create a query engine:
def setup_query_engine(sentence_index, similarity_top_k=2, rerank_top_n=2, is_rerank = False):

    # Defining Metadata replacement post processor:
    postproc = MetadataReplacementPostProcessor(target_metadata_key="window")

    if is_rerank:
        # Defining re-rank post processor:
        rerank = SentenceTransformerRerank(top_n = rerank_top_n,
                                           model = "BAAI/bge-reranker-base")
        # Building query engine using metadata postproc followed by re-rank:
        query_engine = sentence_index.as_query_engine(similarity_top_k = similarity_top_k,
                                                      node_postprocessors = [postproc, rerank])
    else:
        # Building query engine using just metadata postproc:
        query_engine = sentence_index.as_query_engine(similarity_top_k = similarity_top_k,
                                                      node_postprocessors = [postproc])
    return query_engine

# --------------------------------------------------

## Final function to create a RAG system (Index + Query engine):
def build_index_query_engine(window_size, similarity_top_k, rerank_similarity_top_k, rerank_top_k):

  # Creating index using the 'build_index' function (relies on the joined `document` defined earlier):
  sentence_index = build_index([document], sentence_window_size = window_size)

  # Creating query engine without re-rank using 'setup_query_engine' function:
  query_engine = setup_query_engine(sentence_index,
                                    similarity_top_k = similarity_top_k)

  # Creating query engine with re-rank using 'setup_query_engine' function:
  query_engine_rerank = setup_query_engine(sentence_index,
                                            similarity_top_k = rerank_similarity_top_k,
                                            rerank_top_n = rerank_top_k,
                                            is_rerank = True)

  return sentence_index, query_engine, query_engine_rerank

# --------------------------------------------------

## Function to evaluate RAG system (index + query engine):
async def evaluate_rag(query_engine):

  # Defining RAG evaluator pack:
  rag_evaluator_pack = RagEvaluatorPack(rag_dataset = eval_dataset,
                                        query_engine = query_engine)
  # Running RAG evaluator pack:
  benchmark_df = await rag_evaluator_pack.arun(batch_size = 10,  # number of OpenAI API calls to make per batch
                                               sleep_time_in_seconds = 2)  # seconds to pause between batches

  return benchmark_df


In [None]:
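# Building re-rank and non-rerank query engines
# (window_size=1, similarity_top_k=2, rerank_similarity_top_k=4, rerank_top_k=2):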
index, query_engine, query_engine_rerank = build_index_query_engine(1, 2, 4, 2)
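
Because `build_index_query_engine` takes four positional integers, a call such as `(1, 2, 4, 2)` is easy to misread. One possible way to keep experiments legible is to name the values in a small config object. The sketch below assumes a hypothetical `ExperimentConfig` dataclass; it is not part of the notebook's functions or of LlamaIndex, and the validation rule is an assumption about what makes a sensible configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Named version of the four positional arguments (hypothetical helper)."""
    window_size: int
    similarity_top_k: int
    rerank_similarity_top_k: int
    rerank_top_k: int

    def __post_init__(self) -> None:
        # The reranker cannot keep more nodes than the retriever returned.
        if self.rerank_top_k > self.rerank_similarity_top_k:
            raise ValueError("rerank_top_k must not exceed rerank_similarity_top_k")

    def as_args(self) -> tuple:
        """Return the values in the order build_index_query_engine expects."""
        return (self.window_size, self.similarity_top_k,
                self.rerank_similarity_top_k, self.rerank_top_k)


baseline = ExperimentConfig(window_size=1, similarity_top_k=2,
                            rerank_similarity_top_k=4, rerank_top_k=2)
print(baseline.as_args())  # (1, 2, 4, 2)
```

With this in place, the call above could be written as `build_index_query_engine(*baseline.as_args())`.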

In [None]:
# Evaluating query engine without re-rank:
base_benchmark = await evaluate_rag(query_engine)
display(base_benchmark)

print('\n')

# Evaluating query engine with re-rank:
rerank_benchmark = await evaluate_rag(query_engine_rerank)
display(rerank_benchmark)

From the metrics we can observe that the `correctness` score is on the low side (the maximum is 5). Adding the `reranker` improved the other metrics, though it decreased `correctness`.

It is interesting to see that there are no hallucinations: the `faithfulness` score is 1.0, meaning every response was judged to be supported by the retrieved context.

# Tune parameters to make the RAG system production ready

* Let's aim for a `correctness` score of at least `4.4` and a `relevancy` score above `0.9`.
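
Writing the goal down as a small check makes it unambiguous when an experiment has succeeded. The sketch below is one way to do that, assuming the mean scores have been copied into a plain dictionary; `meets_targets` and the constant names are hypothetical, and treating the `correctness` goal as "at least 4.4" is an interpretation of the target. The two score pairs used in the demo are the ones reported for Experiments 2 and 4 in this notebook.

```python
from typing import Dict

# Targets for this notebook: correctness of at least 4.4 (out of 5), relevancy above 0.9.
TARGET_CORRECTNESS = 4.4
TARGET_RELEVANCY = 0.9


def meets_targets(scores: Dict[str, float]) -> bool:
    """Return True when both the correctness and the relevancy targets are met."""
    return (scores["correctness"] >= TARGET_CORRECTNESS
            and scores["relevancy"] > TARGET_RELEVANCY)


print(meets_targets({"correctness": 4.25, "relevancy": 1.0}))  # False
print(meets_targets({"correctness": 4.4, "relevancy": 1.0}))   # True
```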

### Experiment 1:

Let's increase the window size from 1 to 3 and see whether correctness improves, since a larger window gives the LLM more surrounding context.

In [None]:
# Building re-rank and non-rerank query engines:
index, query_engine, query_engine_rerank = build_index_query_engine(3, 2, 4, 2)

In [None]:
# Evaluating query engine without re-rank:
base_benchmark = await evaluate_rag(query_engine)
display(base_benchmark)

print('\n')

# Evaluating query engine with re-rank:
rerank_benchmark = await evaluate_rag(query_engine_rerank)
display(rerank_benchmark)

### Experiment 2:

Let's increase the `similarity_top_k` and `rerank_top_n` values and see if retrieving more relevant contexts improves the results.

In [None]:
# Building re-rank and non-rerank query engines:
index, query_engine, query_engine_rerank = build_index_query_engine(3, 4, 8, 4)

In [None]:
# Evaluating query engine without re-rank:
base_benchmark = await evaluate_rag(query_engine)
display(base_benchmark)

print('\n')

# Evaluating query engine with re-rank:
rerank_benchmark = await evaluate_rag(query_engine_rerank)
display(rerank_benchmark)

We are close to our goal: we reached a `correctness` score of `4.25` and a `relevancy` score of `1.0`. The `context similarity score` also improved.

### Experiment 3:

Let's now increase the window size further, from 3 to 5, and see if it improves the metrics.

In [None]:
# Building re-rank and non-rerank query engines:
index, query_engine, query_engine_rerank = build_index_query_engine(5, 4, 8, 4)

In [None]:
# Evaluating query engine without re-rank:
base_benchmark = await evaluate_rag(query_engine)
display(base_benchmark)

print('\n')

# Evaluating query engine with re-rank:
rerank_benchmark = await evaluate_rag(query_engine_rerank)
display(rerank_benchmark)

### Experiment 4:

Let's increase the `similarity_top_k` and `rerank_top_n` values once more and see if retrieving even more relevant contexts improves the results.

In [None]:
# Building re-rank and non-rerank query engines:
index, query_engine, query_engine_rerank = build_index_query_engine(5, 6, 12, 6)

In [None]:
# Evaluating query engine without re-rank:
base_benchmark = await evaluate_rag(query_engine)
display(base_benchmark)

print('\n')

# Evaluating query engine with re-rank:
rerank_benchmark = await evaluate_rag(query_engine_rerank)
display(rerank_benchmark)

We have reached our goal: a `correctness` score of `4.4` and a `relevancy` score of `1.0` (>0.9).

## Observation:

In this project, we built a RAG pipeline, generated an evaluation dataset, and tuned different parameters to make the pipeline production ready. Note that the `reranker` improved the metrics in most of the experiments.

Keep in mind that there are various other parameters to experiment with, such as `chunk_size`, `chunk_overlap`, the embedding model, and the LLM.
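
When several such parameters are varied together, it can help to enumerate the combinations up front to see how many evaluation runs are implied before spending API calls on them. The sketch below builds a small grid with `itertools.product`. All candidate values are hypothetical and chosen only for illustration; they are not recommendations and were not evaluated in this notebook.

```python
from itertools import product

# Hypothetical candidate values, chosen only for illustration.
chunk_sizes = [256, 512, 1024]
chunk_overlaps = [0, 50]
window_sizes = [1, 3, 5]

grid = [
    {"chunk_size": cs, "chunk_overlap": co, "window_size": ws}
    for cs, co, ws in product(chunk_sizes, chunk_overlaps, window_sizes)
    if co < cs  # an overlap must be smaller than the chunk itself
]
print(len(grid))  # 18
print(grid[0])
```

Since every configuration triggers a full evaluation with an LLM judge, even a modest grid like this one adds up quickly, so it may be worth pruning it before running.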