# Building a Production-Ready RAG Pipeline

In this notebook you will learn to build a production-ready RAG pipeline over the `Attention Is All You Need` paper. We will build a basic RAG pipeline with a `Sentence Window` index, then iterate over different parameters to make it production ready.

The steps are:

1. Download the data.
2. Load the data.
3. Build an evaluation dataset.
4. Download `RagEvaluatorPack`.
5. Define the LLM and the embedding model.
6. Build a RAG pipeline with the `Sentence Window` approach.
7. Evaluate the RAG pipeline.
8. Create functions to build the index and run the evaluation.
9. Tune parameters to improve the metrics and make the pipeline production ready.

## Setup

Install the libraries.

In [43]:
!pip install llama-index pypdf torch sentence-transformers llama-index-embeddings-huggingface


## Download the `Attention Is All You Need` paper

In [44]:
!mkdir -p './data'
!wget --user-agent="Mozilla" "https://arxiv.org/pdf/1706.03762.pdf" -O "./data/attention_is_all_you_need.pdf"

--2024-06-08 15:41:28--  https://arxiv.org/pdf/1706.03762.pdf
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.195.42, 151.101.131.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://arxiv.org/pdf/1706.03762 [following]
--2024-06-08 15:41:28--  http://arxiv.org/pdf/1706.03762
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2215244 (2.1M) [application/pdf]
Saving to: ‘./data/attention_is_all_you_need.pdf’


2024-06-08 15:41:28 (26.5 MB/s) - ‘./data/attention_is_all_you_need.pdf’ saved [2215244/2215244]



## Set `OpenAI` keys.

In [45]:
import nest_asyncio

nest_asyncio.apply()

from google.colab import userdata
import os
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY_1')

## Load Data.

We will use only the first 5 pages of the paper, which leaves out the references.

In [46]:
from llama_index.core import SimpleDirectoryReader

data = SimpleDirectoryReader('./data/').load_data()

documents = data[:5]

## Generate Evaluation dataset using `RagDatasetGenerator` and `GPT-4`

In [47]:
from llama_index.core import ServiceContext
from llama_index.llms.openai import OpenAI
from llama_index.core.llama_dataset.generator import RagDatasetGenerator

gpt4 = OpenAI(model='gpt-4', temperature=0.1)
service_context_gpt4 = ServiceContext.from_defaults(llm=gpt4)

dataset_generator = RagDatasetGenerator.from_documents(
    documents,
    service_context=service_context_gpt4,
    num_questions_per_chunk=2,
    show_progress=True,
)

eval_dataset = dataset_generator.generate_dataset_from_nodes()



## Download `RagEvaluatorPack` for evaluation.

In [48]:
from llama_index.core.llama_pack import download_llama_pack

RagEvaluatorPack = download_llama_pack(
  "RagEvaluatorPack", "./rag_evaluator_pack"
)

## Define LLM.

In [49]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

## Define Embedding Model

In [50]:
embed_model = "local:BAAI/bge-small-en-v1.5"

## Build RAG pipeline with `SentenceWindow`

In [51]:
from llama_index.core.node_parser import SentenceWindowNodeParser

# create the sentence window node parser with a window of 1 sentence on each side
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=1,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
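
To make the idea concrete, here is a minimal pure-Python sketch of what a sentence window is. It is not the `SentenceWindowNodeParser` implementation: the helper `sentence_windows` and its naive regex sentence splitter are assumptions made purely for illustration. Only the two metadata key names are taken from the cell above.

```python
import re


def sentence_windows(text, window_size=1):
    """Return one record per sentence, together with its surrounding window.

    Hypothetical helper for illustration only: the regex splitter below is
    far cruder than a real sentence tokenizer.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    records = []
    for i, sentence in enumerate(sentences):
        # Take up to `window_size` sentences on each side of sentence i.
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        records.append(
            {
                "original_text": sentence,  # what gets embedded and matched
                "window": " ".join(sentences[start:end]),  # what the LLM later reads
            }
        )
    return records


text = (
    "The encoder has six layers. The decoder has six layers. "
    "Each layer uses attention. Training took twelve hours."
)
records = sentence_windows(text, window_size=1)
for record in records:
    print(record)
```

Each sentence is matched on its own, while the wider `window` is what can be handed to the LLM afterwards. A larger `window_size` gives more surrounding context per match.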

In [52]:
from llama_index.core import ServiceContext

sentence_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    node_parser=node_parser,
)


In [54]:
from llama_index.core import Document

document = Document(text="\n\n".join([doc.text for doc in documents]))

In [55]:
from llama_index.core import VectorStoreIndex

sentence_index = VectorStoreIndex.from_documents(
    [document], service_context=sentence_context
)

In [56]:
from llama_index.core.indices.postprocessor import SentenceTransformerRerank

rerank = SentenceTransformerRerank(
    top_n=2, model="BAAI/bge-reranker-base"
)




In [57]:
from llama_index.core.indices.postprocessor import MetadataReplacementPostProcessor

postproc = MetadataReplacementPostProcessor(
    target_metadata_key="window"
)

In [58]:
query_engine = sentence_index.as_query_engine(
    similarity_top_k=2, node_postprocessors=[postproc, rerank]
)
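
The order of `node_postprocessors` matters: the postprocessors run in sequence, so the window replacement happens first and the reranker then scores the wider window text. The sketch below mimics that flow with toy data. The word-overlap score stands in for both the embedding similarity and the cross-encoder reranker, and every function and node here is hypothetical; none of it is the llama-index implementation.

```python
def overlap_score(query, text):
    """Toy relevance score: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)


def retrieve(query, nodes, similarity_top_k):
    """Step 1: rank nodes by their single-sentence text and keep the top k."""
    ranked = sorted(nodes, key=lambda n: overlap_score(query, n["text"]), reverse=True)
    return ranked[:similarity_top_k]


def replace_with_window(nodes):
    """Step 2: swap each node's text for its stored window."""
    return [{**n, "text": n["window"]} for n in nodes]


def rerank(query, nodes, top_n):
    """Step 3: re-score the (now wider) text and keep the top n."""
    ranked = sorted(nodes, key=lambda n: overlap_score(query, n["text"]), reverse=True)
    return ranked[:top_n]


# Toy nodes (assumed data): one sentence each, plus its surrounding window.
nodes = [
    {
        "text": "the encoder has six layers",
        "window": "the model is a transformer the encoder has six layers",
    },
    {
        "text": "the decoder has six layers",
        "window": "the encoder has six layers the decoder has six layers each layer uses attention",
    },
    {
        "text": "we train on eight gpus",
        "window": "results are reported below we train on eight gpus training took twelve hours",
    },
]

query = "how many layers does the decoder have"
candidates = retrieve(query, nodes, similarity_top_k=2)
final = rerank(query, replace_with_window(candidates), top_n=1)
print(final[0]["text"])
```

With `similarity_top_k` equal to `top_n`, as in the cell above, the reranker can only reorder the retrieved nodes; it filters them only when `similarity_top_k` is larger than `top_n`.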

In [59]:
response = query_engine.query('is the paper from google research?')
print(response)

Yes.


## Evaluate RAG pipeline

In [60]:
rag_evaluator_pack = RagEvaluatorPack(
    rag_dataset=eval_dataset,
    query_engine=query_engine
)

base_benchmark = await rag_evaluator_pack.arun(
    batch_size=10,  # batches the number of openai api calls to make
    sleep_time_in_seconds=1,  # seconds to sleep before making an api call
)
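
As a rough mental model of `batch_size` and `sleep_time_in_seconds`, the sketch below groups calls into batches, runs each batch concurrently, and pauses before each batch. This is one plausible reading of the two comments above, not the `RagEvaluatorPack` source; `fake_api_call` and `run_in_batches` are hypothetical names and no network request is made.

```python
import asyncio


async def fake_api_call(item):
    """Stand-in for one LLM request (hypothetical; no network involved)."""
    await asyncio.sleep(0)
    return item * 2


async def run_in_batches(items, batch_size, sleep_time_in_seconds):
    """Run calls concurrently within a batch, pausing before each batch."""
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # Pause first, so bursts of requests are spaced out.
        await asyncio.sleep(sleep_time_in_seconds)
        results.extend(await asyncio.gather(*(fake_api_call(x) for x in batch)))
    return results


# In a plain script, asyncio.run drives the coroutine; inside a notebook you
# would typically `await run_in_batches(...)` directly instead.
results = asyncio.run(
    run_in_batches(list(range(25)), batch_size=10, sleep_time_in_seconds=0.01)
)
print(len(results))
```

A larger `batch_size` finishes sooner but sends more requests at once, which makes rate-limit errors more likely.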




In [61]:
base_benchmark

| metrics | base_rag |
|---|---|
| mean_correctness_score | 3.75 |
| mean_relevancy_score | 0.9 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.940403 |


## Create Functions to build RAG pipeline and Evaluation.

Wrapping index construction, query-engine setup, and evaluation in functions makes it easier to iterate over parameter settings.

In [62]:
def build_index(
    documents,
    llm=None,
    embed_model="local:BAAI/bge-small-en-v1.5",
    sentence_window_size=3,
):
    """Build a sentence-window vector index over `documents`."""
    # Create the LLM lazily so it is not instantiated at function-definition time.
    if llm is None:
        llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

    # Sentence window node parser: each node is one sentence, with
    # `sentence_window_size` sentences on each side stored under "window".
    node_parser = SentenceWindowNodeParser.from_defaults(
        window_size=sentence_window_size,
        window_metadata_key="window",
        original_text_metadata_key="original_text",
    )
    sentence_context = ServiceContext.from_defaults(
        llm=llm,
        embed_model=embed_model,
        node_parser=node_parser,
    )
    return VectorStoreIndex.from_documents(
        documents, service_context=sentence_context
    )


def setup_query_engine(
    sentence_index, similarity_top_k=2, rerank_top_n=2, is_rerank=False
):
    """Create a query engine, optionally adding a reranker after retrieval."""
    # Always replace each retrieved sentence with its surrounding window.
    node_postprocessors = [
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ]
    if is_rerank:
        # Rerank the retrieved windows and keep only the best `rerank_top_n`.
        node_postprocessors.append(
            SentenceTransformerRerank(
                top_n=rerank_top_n, model="BAAI/bge-reranker-base"
            )
        )
    return sentence_index.as_query_engine(
        similarity_top_k=similarity_top_k,
        node_postprocessors=node_postprocessors,
    )


async def evaluate(query_engine):
    """Run RagEvaluatorPack on the shared `eval_dataset` and return the scores."""
    rag_evaluator_pack = RagEvaluatorPack(
        rag_dataset=eval_dataset, query_engine=query_engine
    )
    return await rag_evaluator_pack.arun(
        batch_size=10,  # batches the number of openai api calls to make
        sleep_time_in_seconds=1,  # seconds to sleep before making an api call
    )


def build_index_query_engine(
    window_size, similarity_top_k, rerank_similarity_top_k, rerank_top_k
):
    """Build one index plus a plain and a reranked query engine over it."""
    # Uses the merged `document` defined earlier in the notebook.
    sentence_index = build_index([document], sentence_window_size=window_size)
    query_engine = setup_query_engine(
        sentence_index, similarity_top_k=similarity_top_k
    )
    query_engine_rerank = setup_query_engine(
        sentence_index,
        similarity_top_k=rerank_similarity_top_k,
        rerank_top_n=rerank_top_k,
        is_rerank=True,
    )
    return sentence_index, query_engine, query_engine_rerank

In [63]:
index, query_engine, query_engine_rerank = build_index_query_engine(1, 2, 4, 2)



In [64]:
base_benchmark = await evaluate(query_engine)



In [65]:
base_benchmark_rerank = await evaluate(query_engine_rerank)



In [66]:
base_benchmark

| metrics | base_rag |
|---|---|
| mean_correctness_score | 4.15 |
| mean_relevancy_score | 1.0 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.930138 |


In [67]:
base_benchmark_rerank

| metrics | base_rag |
|---|---|
| mean_correctness_score | 3.65 |
| mean_relevancy_score | 1.0 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.935733 |


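From the metrics we can observe that the `correctness` score (4.15 out of a maximum of 5) still has room to improve. Adding the `reranker` slightly improved `context similarity` (0.930 to 0.936) but lowered `correctness` (4.15 to 3.65).

The `faithfulness` score of 1.0 is encouraging: the evaluator judged every answer to be supported by the retrieved context, i.e. it found no hallucinations.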

## Tune parameters to make it production ready.

Let's aim for a `correctness` score of at least `4.5` and a `relevancy` score above `0.9`.
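
The target can be written down as a small check. This is a hedged sketch: `meets_target` is a hypothetical helper, and the plain dict stands in for the benchmark output, whose exact DataFrame layout is not assumed here. The baseline numbers are the ones reported above for the run without the reranker.

```python
def meets_target(scores, min_correctness=4.5, min_relevancy=0.9):
    """Return True when correctness >= its target and relevancy > its target."""
    return (
        scores["mean_correctness_score"] >= min_correctness
        and scores["mean_relevancy_score"] > min_relevancy
    )


# Baseline scores from the run without the reranker.
baseline = {"mean_correctness_score": 4.15, "mean_relevancy_score": 1.0}
print(meets_target(baseline))  # False: correctness 4.15 is below 4.5
```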

### Experiment 1:

Let's increase the window size from 1 to 3 and see whether the extra surrounding context improves correctness.

In [68]:
index, query_engine, query_engine_rerank = build_index_query_engine(3, 2, 4, 2)



In [69]:
benchmark = await evaluate(query_engine)



In [70]:
benchmark_rerank = await evaluate(query_engine_rerank)



In [71]:
benchmark

| metrics | base_rag |
|---|---|
| mean_correctness_score | 4.2 |
| mean_relevancy_score | 1.0 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.946689 |


In [72]:
benchmark_rerank

| metrics | base_rag |
|---|---|
| mean_correctness_score | 4.1 |
| mean_relevancy_score | 1.0 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.954488 |


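The metrics did improve: `correctness` rose from 4.15 to 4.2 without the reranker and from 3.65 to 4.1 with it, and `context similarity` rose in both cases.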

### Experiment 2:

Let's increase the `similarity_top_k` and `rerank_top_n` values and see if retrieving more relevant contexts improves the result.

In [73]:
index, query_engine, query_engine_rerank = build_index_query_engine(3, 4, 8, 4)



In [74]:
benchmark = await evaluate(query_engine)



In [75]:
benchmark_rerank = await evaluate(query_engine_rerank)



In [76]:
benchmark

| metrics | base_rag |
|---|---|
| mean_correctness_score | 4.3 |
| mean_relevancy_score | 0.9 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.948439 |


In [77]:
benchmark_rerank

| metrics | base_rag |
|---|---|
| mean_correctness_score | 4.2 |
| mean_relevancy_score | 1.0 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.953095 |


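We are close to our goal. Without the reranker we reached a `correctness` score of `4.3` with a `relevancy` score of `0.9`, and the `context similarity` score also improved slightly (0.947 to 0.948). With the reranker we reached a `correctness` score of `4.2` with a `relevancy` score of `1.0`.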

### Experiment 3:

Let's now increase the window size from 3 to 5 and see if it improves the metrics.

In [78]:
index, query_engine, query_engine_rerank = build_index_query_engine(5, 4, 8, 4)



In [79]:
benchmark = await evaluate(query_engine)



In [80]:
benchmark_rerank = await evaluate(query_engine_rerank)



In [81]:
benchmark

| metrics | base_rag |
|---|---|
| mean_correctness_score | 4.15 |
| mean_relevancy_score | 1.0 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.953137 |


In [82]:
benchmark_rerank

| metrics | base_rag |
|---|---|
| mean_correctness_score | 4.15 |
| mean_relevancy_score | 1.0 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.95222 |


We have reached a `relevancy` score of `1.0` with and without the reranker, although `correctness` dropped to `4.15` in both cases.

### Experiment 4:

Let's increase the `similarity_top_k` and `rerank_top_n` values again and see if retrieving more relevant contexts improves the result. This time we evaluate only the reranked query engine.

In [83]:
index, query_engine, query_engine_rerank = build_index_query_engine(5, 6, 12, 6)



In [84]:
benchmark_rerank = await evaluate(query_engine_rerank)



In [85]:
benchmark_rerank

| metrics | base_rag |
|---|---|
| mean_correctness_score | 4.4 |
| mean_relevancy_score | 1.0 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.953447 |


We reached a `correctness` score of `4.4`, just short of our target of `4.5`, and met the `relevancy` target with a score of `1.0` (>0.9).

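## Observations

In this project, we built a RAG pipeline and an evaluation dataset, then tuned different parameters to move the pipeline toward production readiness. The `reranker` gave equal or higher `relevancy` in every side-by-side comparison and higher `context similarity` in most of them, but it did not improve `correctness` in any of them. The best `correctness` (`4.4`) came from the reranked run with the largest window size and top-k values.

Remember that there are other parameters to experiment with, such as `chunk_size`, `chunk_overlap`, the embedding model, and the LLM.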
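
To compare the runs at a glance, the scores reported in this notebook can be collected into a small list and ranked. This is only a bookkeeping sketch: the run names, the `runs` structure, and the helper `best_run` are hypothetical, while the numbers are copied from the benchmark tables produced by the function-based runs.

```python
# "top_k" is similarity_top_k; "top_n" is rerank_top_n (None means no reranker).
runs = [
    {"name": "baseline", "window": 1, "top_k": 2, "top_n": None,
     "correctness": 4.15, "relevancy": 1.0, "context_similarity": 0.930138},
    {"name": "baseline + rerank", "window": 1, "top_k": 4, "top_n": 2,
     "correctness": 3.65, "relevancy": 1.0, "context_similarity": 0.935733},
    {"name": "exp1", "window": 3, "top_k": 2, "top_n": None,
     "correctness": 4.2, "relevancy": 1.0, "context_similarity": 0.946689},
    {"name": "exp1 + rerank", "window": 3, "top_k": 4, "top_n": 2,
     "correctness": 4.1, "relevancy": 1.0, "context_similarity": 0.954488},
    {"name": "exp2", "window": 3, "top_k": 4, "top_n": None,
     "correctness": 4.3, "relevancy": 0.9, "context_similarity": 0.948439},
    {"name": "exp2 + rerank", "window": 3, "top_k": 8, "top_n": 4,
     "correctness": 4.2, "relevancy": 1.0, "context_similarity": 0.953095},
    {"name": "exp3", "window": 5, "top_k": 4, "top_n": None,
     "correctness": 4.15, "relevancy": 1.0, "context_similarity": 0.953137},
    {"name": "exp3 + rerank", "window": 5, "top_k": 8, "top_n": 4,
     "correctness": 4.15, "relevancy": 1.0, "context_similarity": 0.95222},
    {"name": "exp4 + rerank", "window": 5, "top_k": 12, "top_n": 6,
     "correctness": 4.4, "relevancy": 1.0, "context_similarity": 0.953447},
]


def best_run(runs, metric="correctness"):
    """Return the run with the highest value for `metric`."""
    return max(runs, key=lambda run: run[metric])


best = best_run(runs)
# Distance between the best correctness and the 4.5 target set earlier.
shortfall = round(4.5 - best["correctness"], 2)

for run in sorted(runs, key=lambda r: r["correctness"], reverse=True):
    print(f'{run["name"]:<18} correctness={run["correctness"]:<5} '
          f'relevancy={run["relevancy"]:<4} '
          f'context_similarity={run["context_similarity"]}')
print("best:", best["name"], "shortfall vs 4.5 target:", shortfall)
```

Keeping the parameters next to the scores makes it easier to see which change accompanied each movement in the metrics.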