# Building a Production-Ready RAG Pipeline

In this notebook you will learn to build a production-ready RAG pipeline over the `Attention Is All You Need` paper. We start with a basic RAG pipeline built on a `Sentence Window Index`, then iterate over different parameters to make it production ready.

The steps involved are:

1. Download the data.
2. Load the data.
3. Build the evaluation dataset.
4. Download `RagEvaluatorPack`.
5. Define the LLM and the embedding model.
6. Build the RAG pipeline with the `Sentence Window` approach.
7. Evaluate the RAG pipeline.
8. Create functions to build the index and run the evaluation.
9. Tune different parameters to improve the metrics and make the pipeline production ready.

## Setup

Install the libraries.

In [None]:
!pip install llama-index pypdf torch sentence-transformers

Collecting llama-index
  Downloading llama_index-0.9.21-py3-none-any.whl (15.7 MB)
Collecting pypdf
  Downloading pypdf-3.17.4-py3-none-any.whl (278 kB)
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting beautifulsoup4<5.0.0,>=4.12.2 (from llama-index)
  Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
Collecting dataclasses-json (from llama-index)
  Downloading dataclasses_json

## Download the `Attention Is All You Need` paper.

In [None]:
!mkdir -p './data'
!wget --user-agent="Mozilla" "https://arxiv.org/pdf/1706.03762.pdf" -O "./data/attention_is_all_you_need.pdf"

--2023-12-25 10:28:54--  https://arxiv.org/pdf/1706.03762.pdf
Resolving arxiv.org (arxiv.org)... 151.101.195.42, 151.101.131.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.195.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2215244 (2.1M) [application/pdf]
Saving to: ‘./data/attention_is_all_you_need.pdf’


2023-12-25 10:28:54 (29.9 MB/s) - ‘./data/attention_is_all_you_need.pdf’ saved [2215244/2215244]



## Set `OpenAI` keys.

In [None]:
import nest_asyncio

nest_asyncio.apply()

import os
os.environ['OPENAI_API_KEY'] = 'YOUR OPENAI API KEY'
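
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. As one possible alternative, the sketch below reads the key from the environment and only prompts when it is missing. `load_openai_key` is a hypothetical helper written for this illustration, not part of any library.

```python
import os
from getpass import getpass


def load_openai_key(env=os.environ, prompt=getpass):
    """Return the OpenAI key from `env`, prompting only if it is missing.

    `load_openai_key` is a hypothetical helper for this notebook.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    # Treat the placeholder string from the cell above as "not set".
    if not key or key == "YOUR OPENAI API KEY":
        key = prompt("OpenAI API key: ").strip()
        env["OPENAI_API_KEY"] = key
    return key
```

Calling `load_openai_key()` once before creating any LLM object would leave `OPENAI_API_KEY` set for the rest of the session.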

## Load Data.

We will use only the first 10 pages of the paper and skip the references.

In [None]:
from llama_index import SimpleDirectoryReader

# the PDF reader returns one Document per page
data = SimpleDirectoryReader('./data/').load_data()

# keep the first 10 pages
documents = data[:10]

## Generate Evaluation dataset using `RagDatasetGenerator` and `GPT-4`

In [None]:
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.llama_dataset.generator import RagDatasetGenerator

gpt4 = OpenAI(model='gpt-4', temperature=0.1)
service_context_gpt4 = ServiceContext.from_defaults(llm=gpt4)

dataset_generator = RagDatasetGenerator.from_documents(
    documents,
    service_context=service_context_gpt4,
    num_questions_per_chunk=2,
    show_progress=True,
)

eval_dataset = dataset_generator.generate_dataset_from_nodes()

Parsing nodes:   0%|          | 0/10 [00:00<?, ?it/s]

100%|██████████| 10/10 [00:09<00:00,  1.09it/s]
100%|██████████| 2/2 [00:06<00:00,  3.20s/it]
100%|██████████| 2/2 [00:06<00:00,  3.21s/it]
100%|██████████| 2/2 [00:12<00:00,  6.35s/it]
100%|██████████| 2/2 [00:15<00:00,  7.94s/it]
100%|██████████| 2/2 [00:23<00:00, 11.55s/it]
100%|██████████| 2/2 [00:08<00:00,  4.00s/it]
100%|██████████| 2/2 [00:10<00:00,  5.04s/it]
100%|██████████| 2/2 [00:10<00:00,  5.02s/it]
100%|██████████| 2/2 [00:13<00:00,  6.89s/it]
100%|██████████| 2/2 [00:04<00:00,  2.45s/it]


## Download `RagEvaluatorPack` for evaluation.

In [None]:
from llama_index.llama_pack import download_llama_pack

RagEvaluatorPack = download_llama_pack(
  "RagEvaluatorPack", "./rag_evaluator_pack"
)

## Define LLM.

In [None]:
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

## Define Embedding Model

In [None]:
embed_model = "local:BAAI/bge-small-en-v1.5"

## Build RAG pipeline with `SentenceWindow`

In [None]:
from llama_index.node_parser import SentenceWindowNodeParser

# create the sentence window node parser, keeping 1 sentence of context on each side
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=1,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
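
To make the sentence window idea concrete, here is a plain-Python sketch of what the parser produces conceptually: every sentence becomes its own node, and the neighbouring sentences are stored alongside it as metadata. This is an illustration only. The regex splitter and the `sentence_windows` helper are assumptions made for this sketch and are not the library's implementation.

```python
import re


def sentence_windows(text, window_size=1):
    """Pair each sentence with `window_size` sentences on either side.

    Illustration only: the naive regex splitter is an assumption and is
    not the sentence tokenizer the library uses.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                "original_text": sentence,  # the sentence that gets embedded
                "window": " ".join(sentences[start:end]),  # context for the LLM
            }
        )
    return nodes


nodes = sentence_windows("A one. B two. C three. D four.", window_size=1)
for node in nodes:
    print(node)
```

Only the single sentence is embedded, which keeps retrieval precise, while the wider window is what the LLM eventually reads.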

In [None]:
from llama_index import ServiceContext

sentence_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    node_parser=node_parser,
)

In [None]:
from llama_index import Document

document = Document(text="\n\n".join([doc.text for doc in documents]))

In [None]:
from llama_index import VectorStoreIndex

sentence_index = VectorStoreIndex.from_documents(
    [document], service_context=sentence_context
)

In [None]:
from llama_index.indices.postprocessor import SentenceTransformerRerank

rerank = SentenceTransformerRerank(
    top_n=2, model="BAAI/bge-reranker-base"
)

In [None]:
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor

postproc = MetadataReplacementPostProcessor(
    target_metadata_key="window"
)

In [None]:
query_engine = sentence_index.as_query_engine(
    similarity_top_k=2, node_postprocessors=[postproc, rerank]
)
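
The query engine above chains three steps: retrieve `similarity_top_k` sentences, replace each with its stored window, then rerank and keep `top_n`. The sketch below mimics that order with a toy word-overlap score standing in for both the embedding similarity and the cross-encoder reranker. The scoring function, the `query_pipeline` helper, and the toy nodes are all assumptions made for illustration.

```python
def overlap(text, query):
    """Toy relevance score: number of distinct query words found in the text."""
    return len(set(text.lower().split()) & set(query.lower().split()))


def query_pipeline(nodes, query, similarity_top_k, rerank_top_n):
    # 1. Retrieval: rank nodes on the single sentence that was embedded.
    retrieved = sorted(
        nodes, key=lambda n: overlap(n["original_text"], query), reverse=True
    )[:similarity_top_k]
    # 2. Metadata replacement: swap each sentence for its wider window.
    windows = [n["window"] for n in retrieved]
    # 3. Reranking: re-score the windows and keep the best `rerank_top_n`.
    return sorted(windows, key=lambda w: overlap(w, query), reverse=True)[
        :rerank_top_n
    ]


# Made-up toy nodes for the illustration, not text extracted by the parser.
toy_nodes = [
    {
        "original_text": "the encoder has six layers",
        "window": "the encoder has six layers each layer uses self attention",
    },
    {
        "original_text": "the decoder has six layers",
        "window": "the decoder has six layers it masks future positions",
    },
    {
        "original_text": "we trained on eight gpus",
        "window": "we trained on eight gpus for twelve hours",
    },
]

top = query_pipeline(
    toy_nodes, "encoder self attention layers", similarity_top_k=2, rerank_top_n=1
)
print(top)
```

Because the window replacement runs before the reranker, the reranker scores the expanded context rather than the single sentence.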

In [None]:
response = query_engine.query('is the paper from google research?')
print(response)

Yes.


## Evaluate RAG pipeline

In [None]:
rag_evaluator_pack = RagEvaluatorPack(
    rag_dataset=eval_dataset,
    query_engine=query_engine
)

base_benchmark = await rag_evaluator_pack.arun(
    batch_size=10,  # batches the number of openai api calls to make
    sleep_time_in_seconds=1,  # seconds to sleep before making an api call
)


Batch processing of predictions: 100%|██████████| 10/10 [00:18<00:00,  1.90s/it]
Batch processing of predictions: 100%|██████████| 10/10 [00:15<00:00,  1.59s/it]
Batch processing of evaluations:  87%|████████▋ | 10/11.5 [01:23<00:12,  8.34s/it]
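
The `batch_size` and `sleep_time_in_seconds` arguments throttle how many API calls are in flight at once. The sketch below shows the general pattern with plain `asyncio`. It is not the pack's actual implementation, and `run_in_batches` and `fake_api_call` are hypothetical names used only here.

```python
import asyncio


async def run_in_batches(items, worker, batch_size=10, sleep_time_in_seconds=1.0):
    """Run `worker` over `items` in concurrent batches, pausing before each batch."""
    results = []
    for start in range(0, len(items), batch_size):
        # Throttle: wait before sending the next batch of requests.
        await asyncio.sleep(sleep_time_in_seconds)
        batch = items[start:start + batch_size]
        # Requests inside one batch run concurrently; order is preserved.
        results.extend(await asyncio.gather(*(worker(item) for item in batch)))
    return results


async def fake_api_call(question):
    # Hypothetical stand-in for a single LLM request.
    await asyncio.sleep(0)
    return f"answer to {question}"
```

In a notebook this would be used as `await run_in_batches(questions, fake_api_call)`. With 20 questions and `batch_size=10`, the loop runs two batches, which would be consistent with the two prediction progress bars shown above if the dataset holds 2 questions for each of the 10 pages.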


In [None]:
base_benchmark

| metrics | base_rag |
|---|---|
| mean_correctness_score | 3.3 |
| mean_relevancy_score | 0.8 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.910959 |


## Create functions to build the RAG pipeline and evaluate it.

Wrapping these steps in functions makes it easier to iterate over parameters during evaluation.

In [None]:
def build_index(
    documents,
    llm=None,
    embed_model="local:BAAI/bge-small-en-v1.5",
    sentence_window_size=3,
):
    # create the LLM here rather than as a default argument, which would be
    # instantiated once at function-definition time
    if llm is None:
        llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

    # sentence window node parser: each node stores `sentence_window_size`
    # sentences of context on either side of the embedded sentence
    node_parser = SentenceWindowNodeParser.from_defaults(
        window_size=sentence_window_size,
        window_metadata_key="window",
        original_text_metadata_key="original_text",
    )
    sentence_context = ServiceContext.from_defaults(
        llm=llm,
        embed_model=embed_model,
        node_parser=node_parser,
    )
    sentence_index = VectorStoreIndex.from_documents(
        documents, service_context=sentence_context
    )

    return sentence_index


def setup_query_engine(
    sentence_index, similarity_top_k=2, rerank_top_n=2, is_rerank=False
):
    # always replace each retrieved sentence with its surrounding window
    node_postprocessors = [
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ]
    if is_rerank:
        # optionally rerank the windows and keep the best `rerank_top_n`
        node_postprocessors.append(
            SentenceTransformerRerank(
                top_n=rerank_top_n, model="BAAI/bge-reranker-base"
            )
        )
    return sentence_index.as_query_engine(
        similarity_top_k=similarity_top_k,
        node_postprocessors=node_postprocessors,
    )


async def evaluate(query_engine):
    # uses the global `eval_dataset` generated earlier in the notebook
    rag_evaluator_pack = RagEvaluatorPack(
        rag_dataset=eval_dataset,
        query_engine=query_engine,
    )

    benchmark_df = await rag_evaluator_pack.arun(
        batch_size=10,  # batches the number of openai api calls to make
        sleep_time_in_seconds=1,  # seconds to sleep before making an api call
    )

    return benchmark_df


def build_index_query_engine(
    window_size, similarity_top_k, rerank_similarity_top_k, rerank_top_k
):
    # uses the global `document` (the first 10 pages joined into one Document)
    sentence_index = build_index(
        [document],
        sentence_window_size=window_size,
    )

    # query engine without reranking
    query_engine = setup_query_engine(
        sentence_index, similarity_top_k=similarity_top_k
    )

    # query engine that retrieves `rerank_similarity_top_k` nodes and keeps
    # the best `rerank_top_k` after reranking
    query_engine_rerank = setup_query_engine(
        sentence_index,
        similarity_top_k=rerank_similarity_top_k,
        rerank_top_n=rerank_top_k,
        is_rerank=True,
    )
    return sentence_index, query_engine, query_engine_rerank

In [None]:
index, query_engine, query_engine_rerank = build_index_query_engine(1, 2, 4, 2)
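
The call above passes four positional integers, which are easy to mix up when the experiments change them one at a time. One option is to name them in a small dataclass and unpack it at the call site. `PipelineConfig` is a hypothetical container introduced for this sketch; its field names simply mirror the parameters of `build_index_query_engine`.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical container naming the four positional arguments."""

    window_size: int
    similarity_top_k: int
    rerank_similarity_top_k: int
    rerank_top_k: int


baseline = PipelineConfig(
    window_size=1,
    similarity_top_k=2,
    rerank_similarity_top_k=4,
    rerank_top_k=2,
)
print(astuple(baseline))
```

The tuple can then be unpacked into the existing function, for example `build_index_query_engine(*astuple(baseline))`.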

In [None]:
base_benchmark = await evaluate(query_engine)

Batch processing of predictions: 100%|██████████| 10/10 [00:07<00:00,  1.26it/s]
Batch processing of predictions: 100%|██████████| 10/10 [00:04<00:00,  2.09it/s]
Batch processing of evaluations:  87%|████████▋ | 10/11.5 [01:10<00:10,  7.00s/it]


In [None]:
base_benchmark_rerank = await evaluate(query_engine_rerank)

Batch processing of predictions: 100%|██████████| 10/10 [00:31<00:00,  3.11s/it]
Batch processing of predictions: 100%|██████████| 10/10 [00:28<00:00,  2.86s/it]
Batch processing of evaluations:  87%|████████▋ | 10/11.5 [01:10<00:10,  7.05s/it]


In [None]:
base_benchmark

| metrics | base_rag |
|---|---|
| mean_correctness_score | 3.275 |
| mean_relevancy_score | 0.8 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.907221 |


In [None]:
base_benchmark_rerank

| metrics | base_rag |
|---|---|
| mean_correctness_score | 3.15 |
| mean_relevancy_score | 0.85 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.91542 |


The `correctness` score is low: 3.275 out of a maximum of 5 without the reranker. Adding the reranker improved `relevancy` (from 0.8 to 0.85) and `context_similarity` (from 0.907 to 0.915), but it lowered `correctness` from 3.275 to 3.15.

Interestingly, the `faithfulness` score is 1.0 in both runs, which suggests the evaluator found no hallucinations.
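
To compare two benchmark tables metric by metric, the differences can be computed directly. The sketch below uses the numbers reported in the two tables above. `metric_deltas` is a hypothetical helper, and the values are typed in by hand rather than read from the DataFrames.

```python
# Values copied from the two benchmark tables above.
base = {
    "correctness": 3.275,
    "relevancy": 0.8,
    "faithfulness": 1.0,
    "context_similarity": 0.907221,
}
with_rerank = {
    "correctness": 3.15,
    "relevancy": 0.85,
    "faithfulness": 1.0,
    "context_similarity": 0.91542,
}


def metric_deltas(before, after):
    """Return after - before for every metric, rounded to 6 decimals."""
    return {name: round(after[name] - before[name], 6) for name in before}


deltas = metric_deltas(base, with_rerank)
for name, delta in deltas.items():
    print(f"{name:>20}: {delta:+.3f}")
```

A negative value means the reranker made that metric worse, a positive value means it helped.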

## Tune parameters to make it production ready.

Let's aim for a `correctness` score of `4.5` and a `relevancy` score above `0.9`.
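
The target can be written down as a small check so that every experiment is judged the same way. This is a minimal sketch: `meets_target` is a hypothetical helper, and it assumes the benchmark has been converted to a plain dictionary keyed by metric name.

```python
def meets_target(metrics, min_correctness=4.5, min_relevancy=0.9):
    """True when correctness reaches the target and relevancy is strictly above it."""
    return (
        metrics["mean_correctness_score"] >= min_correctness
        and metrics["mean_relevancy_score"] > min_relevancy
    )


# Baseline numbers (without reranker) from the table above.
baseline_metrics = {
    "mean_correctness_score": 3.275,
    "mean_relevancy_score": 0.8,
}
print(meets_target(baseline_metrics))
```

Note that a `relevancy` of exactly `0.9` does not pass, since the goal is a score above `0.9`.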

### Experiment 1:

Let's increase the sentence window size from 1 to 3, which gives the LLM more surrounding context, and see whether `correctness` improves.

In [None]:
index, query_engine, query_engine_rerank = build_index_query_engine(3, 2, 4, 2)

In [None]:
benchmark = await evaluate(query_engine)

Batch processing of predictions: 100%|██████████| 10/10 [00:08<00:00,  1.15it/s]
Batch processing of predictions: 100%|██████████| 10/10 [00:04<00:00,  2.23it/s]
Batch processing of evaluations:  87%|████████▋ | 10/11.5 [01:29<00:13,  8.96s/it]


In [None]:
benchmark_rerank = await evaluate(query_engine_rerank)

Batch processing of predictions: 100%|██████████| 10/10 [00:56<00:00,  5.67s/it]
Batch processing of predictions: 100%|██████████| 10/10 [00:54<00:00,  5.50s/it]
Batch processing of evaluations:  87%|████████▋ | 10/11.5 [01:28<00:13,  8.88s/it]


In [None]:
benchmark

| metrics | base_rag |
|---|---|
| mean_correctness_score | 3.775 |
| mean_relevancy_score | 0.95 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.917831 |


In [None]:
benchmark_rerank

| metrics | base_rag |
|---|---|
| mean_correctness_score | 3.875 |
| mean_relevancy_score | 0.95 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.928741 |


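The metrics did improve. Without the reranker, `correctness` rose from 3.275 to 3.775 and `relevancy` from 0.8 to 0.95. With the reranker, `correctness` rose from 3.15 to 3.875 and `relevancy` from 0.85 to 0.95. `faithfulness` stayed at 1.0.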

### Experiment 2:

Let's increase the `similarity_top_k` and `rerank_top_n` values and see if retrieving more relevant contexts improves the result.

In [None]:
index, query_engine, query_engine_rerank = build_index_query_engine(3, 4, 8, 4)

In [None]:
benchmark = await evaluate(query_engine)

Batch processing of predictions: 100%|██████████| 10/10 [00:07<00:00,  1.34it/s]
Batch processing of predictions: 100%|██████████| 10/10 [00:05<00:00,  1.80it/s]
Batch processing of evaluations:  87%|████████▋ | 10/11.5 [01:33<00:13,  9.31s/it]


In [None]:
benchmark_rerank = await evaluate(query_engine_rerank)

Batch processing of predictions: 100%|██████████| 10/10 [01:57<00:00, 11.77s/it]
Batch processing of predictions: 100%|██████████| 10/10 [02:17<00:00, 13.78s/it]
Batch processing of evaluations:  87%|████████▋ | 10/11.5 [01:21<00:12,  8.10s/it]


In [None]:
benchmark

| metrics | base_rag |
|---|---|
| mean_correctness_score | 4.1 |
| mean_relevancy_score | 0.95 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.924449 |


In [None]:
benchmark_rerank

| metrics | base_rag |
|---|---|
| mean_correctness_score | 4.275 |
| mean_relevancy_score | 0.9 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.934464 |


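We are close to our goal. With the reranker, we reached a `correctness` score of `4.275`, although `relevancy` dropped from `0.95` to `0.9`, which is not yet above the `0.9` target. The `context_similarity` score also improved, from 0.929 to 0.934.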

### Experiment 3:

Let's now increase the window size from 3 to 5 and see if it improves the metrics.

In [None]:
index, query_engine, query_engine_rerank = build_index_query_engine(5, 4, 8, 4)

In [None]:
benchmark = await evaluate(query_engine)

Batch processing of predictions: 100%|██████████| 10/10 [00:07<00:00,  1.43it/s]
Batch processing of predictions: 100%|██████████| 10/10 [00:07<00:00,  1.36it/s]
Batch processing of evaluations:  87%|████████▋ | 10/11.5 [01:16<00:11,  7.67s/it]


In [None]:
benchmark_rerank = await evaluate(query_engine_rerank)

Batch processing of predictions: 100%|██████████| 10/10 [03:03<00:00, 18.37s/it]
Batch processing of predictions: 100%|██████████| 10/10 [03:14<00:00, 19.49s/it]
Batch processing of evaluations:  87%|████████▋ | 10/11.5 [01:22<00:12,  8.26s/it]


In [None]:
benchmark

| metrics | base_rag |
|---|---|
| mean_correctness_score | 4.25 |
| mean_relevancy_score | 1.0 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.926042 |


In [None]:
benchmark_rerank

| metrics | base_rag |
|---|---|
| mean_correctness_score | 4.25 |
| mean_relevancy_score | 1.0 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.933203 |


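We have reached a `relevancy` score of `1.0`, with and without the reranker. `correctness` is `4.25` in both cases, still short of the `4.5` target.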

### Experiment 4:

Let's increase the `similarity_top_k` and `rerank_top_n` values again and see if retrieving more relevant contexts improves the result.

In [None]:
index, query_engine, query_engine_rerank = build_index_query_engine(5, 6, 12, 6)

In [None]:
benchmark_rerank = await evaluate(query_engine_rerank)

Batch processing of predictions: 100%|██████████| 10/10 [05:00<00:00, 30.01s/it]
Batch processing of predictions: 100%|██████████| 10/10 [05:02<00:00, 30.20s/it]
Batch processing of evaluations:  87%|████████▋ | 10/11.5 [01:20<00:12,  8.09s/it]


In [None]:
benchmark_rerank

| metrics | base_rag |
|---|---|
| mean_correctness_score | 4.5 |
| mean_relevancy_score | 1.0 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.93266 |


We have reached our goal: a `correctness` score of `4.5` and a `relevancy` score of `1.0` (above the `0.9` target).
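
For a quick overview, the reranker results from all runs can be collected in one place and ranked. The sketch below uses the numbers reported in the tables of this notebook, typed in by hand. The tuple keys follow the argument order `(window_size, similarity_top_k, rerank_similarity_top_k, rerank_top_k)`, and the selection rule (highest correctness, then highest relevancy) is an assumption made for this illustration.

```python
# (window_size, similarity_top_k, rerank_similarity_top_k, rerank_top_k)
#   -> (mean_correctness_score, mean_relevancy_score) with the reranker
rerank_results = {
    (1, 2, 4, 2): (3.15, 0.85),
    (3, 2, 4, 2): (3.875, 0.95),
    (3, 4, 8, 4): (4.275, 0.9),
    (5, 4, 8, 4): (4.25, 1.0),
    (5, 6, 12, 6): (4.5, 1.0),
}

# Rank by correctness first, then by relevancy as a tie-breaker.
best_config, best_scores = max(rerank_results.items(), key=lambda item: item[1])

for config, (correctness, relevancy) in rerank_results.items():
    marker = "  <- best" if config == best_config else ""
    print(f"{config}: correctness={correctness}, relevancy={relevancy}{marker}")
```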

## Observation:

In this notebook, we built a RAG pipeline, generated an evaluation dataset, and tuned several parameters (the sentence window size, `similarity_top_k`, and the reranker's `top_n`) to make the pipeline production ready. The reranker improved the metrics in most of the experiments.

Remember that there are other parameters to experiment with, such as `chunk_size`, `chunk_overlap`, the embedding model, and the LLM.
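
Further experiments can be organized as a grid instead of hand-edited calls. The sketch below generates argument tuples for `build_index_query_engine`, following the pattern used in the experiments above, where the reranking engine retrieves twice as many candidates as it keeps. `parameter_grid` is a hypothetical helper, and the doubling rule is an assumption read off the values used in this notebook.

```python
from itertools import product


def parameter_grid(window_sizes, top_ks):
    """Yield (window_size, similarity_top_k, rerank_similarity_top_k, rerank_top_k)."""
    for window_size, top_k in product(window_sizes, top_ks):
        # Assumed pattern from the experiments: the reranker sees 2 * top_k
        # candidates and keeps top_k of them.
        yield (window_size, top_k, 2 * top_k, top_k)


grid = list(parameter_grid(window_sizes=[1, 3, 5], top_ks=[2, 4, 6]))
for args in grid:
    print(args)
```

Each tuple could be passed as `build_index_query_engine(*args)` and evaluated in a loop. The same idea extends to the other parameters by adding more lists to `product`, keeping in mind that every extra combination costs another full evaluation run.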