# Table of contents
- [1. Basic RAG](#1)
- [2. Advanced RAG](#2)
    - [2.1 Sentence Window Retrieval](#2.1)
    - [2.2 Auto-merging Retrieval](#2.2)
- [3. Evaluate RAG Systems](#3)
    - [3.1 Metrics](#3.1)
    - [3.2 Evaluate two RAG Systems using Auto-merging Retrieval](#3.2)
- [References](#4)

In [None]:
# Install packages
!pip install llama-index openai tiktoken pypdf nest-asyncio sentence-transformers
!pip install -qU trulens_eval pydantic


<a name='1' ></a>
# 1. Basic RAG

![](https://i.imgur.com/6SMPJLD.png)

Vanilla Retrieval-Augmented Generation (RAG), also known as Naive RAG or Basic RAG, follows a straightforward process. First, you split your texts into chunks. These chunks are then embedded into vectors with a Transformer encoder model, and all the vectors are stored in an index. Finally, you construct a prompt for a Large Language Model (LLM), instructing it to answer the user's query based on the context found during the search phase.

During runtime, the user's query is vectorized using the same Encoder model. The query vector is then compared against the index through a search operation, identifying the top-k results. The corresponding text chunks are retrieved from the database and provided to the LLM prompt as context for generating a response.
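The flow described above can be sketched without any external service. This is only an illustration under loud assumptions: `embed` is a toy bag-of-words stand-in for a real encoder model, the "index" is a plain Python list, and the function names and sample chunks are hypothetical and not part of LlamaIndex.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an encoder model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(chunks):
    """The index is just a list of (chunk_text, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, top_k=2):
    """Embed the query with the same encoder and return the top-k chunks."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(query, context_chunks):
    """Place the retrieved chunks into the LLM prompt as context."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer the question using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical sample chunks, for illustration only.
chunks = [
    "Learning foundational technical skills is the first step.",
    "Working on projects deepens your skills and builds a portfolio.",
    "The weather in spring is often unpredictable.",
]
index = build_index(chunks)
top_chunks = retrieve(index, "Which projects build skills?", top_k=2)
prompt = build_prompt("Which projects build skills?", top_chunks)
print(prompt)
```

In the real pipeline below, the encoder is `OpenAIEmbedding`, the index is a `VectorStoreIndex`, and the prompt is assembled by the query engine.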

In [None]:
import os
from llama_index import ServiceContext, OpenAIEmbedding
from llama_index.llms import OpenAI
from llama_index.node_parser import SimpleNodeParser
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index import set_global_service_context

os.environ['OPENAI_API_KEY'] = "YOUR_OPENAI_API_KEY"

In [None]:
documents = SimpleDirectoryReader(
    input_files=["./eBook-How-to-Build-a-Career-in-AI.pdf"]
).load_data()

print(type(documents), "\n")
print(len(documents), "\n")
print(type(documents[0]))
print(documents[0])

<class 'list'> 

41 

<class 'llama_index.schema.Document'>
Doc ID: 6aa5af2a-93a5-4775-85ed-abaaff4e58f8
Text: PAGE 1Founder, DeepLearning.AICollected Insights from Andrew Ng
How to  Build Your Career in AIA Simple Guide


In [None]:
# Embeddings
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
embed_model = OpenAIEmbedding()

service_context = ServiceContext.from_defaults(
  llm=llm,
  embed_model=embed_model,
)

# Vector Database
index = VectorStoreIndex.from_documents(
    documents,
    service_context=service_context
)

# Query
query_engine = index.as_query_engine(service_context=service_context)
response = query_engine.query("What are the keys to building a career in AI?")
print(response)

The keys to building a career in AI include job search, personal discipline, and altruism.


<a name='2' ></a>
# 2. Advanced RAG


Prompt engineering is a cost-effective way to improve your RAG pipeline. However, because documents are split into chunks, the pivotal part of the pipeline is the search index: if retrieval fails to return the appropriate chunk, no amount of prompt tuning will improve the output quality.

One viable solution is context enrichment: retrieve smaller chunks for better search quality, then add the surrounding context so the LLM has enough to reason over. Two options are available for achieving this:
- **Sentence Window Retrieval:** expands the context with the sentences around the smaller retrieved chunk
- **Auto-merging Retrieval:** splits documents recursively into larger parent chunks that contain smaller child chunks.

In [None]:
from llama_index import Document

# Merge document from pages
document = Document(text="\n\n".join([doc.text for doc in documents]))

<a name='2.1' ></a>
## 2.1 Sentence Window Retrieval
In this scheme, each sentence in a document is embedded separately, which makes the cosine-similarity search between the query and the context more precise.
To reason better over the retrieved context, after fetching the most relevant single sentence we extend the context window by k sentences before and after it, and then send this extended context to the LLM.

![](https://i.imgur.com/HWdHXMc.png)

The green part is the sentence whose embedding was matched during the index search, and the whole black + green paragraph is fed to the LLM to enlarge its context while it reasons about the query.
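The windowing idea can be sketched in a few lines of plain Python. This is an illustration, not the LlamaIndex implementation: the naive regex sentence splitter and the helper names are assumptions, and only the metadata keys `original_text` and `window` mirror the ones used in the code cell that follows.

```python
import re


def split_sentences(text):
    """Naive splitter on '.', '!' or '?' followed by whitespace (an assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_window_nodes(sentences, window_size=1):
    """Pair each sentence with the window of its neighbouring sentences."""
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                "original_text": sentence,  # what gets embedded and searched
                "window": " ".join(sentences[start:end]),  # what the LLM sees
            }
        )
    return nodes


# Hypothetical sample text, for illustration only.
text = (
    "Learn the basics first. Then work on projects. "
    "Projects lead to a job. A community supports every step."
)
sentences = split_sentences(text)
nodes = build_window_nodes(sentences, window_size=1)

# Suppose the search matched the second sentence: the LLM receives its window.
hit = nodes[1]
print(hit["original_text"])
print(hit["window"])
```

The search runs on the single sentence, while the wider `window` text replaces it before the prompt is built, which is the role `MetadataReplacementPostProcessor` plays below.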



In [None]:
import os
from llama_index import ServiceContext, VectorStoreIndex, StorageContext
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor
from llama_index.indices.postprocessor import SentenceTransformerRerank


def build_sentence_window_index(
    documents,
    llm,
    embed_model,
    sentence_window_size=3,
):
    # create the sentence window node parser w/ default settings
    node_parser = SentenceWindowNodeParser.from_defaults(
        window_size=sentence_window_size,
        window_metadata_key="window",
        original_text_metadata_key="original_text",
    )

    sentence_context = ServiceContext.from_defaults(
        llm=llm,
        embed_model=embed_model,
        node_parser=node_parser,
    )

    sentence_index = VectorStoreIndex.from_documents(
        documents, service_context=sentence_context
    )

    return sentence_index



def get_sentence_window_query_engine(
    sentence_index, similarity_top_k=6, rerank_top_n=2
):
    # define postprocessors
    postproc = MetadataReplacementPostProcessor(target_metadata_key="window")
    rerank = SentenceTransformerRerank(
        top_n=rerank_top_n, model="BAAI/bge-reranker-base"
    )

    sentence_window_engine = sentence_index.as_query_engine(
        similarity_top_k=similarity_top_k, node_postprocessors=[postproc, rerank]
    )
    return sentence_window_engine

In [None]:
from llama_index.llms import OpenAI

index = build_sentence_window_index(
    [document],
    llm=llm,
    embed_model=embed_model,
)

query_engine = get_sentence_window_query_engine(index, similarity_top_k=6)


In [None]:
response = query_engine.query("What are the keys to building a career in AI?")
print(response)

The keys to building a career in AI are learning foundational technical skills, working on projects, and finding a job, all of which is supported by being part of a community.


<a name='2.2' ></a>
## 2.2 Auto-merging Retrieval (Parent Document Retriever)
The idea here is similar to Sentence Window Retrieval: search over more granular pieces of information, then extend the context window before feeding that context to the LLM for reasoning. Documents are split into smaller child chunks that refer to larger parent chunks.

![](https://i.imgur.com/lMHM45C.png)

Smaller chunks are fetched first during retrieval. Then, if more than n of the top-k retrieved chunks are linked to the same parent node (larger chunk), the context fed to the LLM is replaced by this parent node. This works like automatically merging a few retrieved chunks into a larger parent chunk, hence the method name. Note that search is performed only within the index of child nodes.
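The merging rule can be sketched as follows. This is a simplified illustration, not the `AutoMergingRetriever` source: the function name, the node ids, and the use of a ratio threshold (share of a parent's children that were retrieved) in place of the count n from the text are all assumptions.

```python
from collections import defaultdict


def auto_merge(retrieved_ids, parent_of, children_of, ratio_threshold=0.5):
    """Replace retrieved children by their parent when enough siblings were retrieved."""
    hits_per_parent = defaultdict(list)
    for node_id in retrieved_ids:
        hits_per_parent[parent_of[node_id]].append(node_id)

    merged, seen_parents = [], set()
    for node_id in retrieved_ids:
        parent = parent_of[node_id]
        ratio = len(hits_per_parent[parent]) / len(children_of[parent])
        if ratio > ratio_threshold:
            # Enough children of this parent were retrieved: keep the parent once.
            if parent not in seen_parents:
                merged.append(parent)
                seen_parents.add(parent)
        else:
            # Too few siblings retrieved: keep the child chunk as it is.
            merged.append(node_id)
    return merged


# Hypothetical two-level hierarchy: each parent has four child chunks.
children_of = {"P1": ["c1", "c2", "c3", "c4"], "P2": ["c5", "c6", "c7", "c8"]}
parent_of = {child: parent for parent, kids in children_of.items() for child in kids}

# Three of P1's four children were retrieved, but only one of P2's.
result = auto_merge(["c1", "c5", "c2", "c3"], parent_of, children_of)
print(result)  # ['P1', 'c5']
```

Only the child ids are searched; the parent enters the context solely through the merge step.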

In [None]:
import os

from llama_index import ServiceContext, VectorStoreIndex, StorageContext
from llama_index.node_parser import HierarchicalNodeParser
from llama_index.node_parser import get_leaf_nodes
from llama_index.retrievers import AutoMergingRetriever
from llama_index.indices.postprocessor import SentenceTransformerRerank
from llama_index.query_engine import RetrieverQueryEngine


def build_automerging_index(
    documents,
    llm,
    embed_model,
    chunk_sizes=None,
):
    chunk_sizes = chunk_sizes or [2048, 512, 128]
    node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=chunk_sizes)
    nodes = node_parser.get_nodes_from_documents(documents)
    leaf_nodes = get_leaf_nodes(nodes)
    merging_context = ServiceContext.from_defaults(
        llm=llm,
        embed_model=embed_model,
    )
    storage_context = StorageContext.from_defaults()
    storage_context.docstore.add_documents(nodes)

    automerging_index = VectorStoreIndex(
        leaf_nodes, storage_context=storage_context, service_context=merging_context
    )

    return automerging_index


def get_automerging_query_engine(
    automerging_index,
    similarity_top_k=12,
    rerank_top_n=6,
):
    base_retriever = automerging_index.as_retriever(similarity_top_k=similarity_top_k)
    retriever = AutoMergingRetriever(
        base_retriever, automerging_index.storage_context, verbose=True
    )
    rerank = SentenceTransformerRerank(
        top_n=rerank_top_n, model="BAAI/bge-reranker-base"
    )
    auto_merging_engine = RetrieverQueryEngine.from_args(
        retriever, node_postprocessors=[rerank]
    )

    return auto_merging_engine

In [None]:
from llama_index.llms import OpenAI

index = build_automerging_index(
    [document],
    llm=llm,
    embed_model=embed_model
)

query_engine = get_automerging_query_engine(index, similarity_top_k=6)

In [None]:
response = query_engine.query("What are the keys to building a career in AI?")
print(response)

The keys to building a career in AI include learning foundational technical skills, working on projects, finding a job, and being part of a community. Networking is also an important step in building a career in AI.


<a name='3' ></a>
# 3. Evaluate RAG Systems



<a name='3.1' ></a>
## 3.1 Metrics
The RAG Triad - Context Relevance, Groundedness, and Answer Relevance - is a key framework for evaluating the quality of responses generated by Retrieval Augmented Generation (RAG) systems. These three measures assess the different aspects of how well the LLM's response aligns with the provided context and query.

- Context Relevance:
    - This measures how relevant the retrieved information is to the original query.
    - Evaluation involves assessing if the retrieved passages actually address the question at hand and don't introduce irrelevant tangents.

- Groundedness:
    - This assesses how well the generated response is supported by the retrieved information.
    - It checks if the response accurately reflects the facts and claims found in the retrieved passages and doesn't invent or exaggerate facts.
- Answer Relevance:
    - This evaluates how well the generated response actually answers the query in an informative and helpful way.
    - It goes beyond just factual accuracy and considers the overall usefulness and completeness of the response for the user's intent.

By examining these three aspects, the RAG Triad provides a comprehensive picture of how well the LLM's response combines retrieved information with the original query context. This helps developers identify weaknesses in their RAG system and work towards generating more accurate, relevant, and helpful responses.
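As a rough sketch of how the three measures fit together for a single query, the snippet below averages per-chunk relevance scores into one Context Relevance value (mirroring the mean aggregation used in the recorder later in this notebook) and reports the weakest dimension. The `TriadScores` class, the `score_record` helper, and the numbers are hypothetical; in practice each score comes from an LLM-based judge, not from hand-picked values.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TriadScores:
    context_relevance: float
    groundedness: float
    answer_relevance: float

    def weakest(self):
        """Name of the lowest-scoring measure, a hint at where the pipeline is weak."""
        scores = vars(self)
        return min(scores, key=scores.get)


def score_record(chunk_relevance_scores, groundedness, answer_relevance):
    """Combine per-chunk relevance scores with the two response-level scores."""
    return TriadScores(
        context_relevance=mean(chunk_relevance_scores),
        groundedness=groundedness,
        answer_relevance=answer_relevance,
    )


# Hand-picked illustrative scores in [0, 1]; not produced by a real evaluator.
record = score_record([0.2, 0.4, 0.9], groundedness=1.0, answer_relevance=0.9)
print(record)
print("Weakest measure:", record.weakest())
```

A low Context Relevance with high Groundedness, as in this made-up record, would point at the retriever and not the generator as the part to improve.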

In the following, we use [`TruLens`](https://github.com/truera/trulens), a package that provides a set of tools for evaluating LLMs and LLM-based applications, to evaluate a RAG system.

In [None]:
%%writefile utils.py
import os
import numpy as np
import nest_asyncio

nest_asyncio.apply()

from trulens_eval import (
    Feedback,
    TruLlama,
    OpenAI
)

from trulens_eval.feedback import Groundedness

def get_prebuilt_trulens_recorder(query_engine, app_id):
    openai = OpenAI()

    qa_relevance = (
        Feedback(openai.relevance_with_cot_reasons, name="Answer Relevance")
        .on_input_output()
    )

    qs_relevance = (
        Feedback(openai.relevance_with_cot_reasons, name="Context Relevance")
        .on_input()
        .on(TruLlama.select_source_nodes().node.text)
        .aggregate(np.mean)
    )

#     grounded = Groundedness(groundedness_provider=openai, summarize_provider=openai)
    grounded = Groundedness(groundedness_provider=openai)

    groundedness = (
        Feedback(grounded.groundedness_measure_with_cot_reasons, name="Groundedness")
            .on(TruLlama.select_source_nodes().node.text)
            .on_output()
            .aggregate(grounded.grounded_statements_aggregator)
    )

    feedbacks = [qa_relevance, qs_relevance, groundedness]
    tru_recorder = TruLlama(
        query_engine,
        app_id=app_id,
        feedbacks=feedbacks
    )

    return tru_recorder

Writing utils.py



To assess a Retrieval-Augmented Generation (RAG) system, you need a set of evaluation questions. You can either generate them automatically with LlamaIndex, using the code below:
```python
# Generate evaluation questions from the document
import nest_asyncio
from llama_index.evaluation import DatasetGenerator

nest_asyncio.apply()

data_generator = DatasetGenerator.from_documents(documents)
eval_questions = data_generator.generate_questions_from_nodes()
eval_questions
```

or write them manually, as in the cell below.

In [None]:
%%writefile generated_questions.txt
In the context of job switching, what advantages does working in a startup offer over a big company, particularly for someone looking to transition into an AI role?
Explain the scenario planning exercise mentioned in the context of the Covid-19 pandemic. How can this method be applied to personal life situations such as job hunting or exam preparation?
In the context of project selection and execution, explain the difference between the "Ready, Aim, Fire" and "Ready, Fire, Aim" approaches. Provide examples where each approach might be more beneficial.
Discuss the importance of working on projects that complement your career goals. How can these projects serve as stepping stones to larger projects and contribute to technical growth and skill progression?
According to the text, what are some of the key areas of mathematics that are relevant to machine learning and why are they important in the field of AI?

Writing generated_questions.txt


In [None]:
def run_evals(eval_questions, tru_recorder, query_engine):
    for question in eval_questions:
        with tru_recorder as recording:
            response = query_engine.query(question)


eval_questions = []
with open('generated_questions.txt', 'r') as file:
    for line in file:
        # Strip surrounding whitespace, including the trailing newline
        item = line.strip()
        eval_questions.append(item)

eval_questions

['In the context of job switching, what advantages does working in a startup offer over a big company, particularly for someone looking to transition into an AI role?',
 'Explain the scenario planning exercise mentioned in the context of the Covid-19 pandemic. How can this method be applied to personal life situations such as job hunting or exam preparation?',
 'In the context of project selection and execution, explain the difference between the "Ready, Aim, Fire" and "Ready, Fire, Aim" approaches. Provide examples where each approach might be more beneficial.',
 'Discuss the importance of working on projects that complement your career goals. How can these projects serve as stepping stones to larger projects and contribute to technical growth and skill progression?',
 'According to the text, what are some of the key areas of mathematics that are relevant to machine learning and why are they important in the field of AI?']

In [None]:
from trulens_eval import Tru

Tru().reset_database()

🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this.


<a name='3.2' ></a>
## 3.2 Evaluate two RAG Systems using Auto-merging Retrieval
- Two Layers: `chunk_sizes=[2048, 512]`
- Three Layers: `chunk_sizes=[2048, 512, 128]`
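To make the two configurations concrete, here is a minimal sketch of hierarchical splitting. It is not the `HierarchicalNodeParser` implementation: chunk sizes are measured in characters here purely for illustration (the real parser may count tokens and respect sentence boundaries), and the helper names and node dictionaries are hypothetical.

```python
def split_hierarchically(text, chunk_sizes):
    """Split text into nested chunks, one level per entry in chunk_sizes."""
    nodes = []

    def recurse(segment, level, parent_id):
        size = chunk_sizes[level]
        for start in range(0, len(segment), size):
            piece = segment[start:start + size]
            node_id = len(nodes)
            nodes.append(
                {"id": node_id, "parent": parent_id, "level": level, "text": piece}
            )
            # Split this chunk again if a finer level is configured.
            if level + 1 < len(chunk_sizes):
                recurse(piece, level + 1, node_id)

    recurse(text, 0, None)
    return nodes


def leaf_nodes(nodes, num_levels):
    """Leaves are the chunks at the finest level; only these are indexed for search."""
    return [n for n in nodes if n["level"] == num_levels - 1]


text = "x" * 4096  # placeholder document of 4096 characters
two_layers = split_hierarchically(text, [2048, 512])
three_layers = split_hierarchically(text, [2048, 512, 128])

print(len(two_layers), len(leaf_nodes(two_layers, 2)))      # 10 8
print(len(three_layers), len(leaf_nodes(three_layers, 3)))  # 42 32
```

Adding a third layer produces many more, much smaller leaves to search over, which is why the three-layer run further down merges several nodes into a parent at a time.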

### Two Layers

In [None]:
auto_merging_index_0 = build_automerging_index(
    documents,
    llm=llm,
    embed_model=embed_model,
    chunk_sizes=[2048, 512],
)

auto_merging_engine_0 = get_automerging_query_engine(
    auto_merging_index_0,
    similarity_top_k=12,
    rerank_top_n=6,
)

In [None]:
from utils import get_prebuilt_trulens_recorder

tru_recorder = get_prebuilt_trulens_recorder(
    auto_merging_engine_0,
    app_id ='app_0'
)

✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input response will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input source will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .


In [None]:
run_evals(eval_questions, tru_recorder, auto_merging_engine_0)

> Merging 1 nodes into parent node.
> Parent node id: e375a8f6-b4ec-4c3e-9bae-6f64792c963c.
> Parent node text: PAGE 26If you’re considering a role switch, a startup can be an easier place to do it than a big ...

> Merging 1 nodes into parent node.
> Parent node id: 3af777c1-781a-4a0e-824a-13fa54e8517f.
> Parent node text: PAGE 25Finding a job has a few predictable steps that include selecting the companies to which yo...

> Merging 1 nodes into parent node.
> Parent node id: a34f551b-d6a4-4d4d-bfd3-5615577a1cca.
> Parent node text: PAGE 33Choose who to work with. It’s tempting to take a position because of the projects you’ll w...

> Merging 1 nodes into parent node.
> Parent node id: 468ef6c5-748e-4b1b-ad14-80eef2889266.
> Parent node text: PAGE 19Develop a side hustle. Even if you have a full-time job, a fun project that may or may not...

> Merging 1 nodes into parent node.
> Parent node id: 9d319454-cfb3-4c45-a948-4587c7af4fd4.
> Parent node text: PAGE 18It goes without saying th

In [None]:
from trulens_eval import Tru

Tru().get_leaderboard(app_ids=[])

| app_id | Answer Relevance | Context Relevance | Groundedness | latency | total_cost |
|---|---|---|---|---|---|
| app_0 | 0.975 | 0.304167 | 1.0 | 33.4 | 0.004256 |


### Three Layers

In [None]:
auto_merging_index_1 = build_automerging_index(
    documents,
    llm=llm,
    embed_model=embed_model,
    chunk_sizes=[2048, 512, 128],
)

auto_merging_engine_1 = get_automerging_query_engine(
    auto_merging_index_1,
    similarity_top_k=12,
    rerank_top_n=6,
)

In [None]:
tru_recorder = get_prebuilt_trulens_recorder(
    auto_merging_engine_1,
    app_id ='app_1'
)

✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input response will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input source will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .


In [None]:
run_evals(eval_questions, tru_recorder, auto_merging_engine_1)

> Merging 4 nodes into parent node.
> Parent node id: 654e72b4-f234-4509-b2fe-9701c39a2493.
> Parent node text: PAGE 26If you’re considering a role switch, a startup can be an easier place to do it than a big ...

> Merging 5 nodes into parent node.
> Parent node id: 7f82ea88-1933-4f55-8031-0d7fc2137fd5.
> Parent node text: PAGE 25Finding a job has a few predictable steps that include selecting the companies to which yo...

> Merging 1 nodes into parent node.
> Parent node id: 547d3e41-c058-4d37-92b4-b0d8235d1348.
> Parent node text: PAGE 26If you’re considering a role switch, a startup can be an easier place to do it than a big ...

> Merging 1 nodes into parent node.
> Parent node id: ffc5c6c4-b939-4e6d-9187-a512b2ba4f08.
> Parent node text: PAGE 25Finding a job has a few predictable steps that include selecting the companies to which yo...

> Merging 5 nodes into parent node.
> Parent node id: d686bc26-8413-498f-b0eb-87ecacaf21ce.
> Parent node text: PAGE 27There’s a lot we don’t kn

In [None]:
from trulens_eval import Tru

Tru().get_leaderboard(app_ids=[])

| app_id | Answer Relevance | Context Relevance | Groundedness | latency | total_cost |
|---|---|---|---|---|---|
| app_0 | 0.98 | 0.243333 | 0.98 | 33.4 | 0.004256 |
| app_1 | 0.95 | 0.284167 | 0.916667 | 33.4 | 0.002269 |


In [None]:
# Open the dashboard to see the detailed leaderboard
Tru().run_dashboard()


**Trade-offs and Balancing:** Sometimes, there might be trade-offs between these measures. For example, a highly context-relevant response might include some minor factual inaccuracies compared to a perfectly grounded response that's less directly related to the specific query. It's important to consider the relative importance of each factor depending on the task and user needs.
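One way to make this balancing explicit is a weighted score over the three measures. The sketch below uses the leaderboard values reported above for `app_0` and `app_1`; the weighting schemes and helper names are hypothetical assumptions chosen only to show that the preferred system can change with the priorities.

```python
# Scores copied from the final leaderboard above.
leaderboard = {
    "app_0": {"Answer Relevance": 0.98, "Context Relevance": 0.243333, "Groundedness": 0.98},
    "app_1": {"Answer Relevance": 0.95, "Context Relevance": 0.284167, "Groundedness": 0.916667},
}


def weighted_score(scores, weights):
    """Weighted average of the measures named in `weights`."""
    total = sum(weights.values())
    return sum(scores[metric] * w for metric, w in weights.items()) / total


def best_app(weights):
    """App with the highest weighted score under the given priorities."""
    return max(leaderboard, key=lambda app: weighted_score(leaderboard[app], weights))


# Hypothetical priorities: one favours faithful answers, the other retrieval quality.
groundedness_first = {"Answer Relevance": 1, "Context Relevance": 1, "Groundedness": 3}
retrieval_first = {"Answer Relevance": 1, "Context Relevance": 5, "Groundedness": 1}

print(best_app(groundedness_first))  # app_0
print(best_app(retrieval_first))     # app_1
```

With only five evaluation questions, differences of this size should be read as indicative and not conclusive.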

<a name='4' ></a>
# References
- [Building and Evaluating Advanced RAG Applications](https://www.deeplearning.ai/short-courses/building-evaluating-advanced-rag/)
- [Advanced RAG Techniques: an Illustrated Overview](https://pub.towardsai.net/advanced-rag-techniques-an-illustrated-overview-04d193d8fec6)
- [Evaluate RAG with LlamaIndex](https://cookbook.openai.com/examples/evaluation/evaluate_rag_with_llamaindex)

See more details on my GitHub: [QuyAnh2005](https://github.com/QuyAnh2005/RAG-with-LlamaIndex/tree/main)