# Optimizing RAG with LlamaIndex
We will improve a Retrieval-Augmented Generation (RAG) system using LlamaIndex. Here is the plan:

- <u>Adjusting TOP_K Retrieval:</u>
Trying out TOP_K values [2, 4, 6, 8, 10] to see the impact on the retrieved information and the generated answers.

- <u>Testing Different Embedding Models:</u>
Comparing OpenAI's "text-embedding-ada-002" with Cohere's "embed-english-v3.0" to find the one that gives the best results.

- <u>Adding a Reranker:</u>
Using a reranker to re-order the documents returned by the retriever.

- <u>Using Deep Memory:</u>
Seeing how it helps with retrieval accuracy.

Our goal is to try different changes and see which ones make our system better at finding and giving accurate information.

#### Let's build a RAG pipeline using LlamaIndex.

In [None]:
!pip3 install deeplake llama_index langchain openai tiktoken cohere pandas torch sentence-transformers

In [None]:
# The nest_asyncio module enables the nesting of asynchronous functions within an already running async loop.
# This is necessary because Jupyter notebooks inherently operate in an asynchronous loop.
# By applying nest_asyncio, we can run additional async functions within this existing loop without conflicts.
import nest_asyncio
nest_asyncio.apply()


In [None]:
import os, getpass
os.environ['ACTIVELOOP_TOKEN'] = getpass.getpass()

In [None]:
os.environ['OPENAI_API_KEY'] = getpass.getpass()

In [None]:
os.environ['COHERE_API_KEY'] = getpass.getpass()

#### Download Data

In [None]:
!mkdir -p 'data/paul_graham/'
!curl 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -o 'data/paul_graham/paul_graham_essay.txt'


#### Load Data and Build nodes.

In [None]:
from llama_index.node_parser import SimpleNodeParser
from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)

# By default, node/chunk IDs are random UUIDs. To keep the same IDs across runs, we set them manually.
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"

print(f"Number of Documents: {len(documents)}")
print(f"Number of nodes: {len(nodes)} with the current chunk size of {node_parser.chunk_size}")

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Number of Documents: 1
Number of nodes: 58 with the current chunk size of 512


In [None]:
from llama_index import VectorStoreIndex, ServiceContext, StorageContext
from llama_index.vector_stores import DeepLakeVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms import OpenAI

# Create a local Deep Lake VectorStore
dataset_path = "./data/paul_graham/deep_lake_db"
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=True)

# LLM that will answer questions with the retrieved context
llm = OpenAI(model="gpt-3.5-turbo-1106")
# We use OpenAI's embedding model "text-embedding-ada-002"
embed_model = OpenAIEmbedding()

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm,)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

vector_index = VectorStoreIndex(nodes, service_context=service_context, storage_context=storage_context, show_progress=True)



Generating embeddings:   0%|          | 0/58 [00:00<?, ?it/s]

Uploading data to deeplake dataset.


100%|██████████| 58/58 [00:00<00:00, 169.79it/s]

Dataset(path='./data/paul_graham/deep_lake_db', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
   text       text      (58, 1)      str     None   
 metadata     json      (58, 1)      str     None   
 embedding  embedding  (58, 1536)  float32   None   
    id        text      (58, 1)      str     None   





#### With the vector index, we can now build a QueryEngine, which generates answers with an LLM and the retrieved chunks of text.

In [None]:
query_engine = vector_index.as_query_engine()
response_vector = query_engine.query("What are the main things Paul worked on before college?")
print(response_vector.response)

Before college, Paul worked on writing and programming.
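
Conceptually, the retrieval step behind the query engine embeds the query and returns the chunks whose embeddings are most similar to it. The following is a minimal sketch of that idea using toy 3-dimensional vectors and cosine similarity. The function names and vectors are illustrative assumptions, and the real index delegates the search to the vector store, which may use a different implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k):
    """Return the ids of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]

# Hypothetical chunk embeddings (real ones have hundreds of dimensions)
toy_chunks = {
    "node_0": [1.0, 0.0, 0.0],
    "node_1": [0.8, 0.6, 0.0],
    "node_2": [0.0, 1.0, 0.0],
    "node_3": [0.0, 0.0, 1.0],
}
toy_query = [0.9, 0.1, 0.0]

top2 = top_k(toy_query, toy_chunks, k=2)
print(top2)
```

Only the returned chunks are passed to the LLM as context, which is why the choice of `k` matters for the evaluations that follow.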


## Baseline Evaluation

While it's beneficial to examine individual queries and responses at the start, this approach may become impractical as the volume of edge cases and failures increases. Instead, it may be more effective to establish a suite of summary metrics or automated evaluations. These tools can provide insights into overall system performance and indicate specific areas that may require closer scrutiny.

In a RAG system, evaluation focuses on two critical aspects:

*   **Retrieval Evaluation:** This assesses the accuracy and relevance of the information retrieved by the system.
*   **Response Evaluation:** This measures the quality and appropriateness of the responses generated by the system based on the retrieved information. In practice, this is usually judged by a strong LLM such as GPT-4.

LlamaIndex offers both Retrieval and Response evaluation functions.

#### Question-Context Pair Generation:

For the evaluation of a RAG system, it's essential to have queries that can fetch the correct context and subsequently generate an appropriate response. LlamaIndex offers a `generate_question_context_pairs` function for crafting question and context pairs, which can be used for both Retrieval and Response Evaluation of the RAG system. For more details on Question Generation, please refer to the [documentation](https://docs.llamaindex.ai/en/stable/examples/evaluation/QuestionGeneration.html).

In [None]:
from llama_index.evaluation import generate_question_context_pairs
qc_dataset = generate_question_context_pairs(
    nodes,
    llm=llm,
    num_questions_per_chunk=1
)
# We can save the dataset as a json file for later use.
qc_dataset.save_json("qc_dataset.json")

100%|██████████| 58/58 [01:30<00:00,  1.56s/it]


In [None]:
# We can also load the dataset from a json file.
from llama_index.finetuning.embeddings.common import (
    EmbeddingQAFinetuneDataset,
)
qc_dataset = EmbeddingQAFinetuneDataset.from_json(
    "qc_dataset.json"
)

### Evaluation for Hit Rate and Mean Reciprocal Rank (MRR)

With the generated dataset of question/context pairs, we are now ready to run our retrieval evaluations.

We will use the `RetrieverEvaluator` available in LlamaIndex to measure the Hit Rate and Mean Reciprocal Rank (MRR).

**Hit Rate:**

Think of the Hit Rate like playing a game of guessing. You're given a question and you need to guess the correct answer from a list of options. The Hit Rate measures how often you guess the correct answer by only looking at your top few guesses. If you often find the right answer in your first few guesses, you have a high Hit Rate. So, in the context of a retrieval system, it's about how frequently the system finds the correct document within its top 'k' picks (where 'k' is a number you decide, like top 5 or top 10).

**Mean Reciprocal Rank (MRR):**

MRR is a bit like measuring how quickly you can find a treasure in a list of boxes. Imagine you have a row of boxes and only one of them has a treasure. The MRR calculates how close to the start of the row the treasure box is, on average. If the treasure is always in the first box you open, you're doing great and have an MRR of 1. If it's in the second box, the score is 1/2, since you took two tries to find it. If it's in the third box, your score is 1/3, and so on. MRR averages these scores across all your searches. So, for a retrieval system, MRR looks at where the correct document ranks in the system's guesses. If it's usually near the top, the MRR will be high, indicating good performance.

In summary, Hit Rate tells you how often the system gets it right in its top guesses, and MRR tells you how close to the top the right answer usually is. Both metrics are useful for evaluating the effectiveness of a retrieval system, like how well a search engine or a recommendation system works.
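
To make the two definitions concrete, here is a minimal sketch that computes both metrics by hand for three made-up queries. The node IDs and helper names are illustrative assumptions; this is not how `RetrieverEvaluator` is implemented internally, only the arithmetic the metrics describe.

```python
def hit(retrieved_ids, expected_id):
    """1.0 if the expected chunk is among the retrieved ones, else 0.0."""
    return 1.0 if expected_id in retrieved_ids else 0.0

def reciprocal_rank(retrieved_ids, expected_id):
    """1/rank of the expected chunk (1-based), or 0.0 if it is missing."""
    for rank, node_id in enumerate(retrieved_ids, start=1):
        if node_id == expected_id:
            return 1.0 / rank
    return 0.0

# Hypothetical (retrieved list, expected chunk) pairs for three queries
toy_results = [
    (["node_3", "node_7", "node_1"], "node_3"),  # found at rank 1
    (["node_4", "node_2", "node_9"], "node_2"),  # found at rank 2
    (["node_5", "node_6", "node_8"], "node_0"),  # not found
]

hit_rate = sum(hit(r, e) for r, e in toy_results) / len(toy_results)
mrr = sum(reciprocal_rank(r, e) for r, e in toy_results) / len(toy_results)
print(f"Hit Rate: {hit_rate:.3f}, MRR: {mrr:.3f}")
```

Two of the three queries are hits, so the Hit Rate is 2/3, while the MRR is (1 + 1/2 + 0) / 3 = 0.5.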

First we define a function to display the Retrieval evaluation results in table format.

In [None]:
import pandas as pd

def display_results_retriever(name, eval_results):
    """Display results from evaluate."""

    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    metric_df = pd.DataFrame(
        {"Retriever Name": [name], "Hit Rate": [hit_rate], "MRR": [mrr]}
    )

    return metric_df

Then run the evaluation procedure.

In [None]:
from llama_index.evaluation import RetrieverEvaluator

# We can evaluate the retrievers with different top_k values.
for i in [2, 4, 6, 8, 10]:
    retriever = vector_index.as_retriever(similarity_top_k=i)
    retriever_evaluator = RetrieverEvaluator.from_metric_names(
        ["mrr", "hit_rate"], retriever=retriever
    )
    eval_results = await retriever_evaluator.aevaluate_dataset(qc_dataset)
    print(display_results_retriever(f"Retriever top_{i}", eval_results))


    Retriever Name  Hit Rate       MRR
0  Retriever top_2  0.687943  0.560284
    Retriever Name  Hit Rate       MRR
0  Retriever top_4  0.829787  0.602837
    Retriever Name  Hit Rate       MRR
0  Retriever top_6  0.893617  0.615366
    Retriever Name  Hit Rate       MRR
0  Retriever top_8  0.943262  0.621952
     Retriever Name  Hit Rate       MRR
0  Retriever top_10  0.957447  0.623449


We notice that Hit Rate increases as the top_k value increases, which is what we would expect: casting a wider net raises the probability that the correct chunk is included in the returned set. MRR improves far more slowly (0.560 to 0.623), because the additional hits land at low ranks and contribute little to the reciprocal-rank average.
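
The relationship between top_k and Hit Rate can be reproduced with a few lines of plain Python. The sketch below assumes we already know, for each query, the rank at which the correct chunk would appear; the ranks are made up for illustration and are unrelated to the numbers reported above.

```python
# Hypothetical 1-based rank of the correct chunk for each of eight queries.
# None means the correct chunk never shows up in the candidate list.
correct_ranks = [1, 3, 2, 7, None, 1, 5, 9]

def hit_rate_at_k(ranks, k):
    """Fraction of queries whose correct chunk sits within the top k."""
    hits = sum(1 for rank in ranks if rank is not None and rank <= k)
    return hits / len(ranks)

rates = {k: hit_rate_at_k(correct_ranks, k) for k in [2, 4, 6, 8, 10]}
for k, rate in rates.items():
    print(f"top_{k}: hit rate = {rate}")
```

Because every chunk inside the top 2 is also inside the top 4, the Hit Rate can only stay flat or rise as k grows. The trade-off is that larger k values pass more, and possibly less relevant, context to the LLM.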

### Evaluation for Relevancy and Faithfulness metrics.

**Relevancy**
Evaluates whether retrieved context and answer are relevant to the query.

**Faithfulness**
Evaluates whether the answer is faithful to the retrieved contexts (in other words, whether there is hallucination).

Now let's see how the top_k value affects these two metrics.

In [None]:
from llama_index.evaluation import RelevancyEvaluator, FaithfulnessEvaluator, BatchEvalRunner

for i in [2, 4, 6, 8, 10]:
    # Query engine that retrieves the top-i chunks for each question
    query_engine = vector_index.as_query_engine(similarity_top_k=i)

    # While we use GPT-3.5-Turbo to answer questions, we use GPT-4 to evaluate the answers.
    llm_gpt4 = OpenAI(temperature=0, model="gpt-4-1106-preview")
    service_context_gpt4 = ServiceContext.from_defaults(llm=llm_gpt4)

    faithfulness_evaluator = FaithfulnessEvaluator(service_context=service_context_gpt4)
    relevancy_evaluator = RelevancyEvaluator(service_context=service_context_gpt4)

    # Run evaluation
    queries = list(qc_dataset.queries.values())
    batch_eval_queries = queries[:20]

    runner = BatchEvalRunner(
        {"faithfulness": faithfulness_evaluator, "relevancy": relevancy_evaluator},
        workers=8,
    )
    eval_results = await runner.aevaluate_queries(
        query_engine, queries=batch_eval_queries
    )
    # Each score is the fraction of evaluated queries that passed the metric.
    faithfulness_score = sum(result.passing for result in eval_results['faithfulness']) / len(eval_results['faithfulness'])
    print(f"top_{i} faithfulness_score: {faithfulness_score}")

    # Relevancy is computed from its own list of results.
    relevancy_score = sum(result.passing for result in eval_results['relevancy']) / len(eval_results['relevancy'])
    print(f"top_{i} relevancy_score: {relevancy_score}")


top_2 faithfulness_score: 0.95
top_2 relevancy_score: 0.95
top_4 faithfulness_score: 0.95
top_4 relevancy_score: 0.95
top_6 faithfulness_score: 0.95
top_6 relevancy_score: 0.95
top_8 faithfulness_score: 1.0
top_8 relevancy_score: 1.0
top_10 faithfulness_score: 1.0
top_10 relevancy_score: 1.0


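On this 20-query sample, both reported scores move from 0.95 to 1.0 once top_k reaches 8. With only 20 queries that is a difference of a single answer, so treat the trend as suggestive rather than conclusive.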
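
Each score is a pass rate: the number of queries judged as passing divided by the number of queries evaluated, computed separately for each metric. The sketch below shows the arithmetic with a stand-in result class; `ToyEvalResult` and the toy data are assumptions made for illustration, not LlamaIndex objects.

```python
from dataclasses import dataclass

@dataclass
class ToyEvalResult:
    """Stand-in for an evaluation result that only records pass/fail."""
    passing: bool

def pass_rate(results):
    """Fraction of results marked as passing."""
    return sum(result.passing for result in results) / len(results)

# Hypothetical outcome for 20 queries: one faithfulness failure, no relevancy failures
toy_eval_results = {
    "faithfulness": [ToyEvalResult(True)] * 19 + [ToyEvalResult(False)],
    "relevancy": [ToyEvalResult(True)] * 20,
}

scores = {name: pass_rate(results) for name, results in toy_eval_results.items()}
print(scores)
```

With 20 queries, each failing answer lowers the score by 0.05, so the smallest visible step between two runs is 0.05.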

## Changing the embedding model

Now that we have a baseline evaluation, we can swap out some modules of the RAG pipeline, such as the embedding model.

In [None]:
import os
from llama_index import VectorStoreIndex, ServiceContext, StorageContext
from llama_index.vector_stores import DeepLakeVectorStore
from llama_index.embeddings.cohereai import CohereEmbedding
from llama_index.llms import OpenAI

# Create a DeepLakeVectorStore locally to store the vectors
dataset_path = "./data/paul_graham/deep_lake_db_1"
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=False)

llm = OpenAI(model="gpt-3.5-turbo-1106")
embed_model = CohereEmbedding(
    cohere_api_key=os.getenv('COHERE_API_KEY'),
    model_name="embed-english-v3.0",
    input_type="search_document",
)


service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm,)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

vector_index = VectorStoreIndex(nodes, service_context=service_context, storage_context=storage_context, show_progress=True)




Generating embeddings:   0%|          | 0/58 [00:00<?, ?it/s]

Uploading data to deeplake dataset.


100%|██████████| 58/58 [00:00<00:00, 315.69it/s]

Dataset(path='./data/paul_graham/deep_lake_db_1', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
   text       text      (58, 1)      str     None   
 metadata     json      (58, 1)      str     None   
 embedding  embedding  (58, 1024)  float32   None   
    id        text      (58, 1)      str     None   





Run the evaluation using these new embeddings.

In [None]:
from llama_index.evaluation import RetrieverEvaluator

embed_model.input_type = "search_query"
retriever = vector_index.as_retriever(similarity_top_k=10, embed_model=embed_model)

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)
eval_results = await retriever_evaluator.aevaluate_dataset(qc_dataset)
print(display_results_retriever(f"Retriever_cohere_embeds", eval_results))

            Retriever Name  Hit Rate       MRR
0  Retriever_cohere_embeds  0.943262  0.648697


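Compared with the OpenAI embeddings at top_10 (Hit Rate 0.957, MRR 0.623), the Cohere embeddings give a slightly lower Hit Rate (0.943) but a better MRR (0.649).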

## Add a Reranker to the RAG pipeline

In [None]:
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.indices.postprocessor import SentenceTransformerRerank, LLMRerank

st_reranker = SentenceTransformerRerank(
    top_n=5, model="cross-encoder/ms-marco-MiniLM-L-6-v2"
)

llm_reranker = LLMRerank(
    choice_batch_size=4, top_n=5,
)
cohere_rerank = CohereRerank(api_key=os.getenv('COHERE_API_KEY'), top_n=10)
for reranker in [cohere_rerank, st_reranker, llm_reranker]:
    retriever_with_reranker = vector_index.as_retriever(similarity_top_k=10, postprocessor=reranker, embed_model=embed_model)

    retriever_evaluator_1 = RetrieverEvaluator.from_metric_names(
        ["mrr", "hit_rate"], retriever=retriever_with_reranker
    )
    eval_results1 = await retriever_evaluator_1.aevaluate_dataset(qc_dataset)
    print(display_results_retriever("Retriever with added Reranker", eval_results1))

                  Retriever Name  Hit Rate       MRR
0  Retriever with added Reranker  0.943262  0.648697
                  Retriever Name  Hit Rate       MRR
0  Retriever with added Reranker  0.943262  0.648697
                  Retriever Name  Hit Rate       MRR
0  Retriever with added Reranker  0.943262  0.648697


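All three rerankers report exactly the same Hit Rate and MRR as the Cohere-embedding retriever without a reranker (0.943 and 0.649). Numbers that match to six decimal places suggest that the `postprocessor` argument passed to `as_retriever` probably had no effect, so the rerankers were likely never applied during this evaluation. These results therefore do not tell us whether reranking helps.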
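
A reranker takes the candidates returned by the retriever and re-orders them with a stronger relevance model. The sketch below illustrates the effect on the reciprocal rank of one query. The candidate IDs and relevance scores are made up, and the dictionary lookup stands in for a real scoring model such as a cross-encoder.

```python
def rerank(candidates, score_fn, top_n):
    """Re-order candidates by score_fn (higher is better) and keep top_n."""
    return sorted(candidates, key=score_fn, reverse=True)[:top_n]

def reciprocal_rank(ranked_ids, expected_id):
    """1/rank of the expected chunk (1-based), or 0.0 if it is missing."""
    for rank, node_id in enumerate(ranked_ids, start=1):
        if node_id == expected_id:
            return 1.0 / rank
    return 0.0

# Hypothetical candidates in the order the first-stage retriever returned them
candidates = ["node_12", "node_4", "node_30", "node_7"]
# Hypothetical relevance scores that a reranking model might assign
toy_scores = {"node_12": 0.21, "node_4": 0.35, "node_30": 0.90, "node_7": 0.10}

reranked = rerank(candidates, toy_scores.get, top_n=3)
rr_before = reciprocal_rank(candidates, "node_30")
rr_after = reciprocal_rank(reranked, "node_30")
print(reranked, rr_before, rr_after)
```

Since a reranker only re-orders (and possibly truncates) the candidates it is given, it can raise MRR but cannot raise the Hit Rate above that of the underlying retriever at the same top_k.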

## Add Deep Memory to the RAG pipeline

[Activeloop's Deep Memory](https://www.activeloop.ai/resources/use-deep-memory-to-boost-rag-apps-accuracy-by-up-to-22/), a feature available to Activeloop Deep Lake users, improves retrieval by introducing a tiny neural network layer trained to match user queries with relevant data from a corpus. While this addition incurs minimal latency during search, Activeloop reports that it can boost retrieval accuracy by up to 22%.

First let's convert our generated dataset into a format Deep Memory expects.


In [None]:
def create_query_relevance(qa_dataset):
    """Function for converting llama-index dataset to correct format for deep memory training"""
    queries = [text for _, text in qa_dataset.queries.items()]
    relevant_docs = qa_dataset.relevant_docs
    relevance = []
    for doc in relevant_docs:
        relevance.append([(relevant_docs[doc][0], 1)])
    return queries, relevance

train_queries, train_relevance = create_query_relevance(qc_dataset)
print(len(train_queries))

141
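
To see the shape of the two lists produced by the conversion, here is a small self-contained sketch on a toy dataset. The `SimpleNamespace` object merely imitates the `queries` and `relevant_docs` attributes used above, and `to_deep_memory_format` is a hypothetical variant that pairs each query with its relevant chunk by query ID instead of relying on both dictionaries sharing the same order.

```python
from types import SimpleNamespace

# Toy stand-in exposing the same two attributes as the generated dataset
toy_dataset = SimpleNamespace(
    queries={
        "q1": "What did the author study?",
        "q2": "Where did the author work?",
    },
    relevant_docs={"q1": ["node_0"], "q2": ["node_5"]},
)

def to_deep_memory_format(dataset):
    """Return parallel lists: query texts and [(doc_id, relevance)] entries."""
    queries, relevance = [], []
    for query_id, text in dataset.queries.items():
        queries.append(text)
        # One relevant chunk per query, labelled with relevance 1
        relevance.append([(dataset.relevant_docs[query_id][0], 1)])
    return queries, relevance

toy_queries, toy_relevance = to_deep_memory_format(toy_dataset)
print(toy_queries)
print(toy_relevance)
```

Each entry of the relevance list is itself a list of `(chunk_id, score)` tuples, and its position matches the query at the same index.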


Now let's put our Vector Store on Activeloop's platform and convert it into a managed database.

In [None]:
import deeplake
local = "./data/paul_graham/deep_lake_db"
hub_path = "hub://genai360/optimization_paul_graham"
hub_managed_path = "hub://genai360/optimization_paul_graham_managed"

# First upload our local vector store
deeplake.deepcopy(local, hub_path, overwrite=True)
# Create a managed vector store
deeplake.deepcopy(hub_path, hub_managed_path, overwrite=True, runtime={"tensor_db": True})

In [None]:
import os
from llama_index import VectorStoreIndex, ServiceContext, StorageContext
from llama_index.vector_stores import DeepLakeVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms import OpenAI

vector_store = DeepLakeVectorStore(dataset_path=hub_managed_path, overwrite=False, runtime={"tensor_db": True}, read_only=True)
llm = OpenAI(model="gpt-3.5-turbo-1106")
embed_model = OpenAIEmbedding()

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm,)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

vector_index = VectorStoreIndex.from_vector_store(vector_store,service_context=service_context, storage_context=storage_context, use_async=False, show_progress=True)

Deep Lake Dataset in hub://genai360/optimization_paul_graham_managed already exists, loading from the storage


Launch Deep Memory training

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

job_id = vector_store.vectorstore.deep_memory.train(
    queries=train_queries,
    relevance=train_relevance,
    embedding_function=embeddings.embed_documents,
)

In [None]:
# Check the status of the training job launched above
vector_store.vectorstore.deep_memory.status(job_id)

#### Let's Generate a Test dataset to evaluate Deep Memory

In [None]:
from llama_index.evaluation import generate_question_context_pairs
# Generate test dataset
test_dataset = generate_question_context_pairs(
    nodes[:20],
    llm=llm,
    num_questions_per_chunk=1
)
test_dataset.save_json("test_dataset.json")

# We can also load the dataset from a json file if already done previously.
from llama_index.finetuning.embeddings.common import (
    EmbeddingQAFinetuneDataset,
)
test_dataset = EmbeddingQAFinetuneDataset.from_json(
    "test_dataset.json"
)

test_queries, test_relevance = create_query_relevance(test_dataset)

100%|██████████| 20/20 [00:29<00:00,  1.49s/it]


In [None]:
# Evaluate recall on the generated test dataset
recalls = vector_store.vectorstore.deep_memory.evaluate(
    queries=test_queries,
    relevance=test_relevance,
    embedding_function=embeddings.embed_documents,
)

Embedding queries took 0.82 seconds
---- Evaluating without Deep Memory ---- 
Recall@1:	  45.2%
Recall@3:	  78.6%
Recall@5:	  90.5%
Recall@10:	  95.2%
Recall@50:	  100.0%
Recall@100:	  100.0%
---- Evaluating with Deep Memory ---- 
Recall@1:	  45.2%
Recall@3:	  83.3%
Recall@5:	  92.9%
Recall@10:	  95.2%
Recall@50:	  100.0%
Recall@100:	  100.0%
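
Recall@k in this setting is the share of test queries whose relevant chunk appears among the first k retrieved results. The following is a minimal sketch of that calculation on made-up data; it is an illustration of the metric, not the code that `deep_memory.evaluate` runs.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of queries whose relevant id is within the first k retrieved ids."""
    found = sum(1 for ids, rel in zip(retrieved, relevant) if rel in ids[:k])
    return found / len(relevant)

# Hypothetical ranked results for four queries and the relevant id for each
toy_retrieved = [
    ["a", "b", "c"],
    ["d", "e", "f"],
    ["g", "h", "i"],
    ["j", "k", "l"],
]
toy_relevant = ["a", "e", "i", "z"]

for k in (1, 3):
    print(f"Recall@{k}: {recall_at_k(toy_retrieved, toy_relevant, k):.1%}")
```

With exactly one relevant chunk per query, as in the datasets generated here, Recall@k is the same quantity as the Hit Rate at top_k.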


In [None]:
from llama_index.evaluation import RetrieverEvaluator

# Baseline retriever without Deep Memory
base_retriever = vector_index.as_retriever(similarity_top_k=10)

base_retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=base_retriever
)
eval_results = await base_retriever_evaluator.aevaluate_dataset(test_dataset)
print(display_results_retriever("Retriever Results", eval_results))

      Retriever Name  Hit Rate       MRR
0  Retriever Results  0.952381  0.641761


In [None]:
# Retriever with Deep Memory enabled
deep_memory_retriever = vector_index.as_retriever(
    similarity_top_k=10, vector_store_kwargs={"deep_memory": True}
)

dm_retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=deep_memory_retriever
)
dm_eval_results = await dm_retriever_evaluator.aevaluate_dataset(test_dataset)
print(display_results_retriever("Retriever Results", dm_eval_results))

      Retriever Name  Hit Rate       MRR
0  Retriever Results  0.952381  0.661376


Deep Memory Inference

In [None]:
query_engine = vector_index.as_query_engine(
    vector_store_kwargs={"deep_memory": True}
)
response = query_engine.query(
    "What are the main things Paul worked on before college?"
)
print(response)

Before college, Paul worked on applying to art schools, specifically RISD in the US and the Accademia di Belli Arti in Florence. He also applied for the BFA program at RISD and had to do the foundation classes in fundamental subjects like drawing, color, and design.


## Conclusion

In this notebook, we have explored how to build and evaluate a RAG pipeline using LlamaIndex, with a specific focus on evaluating the retrieval system and generated responses within the pipeline.