# Introduction

We now focus on optimizing a LlamaIndex RAG pipeline through a series of iterative evaluations. The step-by-step plan is:

-   **Baseline Evaluation**: Construct a standard LlamaIndex RAG pipeline and establish an initial performance baseline. We will adjust the top-k retrieval values to understand their effects on the accuracy and relevance of generated answers.
-   **Testing Different Embedding Models**: Evaluate different embedding models to identify the most effective one for our pipeline.
-   **Incorporating a Reranker**: Implement a reranking mechanism to refine the document selection process of the retriever.
-   **Employing a Deep Memory Approach**: Investigate the impact of a deep memory component on the accuracy of information retrieval. *(This part requires a paid subscription to ActiveLoop.)*



## Baseline Evaluation

First, download the data: a Paul Graham essay from the LlamaIndex example files.

In [None]:
import wget
import os

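# Make sure the target directory exists; wget.download does not create it
data_dir = 'paul_graham'
os.makedirs(data_dir, exist_ok=True)

file_path = os.path.join(data_dir, 'paul_graham_essay.txt')
wget.download('https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt', out=file_path)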

Change the global settings so that LlamaIndex uses the open-source models served by Ollama.

In [19]:
from llama_index.core import Settings
from langchain_ollama import OllamaEmbeddings
from langchain_ollama import OllamaLLM

# Register the embedding model and the LLM in the global LlamaIndex settings
Settings.embed_model = OllamaEmbeddings(model="llama3.1:8b")
Settings.llm = OllamaLLM(model="llama3.1:8b")

Create the LlamaIndex nodes.

In [20]:
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core import SimpleDirectoryReader

# Load the downloaded documents
documents = SimpleDirectoryReader("./paul_graham/").load_data()
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
# Get nodes from documents
nodes = node_parser.get_nodes_from_documents(documents=documents)

# By default, the node/chunk ids are random UUIDs. To get the same ids on every run, we set them manually.
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"

The next step is to create a LlamaIndex `VectorStoreIndex` and store the embeddings in a `DeepLakeVectorStore`.

In [21]:
from llama_index.vector_stores.deeplake import DeepLakeVectorStore
from llama_index.core import StorageContext, VectorStoreIndex

# Create a local Deep Lake VectorStore
dataset_path = "./ddbb/paul_graham"
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=True)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Create and save the embeddings from the nodes (the models come from the global Settings)
vector_index = VectorStoreIndex(nodes, storage_context=storage_context, show_progress=True)

Generating embeddings: 100%|██████████| 129/129 [00:43<00:00,  2.95it/s]

Uploading data to deeplake dataset.



100%|██████████| 129/129 [00:00<00:00, 295.30it/s]

Dataset(path='./ddbb/paul_graham', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype       shape      dtype  compression
  -------    -------     -------    -------  ------- 
   text       text      (129, 1)      str     None   
 metadata     json      (129, 1)      str     None   
 embedding  embedding  (129, 4096)  float32   None   
    id        text      (129, 1)      str     None   





Now we can build a `QueryEngine`, which retrieves the most similar chunks of text and passes them to the LLM to generate an answer.

In [22]:
query_engine = vector_index.as_query_engine(similarity_top_k=10)
response_vector = query_engine.query("What are the main things Paul worked on before college?")
print(response_vector.response)

The main things Paul worked on before college were:

* Learning programming through interactions with friends who had microcomputers (specifically a Heathkit kit and later a TRS-80)
* Writing simple programs for his own use, such as games, model rocket predictions, and a word processor
* Assembling and using a home-built computer kit

Note: The original context of Paul's essay was about how he learned to program before college, but it also mentioned that philosophy was his intended college major. However, the new context provided (which is actually the same as the original text) further emphasizes Paul's early interest in programming and his continued engagement with it throughout his life.


Once we have a simple RAG pipeline, we can evaluate it. For that, we need a dataset. LlamaIndex offers a `generate_question_context_pairs` function specifically for generating question and context pairs. We will use that dataset to assess the RAG chunk retrieval and response capabilities.

Let’s also save the generated dataset in JSON format for later use. In this case we generate only one question per chunk, but you can increase the number of samples in the dataset for a more thorough evaluation.

In [23]:
from llama_index.core.evaluation import generate_question_context_pairs

qc_dataset = generate_question_context_pairs(
    nodes,
    llm=Settings.llm,
    num_questions_per_chunk=1
)
# Save the results
qc_dataset.save_json("qc_dataset.json")


100%|██████████| 129/129 [12:11<00:00,  5.67s/it]


Load the dataset. With `num_questions_per_chunk=1` we have generated one question per chunk, 129 in total. These questions are about the documents we have loaded.

In [24]:
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

# Load the results
qc_dataset = EmbeddingQAFinetuneDataset.from_json(
    "qc_dataset.json"
)
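
To make the structure of the dataset more concrete, here is a minimal sketch using plain dictionaries. It assumes the dataset is organized as three mappings named `queries`, `corpus`, and `relevant_docs`; the ids and texts below are hypothetical placeholders, not values from the generated file.

```python
import json

# Hypothetical miniature of a question/context dataset (placeholder ids and texts).
toy_dataset = {
    "queries": {"q1": "placeholder question about chunk 0"},
    "corpus": {"node_0": "placeholder text of chunk 0"},
    "relevant_docs": {"q1": ["node_0"]},
}

# Round-trip through JSON, as happens when the dataset is saved and loaded again.
restored = json.loads(json.dumps(toy_dataset))

# Pair each question with the ids of the chunks it was generated from.
pairs = [
    (query_id, restored["relevant_docs"][query_id])
    for query_id in restored["queries"]
]

for query_id, doc_ids in pairs:
    print(query_id, restored["queries"][query_id], "->", doc_ids)
```

During retrieval evaluation, each question is sent to the retriever and the returned node ids are compared with the expected ids for that question.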

We start with the retrieval evaluation. We will use the `RetrieverEvaluator` class available in LlamaIndex to measure:

-   *Hit Rate*: The fraction of queries for which the correct document appears among the system's top 'k' picks. A high hit rate means the system usually finds the right document within its first few guesses.
-   *Mean Reciprocal Rank (MRR)*: For each query, take the reciprocal of the rank at which the correct document appears (1 for first place, 1/2 for second, and so on, 0 if it is not retrieved), then average over all queries. If the correct document is usually near the top, the MRR will be high, indicating good performance.

In summary, **Hit Rate tells you how often the system gets it right in its top guesses, and MRR tells you how close to the top the right answer usually is**.
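
As an illustration, both metrics can be computed by hand on a toy example. This is only a sketch of the definitions above, not the LlamaIndex implementation; the helper functions and the node ids in `toy_results` are hypothetical.

```python
def hit_rate(retrieved_ids, expected_ids):
    """Return 1.0 if any expected id appears in the retrieved list, else 0.0."""
    return 1.0 if any(doc_id in expected_ids for doc_id in retrieved_ids) else 0.0


def reciprocal_rank(retrieved_ids, expected_ids):
    """Return 1/rank of the first expected id in the retrieved list, or 0.0 on a miss."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in expected_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical toy data: (retrieved ids in ranked order, expected ids) per query.
toy_results = [
    (["node_3", "node_7", "node_1"], ["node_3"]),  # correct document at rank 1
    (["node_5", "node_2", "node_9"], ["node_2"]),  # correct document at rank 2
    (["node_4", "node_6", "node_8"], ["node_0"]),  # correct document not retrieved
]

mean_hit_rate = sum(hit_rate(r, e) for r, e in toy_results) / len(toy_results)
mean_rr = sum(reciprocal_rank(r, e) for r, e in toy_results) / len(toy_results)

print(f"Hit Rate: {mean_hit_rate:.3f}, MRR: {mean_rr:.3f}")
# Prints: Hit Rate: 0.667, MRR: 0.500
```

Two of the three queries are hits, so the hit rate is 2/3, while the MRR is (1 + 1/2 + 0) / 3 = 0.5 because it also penalizes the second query for ranking the correct document lower.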


In [25]:
import pandas as pd
from llama_index.core.evaluation import RetrieverEvaluator

def display_results_retriever(name, eval_results):
    """Summarize the evaluation results as a one-row DataFrame with the mean hit rate and MRR."""
    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)
    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()
    metric_df = pd.DataFrame(
        {"Retriever Name": [name], "Hit Rate": [hit_rate], "MRR": [mrr]}
    )

    return metric_df

for i in [2,4,6,8,10]:
    # Create a retriever that returns the top-k similar nodes
    retriever = vector_index.as_retriever(similarity_top_k=i)
    # Create a retriever evaluator object
    retriever_evaluator = RetrieverEvaluator.from_metric_names(
        ["mrr", "hit_rate"], retriever=retriever
    )
    eval_results = await retriever_evaluator.aevaluate_dataset(qc_dataset)
    print(display_results_retriever(f"Retriever top_{i}", eval_results))

    Retriever Name  Hit Rate       MRR
0  Retriever top_2  0.015504  0.007752
    Retriever Name  Hit Rate      MRR
0  Retriever top_4  0.031008  0.01292
    Retriever Name  Hit Rate      MRR
0  Retriever top_6   0.03876  0.01447
    Retriever Name  Hit Rate       MRR
0  Retriever top_8  0.046512  0.015578
     Retriever Name  Hit Rate     MRR
0  Retriever top_10  0.062016  0.0173


We notice that the hit rate increases as the top_k value increases. But how does that impact the quality of the generated answers?

Now, we will evaluate **relevancy** and **faithfulness**:

-   *Relevancy*: Evaluates whether the retrieved context and the answer are relevant to the query.
-   *Faithfulness*: Evaluates whether the answer is faithful to the retrieved context or whether it contains hallucinations.

To execute this, we will use a bigger open-source model, **Mistral-Nemo 12b**.
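
Each evaluation result carries a boolean `passing` flag, so the score for a metric is simply the fraction of queries that pass. The sketch below shows that aggregation on hypothetical stand-in results; `ToyResult` and `pass_rate` are illustrative names, not part of LlamaIndex.

```python
from dataclasses import dataclass


@dataclass
class ToyResult:
    """Hypothetical stand-in for an evaluation result with a pass/fail flag."""
    passing: bool


def pass_rate(results):
    """Return the fraction of results whose `passing` flag is True (0.0 if empty)."""
    if not results:
        return 0.0
    return sum(result.passing for result in results) / len(results)


# Hypothetical outcomes for four queries under each metric.
toy_eval_results = {
    "faithfulness": [ToyResult(True), ToyResult(True), ToyResult(False), ToyResult(True)],
    "relevancy": [ToyResult(True), ToyResult(False), ToyResult(False), ToyResult(True)],
}

scores = {metric: pass_rate(results) for metric, results in toy_eval_results.items()}
print(scores)
# Prints: {'faithfulness': 0.75, 'relevancy': 0.5}
```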

In [26]:
from llama_index.core.evaluation import RelevancyEvaluator, FaithfulnessEvaluator, BatchEvalRunner

# Loop adjusted to avoid using ServiceContext
for i in [2, 4, 6, 8, 10]:
    # Query engine with the Ollama model
    query_engine = vector_index.as_query_engine(similarity_top_k=i)

    
    llm_mistral = OllamaLLM(model="mistral-nemo:latest")

    # Evaluating systems with a bigger LLM model
    faithfulness_evaluator = FaithfulnessEvaluator(llm=llm_mistral)
    relevancy_evaluator = RelevancyEvaluator(llm=llm_mistral)

    # Queries
    queries = list(qc_dataset.queries.values())
    batch_eval_queries = queries[:20]

    # Configure the batch evaluator with the Mistral 12b model
    runner = BatchEvalRunner(
        {"faithfulness": faithfulness_evaluator, "relevancy": relevancy_evaluator},
        workers=8,
    )

    # Execute evaluations
    eval_results = await runner.aevaluate_queries(
        query_engine, # Query engine with the Ollama 8B model
        queries=batch_eval_queries
    )

    # Calculate and show the results
    faithfulness_score = sum(result.passing for result in eval_results['faithfulness']) / len(eval_results['faithfulness'])
    print(f"top_{i} faithfulness_score: {faithfulness_score}")

    relevancy_score = sum(result.passing for result in eval_results['relevancy']) / len(eval_results['relevancy'])
    print(f"top_{i} relevancy_score: {relevancy_score}")


AttributeError: 'NoneType' object has no attribute 'model_name'