# Introduction

In this notebook we will **explore in depth and compare** some optimization techniques, as well as their contributes and challenges: 

-   *Prompt engineering*
-   *Retrieval Augmented Generation(RAG)*
-   *Fine-tuning*



# Prompt engineering

This approach alone can be sufficient, especially for simpler or well-defined tasks. Techniques like **few-shot prompting** can notably improve task performance. This method involves providing small task-specific examples to guide the LLM. **Chain of Thought (CoT)** prompting can also improve reasoning capabilities and encourage the model to generate more detailed responses.

# Fine-tuning

Fine-tuning enables models to perform tasks like extracting JSON-formatted data from text, translating natural language into SQL queries, or adopting a specific writing style.

Fine-tuning demands a large, high-quality, task-specific dataset for effective training. You can start with a small dataset and training to see if the method works for your task.

It's also not the best choice for incorporating new information into the model

# Retrieval Augmented Generation(RAG). Structure explanation and advanced techniques.

## Description of techniques to improve results

RAG specializes in incorporating external knowledge, enabling the model to access current and varied information. This technique has the following keypoints to consider : 

-   *Real-Time Updates*: It is more adept at dealing with evolving datasets and can provide more up-to-date responses.
-   *Complexity in Integration*: Setting up a RAG system is more complex than basic prompting.
-   *Data Managment*: Managing and updating the external data sources is crucial for maintaining the accuracy and relevance of its outputs.
-   *Retrieval accuracy*: Ensuring precise embedding retrieval is crucial in RAG systems to guarantee reliable and comprehensive responses to user queries. For that, **we will demonstrate how Activeloop’s Deep Memory method can greatly increase the recall of embedding retrieval**.

In the next steps, we will explore some techniques to improve RAG. The process of querying in LlamaIndex is structured around this key componentes : 

-   *Retrievers*: Class designed to retrieve a set of nodes from an index based on a query.
-   *Query Engine*: Class that process the query and return a response object. Uses the retrieveres to find relevant data and uses the response synthesizer to create the final answer.
-   *Query Transform*: Class that improves the original query.

This components can improve the performance of the RAG solution. However, there's also some other advanced techniques that can be implemented too:

-   *Query Construction*: The techniques is focused on convert the user query to a format more appropiated with different data sources. Can be implemented different approaches:
    -   *MetadataFilter* classes: An autoretriever that translates natural language into unstructured queries.
    -   *Text-to-SQL*: For relational databases. Converts natural language to SQL requests.However, can appear problems like hallucination. To avoid this issue, an accurate description of the database and some few shots of example should be providen the LLM. 
-   *Query Expansion*: Add phrases or another data to the user query to improve the search of data. It's useful when the original query is too short or not very specific. One approach to do it is utilizing the `synonym_expand_policy` from the `KnowledgeGraphRAGRetriever` class. This technique combined with *Query Transformation* is so useful.
-   *Query Tranformation*: Modifies the original query to make it more effective. This transformations include changes in the query's structure, the use of synonyms or the inclusion of contextual information.

To practice with an example, we will implement a query engine:

First, we download a text file that serves as source document. Is a compilation of all the essays Paul Graham wrote on his blog.

## Example of SubQuestionoQueryEngine

In [1]:
import wget
import os

os.makedirs('paul_graham', exist_ok=True)
file_path = os.path.join('paul_graham', 'paul_graham_essay.txt')
wget.download('https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt', out=file_path)

'paul_graham\\paul_graham_essay (1).txt'

In [2]:
from llama_index.core import Settings
from langchain_ollama import OllamaEmbeddings
from langchain_ollama import OllamaLLM

Settings.embed_model = OllamaEmbeddings(model="llama3.1:8b") # Load it into the setting of llama index
Settings.llm = OllamaLLM(model="llama3.1:8b")

We use the `SimpleDirectoryReader` class to read al the documents of a given directory automatically.

After that, we add some changes to the settings to fix the size of the chunks and the overlap between them.

In [3]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./paul_graham").load_data()


Settings.chunk_size = 512
Settings.chunk_overlap = 64
node_parser = Settings.node_parser
nodes = node_parser.get_nodes_from_documents(documents=documents)

Once we got the nodes with the `node_parser` object, we store them into a vector store database.

In [8]:
from llama_index.vector_stores.deeplake import DeepLakeVectorStore
import json
from llama_index.core.storage.storage_context import StorageContext
from llama_index.core import VectorStoreIndex

# CHANGE THIS CODE TO LOAD YOUR CREDENTIALS
with open("../data/keys.json", "rb") as file:
    data = json.load(file)
    activeloop_org_id = data['NameOrg']
    activeloop_dataset_name = "LlamaIndex_paulgraham_essays"
    dataset_path = f"hub://{activeloop_org_id}/{activeloop_dataset_name}"
    os.environ['ACTIVELOOP_TOKEN'] = data['ActiveLoopKey']


# Create a vector store into ActiveLoop cloud
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=False)
# Create the StorageContext object
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Add the nodes obtained
storage_context.docstore.add_documents(nodes)
# Create the VectorStore index
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)




Deep Lake Dataset in hub://alejandrotormun/LlamaIndex_paulgraham_essays already exists, loading from the storage
Uploading data to deeplake dataset.


  0%|          | 0/88 [00:27<?, ?it/s]


KeyboardInterrupt: 

Once we have created the index, it can serves as the basis for defining the query engine. We initiate the query engine by using the vector index object with the method `.as_query_engine()`. 

In [12]:
query_engine = vector_index.as_query_engine(similarity_top_k=15)
response = query_engine.query("What does Paul Graham do?")
print(response.response)

{"entity": "founder_of_The_YCombinatorstartup_accelerator", "reason": "Paul Graham is mentioned as 'I' in the essay, stating that he kept working on YC (YC = The Y Combinator) till March 2014."}


We can improve the previous code, giving some metadata, and using the **Sub Question Query Engine**, a querying method that can generate several sub-questions from the user's main question : 

In [13]:
import nest_asyncio
nest_asyncio.apply()  # Apply the patch for nested event loops

Something important to consider is that this technique works with JSON format. Since we are using Ollama models, we have to specify to use this format on model's answers.

In [14]:
from llama_index.core import Settings

Settings.llm = OllamaLLM(model="llama3.1:8b", format="json")

In [15]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine_tools = [
    QueryEngineTool(
        query_engine=query_engine,
        metadata=ToolMetadata(
            name="pg_essay",
            description="Paul Graham essay on What I Worked On"
        )
    )
]

query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    use_async=True
)

response = query_engine.query(
    "How was Paul Grahams life different before, during, and after 'YC'?"
)
print( ">>> The final response:\n", response )

Generated 3 sub questions.
[1;3;38;2;237;90;200m[pg_essay] Q: What did Paul Graham work on before joining YC?
[0m[1;3;38;2;90;149;237m[pg_essay] Q: How was Paul Grahams life during his time at YC?
[0m[1;3;38;2;11;159;203m[pg_essay] Q: What did Paul Graham do after leaving YC?
[0m[1;3;38;2;90;149;237m[pg_essay] A: { 
"Paul Graham's life during his time at Y Combinator reflected his independent-mindedness and ability to adapt to rapid change. He was less influenced by conventional VC practices and customs, which were still based on old constraints despite significant changes in the world. This led him to take unconventional steps, such as renaming the organization after a mathematical concept (the Y combinator), adopting an untraditional color scheme (orange), and transitioning back to self-funding even when it became successful. Graham's perspective also acknowledged that customary VC practices were outdated and that Y Combinator aimed to challenge this status quo by fostering ne

As we have seen, we get a JSON format response. We can remove that format with the following code : 

In [21]:
print(response)

{ "Before YC his life was marked by independent-mindedness, where he worked on writing essays and revisited working on Lisp in his free time. He had a flexible schedule, which allowed him to pursue multiple interests simultaneously. His work-life balance was maintained through these various projects." 
 : "During YC, Paul Graham continued to embody his independent nature despite being part of the startup accelerator. He introduced unconventional elements such as renaming the organization and adopting an orange color scheme, reflecting his resistance to conventional VC practices and customs. This period also saw him transition back to self-funding even when Y Combinator became successful." 
 , "After leaving YC, Paul Graham's life took a different turn, marked by his pursuit of new interests outside of Y Combinator. He started painting full-time, which was a deliberate choice to see how good he could get at it, showcasing his continued interest in trying new things and exploring his cre

In [19]:
response_dict = json.loads(str(response))
plain_text_lines = []

for key, value in response_dict.items():
    plain_text_lines.append(f"{key}: {value}")

plain_text_response = "\n".join(plain_text_lines)

print(plain_text_response)

Before YC his life was marked by independent-mindedness, where he worked on writing essays and revisited working on Lisp in his free time. He had a flexible schedule, which allowed him to pursue multiple interests simultaneously. His work-life balance was maintained through these various projects.: During YC, Paul Graham continued to embody his independent nature despite being part of the startup accelerator. He introduced unconventional elements such as renaming the organization and adopting an orange color scheme, reflecting his resistance to conventional VC practices and customs. This period also saw him transition back to self-funding even when Y Combinator became successful.
After leaving YC, Paul Graham's life took a different turn, marked by his pursuit of new interests outside of Y Combinator. He started painting full-time, which was a deliberate choice to see how good he could get at it, showcasing his continued interest in trying new things and exploring his creativity.: Furt

## Custom Retrievers and Reranking with FlaskRank

We can keep improving the quality of the response by creating a **custom retriever**. Custom retrievers are a combination of different retriever styles. The `RetrieverQueryEngine` class operates with a designed retriever. There are two main `RetrieverQueryEngine` types:

-   *VectorIndexRetriever*: Fetches the top-k nodes that are most similar to the query. It's ideal for situations where precision and relevance to the specific query are paramount, like in detailed research or topic-specific inquiries.
-   *SummaryIndexRetriever*: Retrieves all nodes related to the query without prioritazing their relevance. This approach is less concerned with aligning closely to the specific context of the question and more about providing a broad overview. It's useful in scenarios where a comprehensive sweep of information is needed, regardless of the direct relevance to the specific terms of the query.

While any retrieval mechanism is capable of extracting multiple chunks from a large document, there's always some irrelevant candidates between the nodes selected. Reranking is re-evaluating and re-ordering search results to present the most relevant options. The **Cohere Reranker** improves the performance of retrieving close content. It sorts the search results according to their relevance to the query. **However, that solution is not open-source, while FlashRank is an oper-source rerank solution.**

The process begins with grouping documents into batches, after which the LLM evaluates each batch, giving a **relevance score to each document**. The final step in the reranking process involves aggregating the most relevant documents from all these batches to form the final retrieval response.

In the following code we show how to use it:

In [5]:
retriever

<llama_index.core.indices.vector_store.retrievers.retriever.VectorIndexRetriever at 0x250cc8dd8d0>

In [11]:
from langchain.retrievers.document_compressors import FlashrankRerank
from langchain.retrievers import ContextualCompressionRetriever
from llama_index.core import VectorStoreIndex, Document


query = "What is the capital of the United States?"
documents = [
   "Carson City is the capital city of the American state of Nevada. At the  2010 United States Census, Carson City had a population of 55,274.",
   "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean that are a political division controlled by the United States. Its capital is Saipan.",
   "Charlotte Amalie is the capital and largest city of the United States Virgin Islands. It has about 20,000 people. The city is on the island of Saint Thomas.",
   "Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district. ",
   "Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states.",
   "North Dakota is a state in the United States. 672,591 people lived in North Dakota in the year 2010. The capital and seat of government is Bismarck."
   ]

# Convert raw strings into Document objects
doc_objects = [Document(text=doc) for doc in documents]

# Create an index from the document objects
index = VectorStoreIndex.from_documents(documents=doc_objects)
# Create retriever from the index
retriever = index.as_retriever()

# Create the compression retriever
compressor = FlashrankRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever = retriever
)
compressed_docs = compression_retriever.invoke(
    "What did the president say about Ketanji Jackson Brown"
)


INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"


ValidationError: 1 validation error for ContextualCompressionRetriever
base_retriever
  Input should be an instance of Runnable [type=is_instance_of, input_value=<llama_index.core.indices...t at 0x0000025107E1E7D0>, input_type=VectorIndexRetriever]
    For further information visit https://errors.pydantic.dev/2.9/v/is_instance_of

In [55]:
from llama_index.postprocessor.rankgpt_rerank import RankGPTRerank
from langchain.retrievers.document_compressors import FlashrankRerank
from llama_index.core import VectorStoreIndex, Document
from llama_index.core import QueryBundle

query = "What is the capital of the United States?"
documents = [
   "Carson City is the capital city of the American state of Nevada. At the  2010 United States Census, Carson City had a population of 55,274.",
   "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean that are a political division controlled by the United States. Its capital is Saipan.",
   "Charlotte Amalie is the capital and largest city of the United States Virgin Islands. It has about 20,000 people. The city is on the island of Saint Thomas.",
   "Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district. ",
   "Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states.",
   "North Dakota is a state in the United States. 672,591 people lived in North Dakota in the year 2010. The capital and seat of government is Bismarck."
   ]

# Convert raw strings into Document objects
doc_objects = [Document(text=doc) for doc in documents]

# Create an index from the document objects
index = VectorStoreIndex.from_documents(documents=doc_objects)
# Create retriever from the index
retriever = index.as_retriever()

# We create the ReRank object
reranker = RankGPTRerank(
    llm=Settings.llm,
    top_n=3,
    verbose=False
)

query_engine_reranking = index.as_query_engine(similarity_top_k=10,
                      node_postprocessors=[reranker])

response = query_engine.query(
    "What is the capital of the United States?",
)
print("")
print(f"Response with reranking : {response}")

query_engine_no_reranking = index.as_query_engine(similarity_top_k=10)
response = query_engine.query(
    "What is the capital of the United States?",
)
print("")
print(f"Response without reranking : {response}")

INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"


After Reranking, new rank list for nodes: [2, 0, 5, 4, 3, 1]

INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/embed "HTTP/1.1 200 OK"



Response with reranking : According to the context information:

"The capital of the United States. It is a federal district."

So, the answer is: Washington, D.C.


INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/generate "HTTP/1.1 200 OK"


After Reranking, new rank list for nodes: [2, 0, 4, 5, 3, 1]
Response without reranking : According to the context information, the capital of the United States is Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia).


## Advanced Retrievals

Another alternative for retrieving relevant documents involves using document summaries instead of extracting fragmented snippets or brief text chunks. This technique offers a more thorough grasp of the subject. We will introduce two techniques : 

-   *Recursive Retrieval*: This technique is **useful for documents with a hierarchical structure**, allowing them to form connections and relations between nodes. **This is evident for PDF's documents**. Here's an example of it's implementation : [link](https://docs.llamaindex.ai/en/stable/examples/query_engine/pdf_tables/recursive_retriever/).
-   *Small-to-Big retrieval*: This techniques is divided in two steps : 
    -   *Initial small search*: Search of concise and short sentences. The objective of this step is to identify with precision the most relevant section, instead of analyzing the whole text since the beginning.
    -   *Ampliaton of context-big search*: Once the most relevant section is found, the context is expanded to get a better context(previous ans posterior text).

    This technique is useful when the initial query is very short,or when the relation between documents is very complex.

## Deep Memory

**Deep Memory** is a method developed by ActiveLoop to boost the accuracy of embedding retrieval for RAG systems in DeepLake vector store database. **Deep Memory trains a model that transforms embeddings into a space optimizied for your use case. This reconfiguration significantly improves vector search accuracy.**

Deep Memory is effective where query reformulation, query transformation, or documetn re-ranking might cause latency and increased token usage. It boosts retrieval capabilities without negatively impacting the system's performance.

In Deep Memory, we have the following steps : 

1.   *Embeddings*: Vector representation of a set of words. We create them using embedding models as we have seen previously.
2.   *Deep Memory Training*: A dataset of query and context pairs trains the Deep Memory model. This training process runs in Deep Lake Cloud, which provides the computational resources and infrastructure for handling the training.
3.   *Deep Memory Inference*: The model enters the inference phase, which transforms query embeddings.
4.   *Transformed Embeddings*: The result of the inference process is a set of transformed embeddings optimized for a specific use case.
5.   *Vector Search*: These optimized embeddings are used in vector search.

To put hands-on-practice, we will implement Deep Memory within our workflow. We shoul know that Deep Memory is a **premium feature in ActiveLoop paid plans**. We can use a free trial **(if you follow the ActiveLoop RAG course, you will get an extended free trial)** using GENAI360 promo code in you Deep Lake account.

**Note**: For personal projects on a local machine without any cost, is recommended to use **FAISS** as vector store database. It has been implemented for LlamaIndex.

# Mix of techniques

We can also combine some of the previous techniques, like **RAG+Fine-tuning**. With fine-tuning we can customize the model for a specific style, which can be useful for domains like medical, financial or any area that requires a highly specialized tone of writing. When combined with RAG, the model becomes adept in a specialized area, and gain access to a vast range of external information.

# Challenges of RAG systems

The main challenges of RAG system are these:

-   *Document updates*: When documents are modified, added or eliminated, the corresponding vector needs to be updated.
-   *Chunking and data distribution*: The granularity level is vital in achieving accuratte retrieval results. If the chunk size is too large, important details may be missed, and if it's too small, the system might get bogged down in details and miss bigger picture.
-   *Diverse Representations in Latent Space*: The presence of different representations(text versus tables or images) in the same latent space can be challenging. These diverse representations can cause conflicts.
-   *Compliance*: Non-compliance can lead to legal issues.

# Optimization techniques for RAG systems

Here we will presented several optimization strategies : 
-   *Model selection and hybrid retrieval*: Selecting appropriate models for the embedding and generation phases is critical. Choosing efficient and cheap embedding models can minimize costs while maintaining performance levels, but not in the generation process where an LLM is needed. 

    Combining different methods, like keyword and embedding retrieval with reranking, ensures that the system is fast enough to meet user expectations while still providing accurate results
-   *CPU-bases inference*: Intel®'s advanced optimization technologies help with the efficient fine-tuning and inference of neural network models on CPUs. The 4th Gen Intel® Xeon® . Scalable processors come with Intel® Advanced Matrix Extensions (Intel® AMX), an AI-enhanced acceleration feature. Each core of these processors includes integrated BF16 and INT8 accelerators, contributing to the acceleration of deep learning fine-tuning and inference speed. Additionally, libraries such as Intel Extension for PyTorch and Intel® Extension for Transformers further optimize the performance of neural network models demanding computations on CPUs.
-   *Retrieval performance*: We can get some failures during document retrieval, as individual segments may lack the broader context necessary to answer specific queries.  LlamaIndex offers features designed to construct a network of interlinked chunks (nodes), along with retrieval tools. These tools improve search capabilities by augmenting user queries, extracting key terms, or navigating through the connected nodes to locate the necessary information for answering queries.

    The LlamaIndex framework provides a variety of retrieval methods, complete with practical examples for different use cases, including the following examples, to name a few:
    -   Combining keyword + embedding search in a hybrid approach can enhance retrieval of specific queries. [link](https://docs.llamaindex.ai/en/stable/examples/query_engine/CustomRetrievers/)
    -   Metadata filtering can provide additional context and improve the performance of the RAG pipeline. [link](https://docs.llamaindex.ai/en/stable/examples/vector_stores/WeaviateIndexDemo/#metadata-filtering)
    -   Re-ranking orders the search results by considering the recency of data to the user’s input query. [link](https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/CohereRerank/)
    -   Indexing documents by summaries and retrieving relevant information within the document. [link](https://docs.llamaindex.ai/en/stable/examples/index_structs/doc_summary/DocSummary/)

    

# RAG best practices

Here are some of the best practices for dealing with RAG:

-   *Fine-tuning the Embedding Model*: Initially, it’s necessary to get the training set, which can be done by generating synthetic questions/answers from random documents. The next phase is fine-tuning the model, where adjustments are made to optimize its functioning. Following this, the model can optionally undergo an evaluation process to assess its improvements. The reported numbers from LlamaIndex show that the fine-tuning process can yield a 5-10% improvement in retrieval metrics, enabling the enhanced model to be effectively integrated into RAG applications. You can read [here](https://docs.llamaindex.ai/en/stable/optimizing/fine-tuning/fine-tuning/#finetuning-embeddings) for more information.
-   *Evaluation*: Regularly monitoring the performance of your RAG pipeline is a recommended practice, as it allows for assessing changes and their impact on the overall results. While evaluating a model's response, which can be highly subjective, is challenging, there are several methods available to track progress effectively. LlamaIndex provides modules for assessing the quality of the generated results and the retrieval process [link](https://docs.llamaindex.ai/en/stable/optimizing/evaluation/evaluation.html). 
-   *Generative Feedback Loops*: A key aspect of generative feedback loops is injecting data into prompts. This process involves feeding specific data points into the RAG system to generate contextualized outputs. Once the RAG system generates descriptions or vector embeddings, these outputs can be stored in the database. The creation of a loop where generated data is continually used to enrich and update the database can improve the system's ability to produce better outputs.
-   *Hybrid Search*: It is essential to keep in mind that embedding-based retrieval is not always practical for entity lookup. Implementing a hybrid search that combines the benefits of keyword lookup with additional context from embeddings can yield better results, offering a balanced approach between specificity and context.