# Learn RAG From Scratch - Python AI from a LangChain Engineer
## 17/04/2024

<div style="background-color: #ADD8E6; padding: 10px; border: 1px solid #ADD8E6">
    
# Table of Contents
## Overview
## Basics
* [Indexing](#indexing)
* [Retrieval](#retrieval)
* [Generation](#generation)

## Advanced
* [Query Translation (Multi Query)](#query-translation-multi-query)
* [Query Translation (RAG Fusion)](#query-translation-rag-fusion)
* [Query Translation (Decomposition)](#query-translation-decomposition)
* [Query Translation (Step Back)](#query-translation-step-back)
* [Query Translation (HyDE)](#query-translation-hyde)
* [Routing](#routing)
* [Query Construction](#query-construction)
* [Indexing (Multi Representation)](#indexing-multi-representation)
* [Indexing (RAPTOR)](#indexing-raptor)
* [Indexing (ColBERT)](#indexing-colbert)
* [CRAG](#crag)
* [Adaptive RAG](#adaptive-rag)
* [The Future of RAG](#the-future-of-rag)

## Overview

In this course Lance Martin will teach you how to implement RAG from scratch. Lance is a software engineer at LangChain, and LangChain is one of the most common ways to implement RAG. Lance will help you understand how to use RAG to combine custom data with LLMs.

I’m going to show you a short course focused on RAG (Retrieval Augmented Generation), which is one of the most popular ideas in LLM applications today. The motivation for this is that most of the world’s data is private, whereas LLMs are trained on publicly available data.

Note: more than 95% of the world's data is "private", but we can "feed it" to LLMs.

<div style="text-align: center;">
<img src="RAG2.5.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 1</b>: Comparison of LLMs / Pre-training VS Context Window (Number of Tokens)</p>
</div>

Of interest:

https://huggingface.co/blog/mixtral

https://x.com/RihardJarc/status/1778082161595208124

So, along the x-axis you can see the number of tokens used for pre-training various LLMs. It kind of varies from 1.5 trillion tokens in the case of smaller models like Phi-2, out to some very large number for proprietary models like GPT-4 and Claude 3. But what’s really interesting is that the context window, or the ability to feed external information into these LLMs, is actually getting larger. About a year ago, context windows were between 4,000 and 8,000 tokens, which is maybe a dozen pages of text.

We’ve recently seen models all the way out to a million tokens, which is thousands of pages of text. So, while these LLMs are trained on large-scale public data, it’s increasingly feasible to feed them this huge mass of private data that they’ve never seen. That private data can be your personal data, corporate data, or any other information that you want to pass to an LLM and that’s not natively in its training set.

So, that is the main motivation for RAG. It’s really the idea that, one, LLMs are the center of a new kind of operating system, and two, it’s increasingly critical to be able to feed information from external sources, such as private data, into LLMs for processing.

<div style="text-align: center;">
<img src="RAG3.5.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 2</b>: Connecting LLM to external data is a central need</p>
</div>

Of interest:

https://x.com/karpathy/status/1707437820045062561

RAG refers to Retrieval Augmented Generation, and you can think of it in three very general steps. First, there’s a process of indexing external data, so you can think about this as building a database. Many companies already have large-scale databases in different forms: they could be SQL or other relational DBs, or they could be vector stores. The point is that documents are indexed such that they can be retrieved based upon some heuristics relative to an input, like a question. Second, those relevant documents are retrieved and passed to an LLM. Third, the LLM produces answers that are grounded in that retrieved information. That’s the central idea behind RAG and why it’s a really powerful technology: it unites the knowledge and processing capacity of LLMs with large-scale private external data sources, where most of the important data in the world still lives.

<div style="text-align: center;">
<img src="/RAG4.5.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 3</b>: Retrieval Augmented Generation (RAG) General Diagram Flow</p>
</div>

Next, we’re going to build up a complete understanding of the RAG landscape and cover a bunch of interesting papers and techniques that explain how to do RAG. I’ve broken it down into a few different sections:

<div style="text-align: center;">
<img src="RAG1.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 4</b>: General Workflow of RAG with LangChain</p>
</div>

- So, starting with a question on the left, the first section is what I call Query Translation. This captures a bunch of different methods to take a question from a user and modify it in some way to make it better suited for retrieval from one of these indexes we’ve talked about. That can mean methods like query rewriting, or it can mean decomposing the query into constituent sub-questions.

- Then, there’s a question of Routing, so, taking that decomposed or re-written question and routing it to the right place. You might have multiple Vector Stores, a Relational DB, a Graph DB, and a Vector Store, so it’s the challenge of getting a question to the right source.

- Then, there’s the challenge of Query Construction, which is basically taking natural language and converting it into the DSL (Domain Specific Language) necessary for whatever data source you want to work with. A classic example here is Text-to-SQL, which is a very well-studied process, but Text-to-Cypher for Graph DBs is very interesting, and Text-to-Metadata-Filters for Vector DBs is also a very big area of study.

- Then, there’s Indexing, which is the process of taking your documents and processing them in some way so that they can be easily retrieved. There’s a bunch of techniques for that which we’ll talk through, including different embedding methods and different indexing strategies.

- After Retrieval, there are different techniques to re-rank or filter the retrieved documents.

- And then finally, we’ll talk about Generation, and an interesting new set of methods for what we might call Active RAG. In the Retrieval or Generation stage, you grade documents and grade answers: grade for relevance to the question and grade for faithfulness to the documents, i.e., check for hallucinations. If either fails, you feed back: re-retrieve or rewrite the question, re-generate the answer, and so forth. So there’s a really interesting set of methods we’re going to talk through that cover retrieval and generation with feedback.

## Basics

In terms of general outline, we’ll cover the basics first: we’ll go through Indexing, Retrieval, and Generation in a bare-bones way, and then we’ll talk through the more advanced techniques that we just saw on the prior slide: Query Transformations, Routing, Query Construction, and so forth.

### <a id="indexing"></a>Indexing

So, in the previous section we saw the main overall components of RAG pipelines: Indexing, Retrieval, and Generation, and here we’re going to kind of deep dive on indexing and give a quick overview of it.

<div style="text-align: center;">
<img src="RAG8.5.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 5</b>: Indexing Stage</p>
</div>

The first aspect of indexing is that we have some external documents that we actually want to load and put into what we’re going to call a Retriever. The goal of this Retriever is simple: given an input question, I want to fish out documents that are related to my question in some way.

<div style="text-align: center;">
<img src="RAG11.5.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 6</b>: Document Loading</p>
</div>

Now, that relationship, or relevance, or similarity is typically established using some kind of numerical representation of documents. The reason is that it is very easy to compare vectors of numbers, for example, relative to just free-form text. So a lot of approaches have been developed over the years to take text documents and compress them down into a numerical representation that can then be very easily searched.

<div style="text-align: center;">
<img src="RAG12.5.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 7</b>: Numerical Representation for Search</p>
</div>

There are a few ways to do that. Google and others came up with many interesting statistical methods where you take a document, look at the frequency of words, and build what they call Sparse Vectors: the vector positions correspond to a large vocabulary of possible words, and each value represents the number of occurrences of that particular word. It’s sparse because there are of course many zeros, since the vocabulary is very large relative to what’s present in the document, and there are very good search methods over this type of numerical representation. A bit more recently, machine-learned embedding methods have been developed, where you take a document and build a compressed, fixed-length representation of it, with correspondingly very strong search methods over embeddings.
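To make the word-count idea concrete, here is a minimal sketch of a sparse vector built with nothing but the standard library. The three sample documents, the whitespace tokenization, and the helper names `build_vocabulary` and `sparse_vector` are assumptions made for illustration; real statistical retrievers use more refined weighting than raw counts.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign every distinct lowercase word a fixed position in the vector."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def sparse_vector(document, vocabulary):
    """Return {position: count}, storing only the non-zero entries."""
    counts = Counter(document.lower().split())
    return {vocabulary[word]: n for word, n in counts.items() if word in vocabulary}


# Hypothetical toy corpus, only for illustration.
documents = [
    "my favorite pet is a cat",
    "the cat sat on the mat",
    "dogs chase the ball",
]

vocabulary = build_vocabulary(documents)
vector = sparse_vector(documents[1], vocabulary)

print("vocabulary size:", len(vocabulary))
print("non-zero entries:", len(vector))
print("count for 'the':", vector[vocabulary["the"]])
```

Most positions of the vector are zero and are simply not stored, which is what makes the representation sparse.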

<div style="text-align: center;">
<img src="RAG13.5.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 8</b>: Statistical and Machine Learned Representations</p>
</div>

So, the intuition here is that we take documents and typically split them, because embedding models actually have limited context windows, on the order of maybe 512 tokens up to 8,000 tokens or beyond, but they are not infinitely large. So documents are split, and each split is compressed into a vector that captures the semantic meaning of the text itself. The vectors are indexed, questions can be embedded in exactly the same way, and then a numerical comparison, using a variety of methods, can be performed on these vectors to fish out documents relevant to my question.

<div style="text-align: center;">
<img src="RAG14.5.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 9</b>: Loading, Splitting and Embedding (Index makes documents easy to retrieve)</p>
</div>

## RAG From Scratch: Parts 1-4

Let’s just do a quick code walk through on some of these points. I’ve installed here some packages.

### Environment

(1) Packages

In [1]:
!pip install langchain_community tiktoken langchain-openai langchainhub chromadb langchain

I’ve also set a few environment variables for LangSmith, which is very useful for tracing, as we’ll see shortly.

(2) LangSmith

https://docs.smith.langchain.com/

In [None]:
import os

os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_API_KEY'] = '<your-api-key>'  # replace with your LangSmith key

(3) API Keys

In [None]:
import os
os.environ['OPENAI_API_KEY'] = '<your-api-key>'  # replace with your OpenAI key

### Part 1: Overview

In [None]:
import bs4
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

#### INDEXING ####

#Load Documents
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")        
        )    
    ),
)
docs = loader.load()

#Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

#Embed
vectorstore = Chroma.from_documents(documents=splits,
                                    embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()

#### RETRIEVAL and GENERATION ####

#Prompt
prompt = hub.pull("rlm/rag-prompt")

#LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

#Chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Question
rag_chain.invoke("What is Task Decomposition?")

#### Codeium Detailed Explanation

Importing Libraries

The code starts by importing various libraries:

- bs4: a Python library for parsing HTML and XML documents.
- langchain: a library for building and interacting with large language models.
- langchain.text_splitter: a module for splitting text into chunks.
- langchain_community.document_loaders: a module for loading documents from various sources.
- langchain_community.vectorstores: a module for storing and retrieving vector embeddings.
- langchain_core.output_parsers: a module for parsing output from language models.
- langchain_core.runnables: a module for defining runnable tasks.
- langchain_openai: a module for interacting with OpenAI's language models.


Indexing

The code then defines an indexing pipeline, which consists of the following steps:

1. Loading Documents: The code uses the WebBaseLoader class to load documents from a specified URL. In this case, the URL is a blog post hosted on GitHub Pages. The bs_kwargs parameter is passed to BeautifulSoup, and the SoupStrainer restricts parsing to elements with the CSS classes post-content, post-title, and post-header.
2. Splitting Text: The code uses the RecursiveCharacterTextSplitter class to split the loaded documents into chunks. This is done to prepare the text for embedding and retrieval.


RecursiveCharacterTextSplitter

The RecursiveCharacterTextSplitter class is a text splitter that splits text into chunks based on a specified chunk size and overlap. The hyperparameters for this class are:

- chunk_size: The maximum size of each chunk in characters. In this case, the chunk size is set to 1000 characters.
- chunk_overlap: The amount of overlap between chunks in characters. In this case, the chunk overlap is set to 200 characters.

The chunk size and overlap determine the boundaries of each chunk. The splitter produces chunks of at most chunk_size characters, and repeats up to chunk_overlap characters from the end of one chunk at the start of the next. The overlap preserves context across chunk boundaries, so text near a boundary appears in both neighboring chunks.
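The interaction between the two parameters can be sketched with a plain sliding window over characters. This is a simplified illustration, not how RecursiveCharacterTextSplitter is implemented: the real splitter prefers to cut at separators such as paragraph breaks, so its chunks will not line up exactly with these. The function name `chunk_text` and the sample string are hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into character windows where neighbors share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the context the overlap is meant to preserve.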

How to Change Chunk Size and Overlap

The choice of chunk size and overlap depends on the specific use case and the characteristics of the text data. Here are some general guidelines:

- Chunk Size: A larger chunk size gives each chunk more context, but a single fixed-length embedding then has to summarize more text, so matches can become less precise, and more tokens are passed to the LLM per retrieved chunk. A smaller chunk size gives more focused matches, but may also result in worse answers due to lack of context. A reasonable starting point for chunk size is between 500 and 2000 characters.
- Chunk Overlap: A larger chunk overlap can help preserve context between chunks, but may also result in more redundant information being stored. A smaller chunk overlap stores less redundant information, but may also result in worse answers due to lack of context. A reasonable starting point for chunk overlap is between 100 and 500 characters, and it must stay smaller than the chunk size.

For example, if you have a large corpus of text and want to keep the index small, you may want to increase the chunk size to 2000 characters and decrease the chunk overlap to 100 characters. On the other hand, if you have a small corpus of text and want more focused matches, you may want to decrease the chunk size to 500 characters while keeping an overlap of around 100 characters.

Embedding and Retrieval

The code then defines an embedding and retrieval pipeline, which consists of the following steps:

1. Embedding: The code uses the Chroma class to create a vector store from the split documents. The OpenAIEmbeddings class is used to generate embeddings for each chunk.
2. Retrieval: The code uses the as_retriever method to create a retriever object from the vector store.

Retrieval and Generation

The code then defines a retrieval and generation pipeline, which consists of the following steps:

1. Prompt: The code uses the hub.pull method to retrieve a prompt from a remote repository.
2. LLM: The code uses the ChatOpenAI class to create a language model object.
3. Post-processing: The code defines a format_docs function to format the retrieved documents.
4. Chain: The code defines a chain that pipes the retriever output through format_docs into the prompt, then into the language model, and finally into StrOutputParser, which returns the answer as a string.

Invocation

Finally, the code invokes the chain with the question "What is Task Decomposition?" and returns the answer, which the notebook displays as the cell output.

### Part 2: Indexing

Here I’ll dive a little deeper into indexing. I take a question and a document, and first I’m just going to compute the number of tokens in, for example, the question. This is interesting because embedding models, and LLMs more generally, operate on tokens, so it’s nice to know how large the documents are that I’m trying to feed in. In this case it’s obviously a very small question.

In [None]:
# Documents
question = "What kinds of pets do I like?"
document = "My favorite pet is a cat"

[Count tokens](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb) considering [~4 char / token](https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them)

In [None]:
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

num_tokens_from_string(question, "cl100k_base")
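As a quick sanity check next to the exact count, the "~4 characters per token" rule of thumb linked above can be sketched in a couple of lines. The function name `estimate_tokens` is hypothetical, and the result is only a rough estimate for English text; the tiktoken count is the one to rely on.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate from character length (rule of thumb, not exact)."""
    return math.ceil(len(text) / chars_per_token)


question = "What kinds of pets do I like?"
print(len(question), "characters, roughly", estimate_tokens(question), "tokens")
```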

Now, OpenAIEmbeddings is where the embedding model is specified. I just pass my question and my document to 'embd.embed_query', and what you can see when it runs is that each is mapped to a vector of length 1536. That fixed-length vector representation will be computed for both, and really for any document, so you are always computing a fixed-length vector that encodes the semantics of the text that you've passed.

[Text embedding models](https://python.langchain.com/docs/integrations/text_embedding/openai)

In [None]:
from langchain_openai import OpenAIEmbeddings
embd = OpenAIEmbeddings()
query_result = embd.embed_query(question)
document_result = embd.embed_query(document)
len(query_result)

Now I can do things like Cosine Similarity to compare them. [Cosine similarity](https://platform.openai.com/docs/guides/embeddings/frequently-asked-questions) is recommended (1 indicates identical) for OpenAI embeddings.

In [None]:
import numpy as np

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

similarity = cosine_similarity(query_result, document_result)
print("Cosine Similarity:", similarity)

And as we'll see here, some documents can be loaded, this is just like we saw previously. 

[Document Loaders](https://python.langchain.com/docs/integrations/document_loaders/)

In [None]:
#### INDEXING ####

# Load Block
import bs4 
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")        
        )    
    ),
)
blog_docs = loader.load()

The loaded documents can then be split:

[Splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter)

> This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.

In [None]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300,
    chunk_overlap=50)

# Make splits
splits = text_splitter.split_documents(blog_docs)
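To see what "split on separators in order until the chunks are small enough" means, here is a simplified sketch of the recursive idea in plain Python. It is an illustration under assumptions, not LangChain's implementation: it measures size in characters rather than tokens, it ignores chunk overlap, and the function name `recursive_split` and the sample text are hypothetical.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first, recursing only into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbors together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


text = (
    "First paragraph here.\n\n"
    "Second paragraph is a little bit longer than the first.\n\n"
    "Third."
)
for chunk in recursive_split(text, chunk_size=30):
    print(repr(chunk))
```

Paragraphs that fit stay whole, and only the oversized second paragraph is broken further at word boundaries.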

[Vectorstores](https://python.langchain.com/docs/integrations/vectorstores/)

In [None]:
# Index
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(documents=splits,
                                   embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()

### <a id="retrieval"></a>Retrieval

Next, we'll see how to actually do Retrieval by using this Vector Store. Previously I outlined Indexing and gave an overview of this flow, which starts with indexing of our documents, then retrieval of documents relevant to our question, and then generation of answers based on the retrieved documents.

As we saw, the indexing process basically makes documents easy to retrieve, and it goes through a flow that looks like this: it takes our documents and splits them into smaller chunks that can be easily embedded. Those embeddings are numerical representations of the documents that are easily searchable, and they are stored in an index. When given a question, which is also embedded, the index performs a similarity search and returns the splits that are relevant to the question (see Fig 9).

Now, if we dig a little bit more under the hood, we can think about it like this: when we take a document and embed it, let's imagine that embedding just had three dimensions, so, each document is projected into some point in this 3D space:

<div style="text-align: center;">
<img src="RAG15.5.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 10</b>: Retrieval Powered via Similarity Search</p>
</div>

Now, the point is that the location in space is determined by the semantic meaning or content of that document, so it follows that documents in similar locations in space contain similar semantic information. This very simple idea is really the cornerstone for a lot of the search and retrieval methods that you'll see with modern Vector Stores. So, in particular, we take our documents and embed them into this space (in this case a toy 3D space), we take our question and do the same, and we can then do a search, like a local neighborhood search. You can think about it as looking around our question in this 3D space and asking: "Hey! What documents are nearby?" These nearby neighbors are then retrieved because they have similar semantics relative to our question.

And that's really what's going on here. So again, we took our documents, split them, and embedded them, and now they exist in this high-dimensional space. We've taken our question, embedded it, and projected it into that same space, and we just do a search around the question for nearby documents and grab the ones that are close. We can pick some number: for example, we can say we want one, or two, or three, or N documents close to my question in this embedding space, and there are a lot of really interesting methods that implement this very effectively.
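The neighborhood search described above can be sketched in the toy 3D setting with a brute-force comparison against every document. The hand-written vectors, document ids, and the helper name `top_k` are made up for illustration; real vector stores use learned embeddings with far more dimensions and typically much faster approximate search.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, document_vectors, k):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        document_vectors.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical 3D "embeddings", chosen by hand for illustration.
document_vectors = {
    "cats": (0.9, 0.1, 0.0),
    "dogs": (0.8, 0.3, 0.1),
    "taxes": (0.0, 0.2, 0.9),
}
query_vector = (1.0, 0.0, 0.0)

print(top_k(query_vector, document_vectors, k=2))
```

Changing `k` changes only how many of the ranked neighbors are returned, not how they are ranked.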

<div style="text-align: center;">
<img src="RAG16.5.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 11</b>: Vectorstores Implement this for you</p>
</div>

And we have a lot of really nice integrations to play with this general idea, so many different embedding models, many different indexes, lots of document loaders, and lots of splitters that can be kind of recombined to test different ways of doing this indexing retrieval. 

<div style="text-align: center;">
<img src="RAG17.5.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 12</b>: LangChain has many integrations to support this</p>
</div>

Now I'll show you a bit of a code walkthrough. In the slide we showed that notion of search in the 3D space. A nice parameter to think about when building your retriever is **K**. **K** tells you the number of nearby neighbors to fetch when you do that retrieval process we talked about in the 3D space. Do I want one nearby neighbor? Or two? Or three? Here we can specify K = 1, for example. Now we are building our index, so we're taking every split, embedding it, and storing it.

### Part 3: Retrieval

In [None]:
# Index
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=splits,
                                   embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

Now, what's nice is that I asked a question: "What is task decomposition?" this is related to the blog post, and I'm going to run 'get_relevant_documents':

In [None]:
docs = retriever.get_relevant_documents("What is Task Decomposition?")

So, I run that, and now how many documents do I get back? I get one, as expected with **K** = 1.

In [None]:
len(docs)

So, this retrieved document should be related to my question. Here we can go to LangSmith and look at our retriever, where we can see our question and the document we got back. This document pertains to task decomposition in particular, and it lays out a number of different approaches that can be used to do that. So you can implement this kind of KNN (K-Nearest-Neighbors) search really easily, using just a few lines of code.

### <a id="generation"></a>Generation
An important consideration in Generation is that what’s really happening is we’re taking the documents you retrieve and stuffing them into the LLM context window. So if we walk back through the process: we take documents, we split them for convenience of embedding, we embed each split, and we store that in a Vector Store as an easily searchable numerical representation, or vector. We then take a question and embed it to produce a numerical representation, and we can search, for example using something like KNN in this high-dimensional space, for documents that are similar to our question based on their proximity or location in this space (as in the toy 3D example).

<div style="text-align: center;">
<img src="RAG18.5.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 13</b>: Adding Docs to Context Window</p>
</div>

Now we’ve recovered relevant splits to our question, we pack them into the context window, and we produce our answers.

Now, this introduces the notion of a Prompt. A **Prompt** is a template with placeholders, which in our case are **keys** like **context** and **question**. They are basically buckets: we're going to take the retrieved documents and slot them into one, and take our question and slot it into the other. If you walk through this flow, you can see that we build a dictionary from our retrieved documents and our question, and then populate our prompt template with the values from that dictionary. The result is a prompt value which can be passed to an LLM such as a chat model, resulting in chat messages which we then parse into a string to get our answer.
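The "buckets" idea can be sketched with plain Python string formatting, used here as a stand-in for a prompt template class. The helper name `build_prompt` and the sample document are hypothetical; LangChain's prompt templates do more than this, such as producing chat messages rather than a single string.

```python
# Template with two placeholders (keys): context and question.
template = (
    "Answer the question based only on the following context:\n"
    "{context}\n\n"
    "Question: {question}\n"
)


def build_prompt(template, retrieved_docs, question):
    """Build the dictionary of values and slot them into the template."""
    values = {
        "context": "\n\n".join(retrieved_docs),  # all retrieved text in one bucket
        "question": question,
    }
    return template.format(**values)


# Hypothetical retrieved document, only for illustration.
retrieved_docs = ["Task decomposition splits a big task into smaller steps."]
prompt_text = build_prompt(template, retrieved_docs, "What is Task Decomposition?")
print(prompt_text)
```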

<div style="text-align: center;">
<img src="RAG19.5.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 14</b>: Connecting Retrieval with LLMs via Prompt</p>
</div>

Now, let's walk through that in code. Here's the Generation bit, and you can see that it defines something new: a **Prompt Template**. My Prompt Template is something really simple; it just says: "Answer the question based only on the following context":

### Part 4: Generation

In [None]:
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

# Prompt
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
prompt

Let's define the LLM:

In [4]:
# LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

Now this introduces the notion of a Chain. In LangChain we have an expression language called LCEL (LangChain Expression Language), which lets you really easily compose things like prompts, LLMs, parsers, retrievers, and other components. The very simple idea here is just to take the prompt we defined right before and connect it to the LLM we defined above. So, there's our chain:

In [None]:
# Chain
chain = prompt | llm

Now, all we are doing is invoking that chain. Every LCEL runnable has a few common methods like 'invoke', and in this case we invoke it with a dict whose keys, context and question, map to the expected keys in our template. It is going to execute the chain and get our answer.

In [None]:
# Run
chain.invoke({"context":docs, "question":"What is Task Decomposition?"})
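To build intuition for what `prompt | llm` and `invoke` are doing, here is a toy sketch of pipe-style composition in plain Python. It is not LangChain's implementation: the `Step` class, the fake LLM, and the sample inputs are all hypothetical stand-ins, and the only point is that `|` wires the output of one step into the input of the next.

```python
class Step:
    """Wraps a function so that steps can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # Composition: run this step first, then feed its output to the next one.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Hypothetical stand-ins for a prompt template, a chat model, and an output parser.
fill_prompt = Step(
    lambda inputs: "Context: {context}\nQuestion: {question}".format(**inputs)
)
fake_llm = Step(lambda prompt_text: "ANSWER<" + prompt_text + ">")
strip_output = Step(str.strip)

chain = fill_prompt | fake_llm | strip_output
result = chain.invoke({"context": "cats are pets", "question": "What is a cat?"})
print(result)
```

The dict passed to `invoke` plays the same role as the `{"context": ..., "question": ...}` dict in the cell above: its keys have to match what the first step expects.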

We can go to LangSmith now and here we should see a very simple runnable sequence. We should see our document, our output, our prompt that says "answer the following question based on the context".

There are a lot of other options for RAG prompts. Next is a popular prompt with a little bit more detail, but the intuition is the same. The prompt is: "You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"

In [None]:
from langchain import hub
prompt_hub_rag = hub.pull("rlm/rag-prompt")

You pass in documents and ask the LLM to reason about them, given a question, to produce an answer. Now I'm going to define a RAG chain which will automatically do the retrieval for us. All I have to do is specify the retriever we defined before and the question we invoke with. The question is passed through under the key question in our dictionary, and it also triggers the retriever, which returns documents that get passed in as our context. So this is all automated for us. That dictionary then populates our prompt, which goes to the LLM and then to the output parser. Now we invoke it, and that should all just run.

[RAG chains](https://python.langchain.com/docs/expression_language/get_started#rag-search-example)

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain.invoke("What is Task Decomposition?")
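The dict at the head of the chain is the part that is easiest to misread: every value in it receives the same input, and the results are collected under the dict's keys. A minimal sketch of that behaviour in plain Python follows; `fake_retriever`, `passthrough` and `run_parallel` are hypothetical names used only for illustration, not LangChain APIs.

```python
def fake_retriever(question):
    # Hypothetical retriever: returns canned text instead of searching a vector store
    return [f"doc about: {question}"]


def passthrough(value):
    # Plays the role of RunnablePassthrough: returns its input unchanged
    return value


def run_parallel(steps, value):
    """Call every step in the dict with the same input and collect the results."""
    return {key: step(value) for key, step in steps.items()}


prompt_inputs = run_parallel(
    {"context": fake_retriever, "question": passthrough},
    "What is Task Decomposition?",
)
print(prompt_inputs)
```

The resulting dict has exactly the two keys, context and question, that the prompt template expects.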

And great, we get an answer and we can look at the trace and see everything that happened by using LangSmith. 

## Advanced

### <a id="query-translation-multi-query"></a>Query Translation (Multi Query)
We’re going to talk about Query Translation, and specifically to cover the topic of multi-query. So, Query Translation sits at the first stage of an Advanced RAG pipeline, and the goal of Query Translation is really to take an input user question and translate it in some way in order to improve retrieval.

<div style="text-align: center;">
<img src="RAG5.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 15</b>: Query Translation Stage</p>
</div>

So, the problem statement is pretty intuitive, user queries can be ambiguous and because we’re typically doing some kind of semantic similarity search between the query and our documents, if the query is poorly written or ill posed, we won’t retrieve the proper documents from our index.

<div style="text-align: center;">
<img src="RAG19.6.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 16</b>: Tweet about difficulties on embedding long documents and ambiguous queries</p>
</div>

So, there are a few approaches to attack this problem and you can group them in a few different ways. Here’s one way I like to think about it: some approaches involve Query Rewriting, that is, taking a query and reframing it, like writing it from a different perspective. That’s what we’re going to talk about in some depth here, using approaches like Multi-Query or RAG-Fusion, which we’ll talk about in the next lesson.

You can also take a question and break it down into Sub-Questions to make it less abstract, and there are a bunch of interesting papers focused on that, like Least-to-Most from Google. Or you can take the opposite approach and make a question more abstract: there’s an approach we’ll talk about in a later chapter called Step-Back Prompting, which asks a higher-level question derived from the input.

<div style="text-align: center;">
<img src="RAG20.5.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 17</b>: General Approaches to Transform Questions (Multi-Query)</p>
</div>

So, the intuition for this multi-query approach is that we’re taking a question and we’re going to break it down into a few differently worded questions from different perspectives.

<div style="text-align: center;">
<img src="RAG21.5.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 18</b>: Transform a Question into Multiple Perspectives</p>
</div>

And the intuition here is simply that the way a question is initially worded, once embedded, may not be well aligned with, or in close proximity in this high-dimensional embedding space to, a document that we want to retrieve and that is actually related. So the thinking is that by rewriting it in a few different ways you increase the likelihood of retrieving the document that you really want. Because of nuances in the way that documents and questions are embedded, this more shotgun approach of taking a question and fanning it out into a few different perspectives may improve the reliability of retrieval.

<div style="text-align: center;">
<img src="RAG21.6.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 19</b>: Intuition: Improve Search</p>
</div>
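The proximity argument can be illustrated with a toy calculation. The vectors below are invented 3-dimensional values chosen for illustration only (real embeddings have hundreds or thousands of dimensions and come from an embedding model); the sketch only shows that one of several rewrites can land closer to the target document than the original wording does.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors, not produced by any embedding model
document = [0.9, 0.1, 0.0]
original_query = [0.2, 0.9, 0.1]
rewrites = [[0.3, 0.8, 0.2], [0.8, 0.3, 0.0], [0.1, 0.7, 0.6]]

original_score = cosine_similarity(original_query, document)
best_rewrite_score = max(cosine_similarity(q, document) for q in rewrites)
print(original_score, best_rewrite_score)
```

With these made-up numbers the second rewrite points in nearly the same direction as the document, so it scores much higher than the original query.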

And of course we can combine this with retrieval: we take our fanned-out questions, do retrieval on each one, combine the results in some way and perform RAG.

<div style="text-align: center;">
<img src="RAG22.5 - Multi-Query - Query Translation.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 20</b>: Multi-Query Diagram - use this with parallelized retrieval</p>
</div>
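As a rough sketch of what parallelized retrieval could look like outside LangChain, the following uses Python's standard `ThreadPoolExecutor` to run one retrieval per query. The `fake_retrieve` function and its tiny keyword corpus are hypothetical stand-ins for a real retriever call.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_retrieve(query):
    # Hypothetical stand-in for a retriever call: matches keywords to canned document IDs
    corpus = {
        "decomposition": ["doc_a", "doc_b"],
        "planning": ["doc_b", "doc_c"],
    }
    return [doc for keyword, docs in corpus.items() if keyword in query for doc in docs]


queries = [
    "What is task decomposition?",
    "How do agents handle planning?",
    "task decomposition and planning in agents",
]

# executor.map returns results in the same order as `queries`
with ThreadPoolExecutor(max_workers=4) as executor:
    results_per_query = list(executor.map(fake_retrieve, queries))

print(results_per_query)
```

The result is a list of lists, one list of documents per query, which is the shape that the combining step then has to merge.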

Now, let’s go to our code. We start by installing the packages and setting the LangChain API keys; we’ll see shortly why that’s quite useful.

Note: the following Environment section is the same as the one at the beginning.

### Environment

In [5]:
!pip install langchain_community tiktoken langchain-openai langchainhub chromadb langchain


I’ve set a few API keys for LangSmith which are very useful for tracing which we’ll see shortly 

(2) LangSmith

https://docs.smith.langchain.com/

In [None]:
import os

os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_API_KEY'] = '<your-api-key>'  # replace with your LangSmith API key

### Part 5: Multi-Query

First I'm going to load this blog post on agents, split it, and then index it locally in Chroma, which serves as the vector store, as we've done previously. Now I have my index defined.

Docs:

* https://python.langchain.com/docs/modules/data_connection/retrievers/MultiQueryRetriever

### Index

In [None]:
#### INDEXING ####

# Load Blog
import bs4
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")        
        )    
    ),
)
blog_docs = loader.load()

#Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300,
    chunk_overlap=50)

#Make Splits 
splits = text_splitter.split_documents(blog_docs)

#Index
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=splits,
                                   embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()
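To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window sketch over a list of tokens. It is only an illustration: `RecursiveCharacterTextSplitter` also splits recursively on separators such as paragraphs and sentences, which this hypothetical `chunk_with_overlap` helper does not attempt.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the window has reached the end of the token list
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Integers stand in for token IDs
tokens = list(range(10))
chunks = chunk_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk repeats the last token of the previous one, which is the role the overlap of 50 tokens plays in the cell above: content near a chunk boundary appears in both neighbouring chunks.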

Next, I'm going to define my prompt for Multi-Query; that's the template. We pass the prompt to an LLM, parse the output into a string and then split the string by newlines, so this chain gives us a list of questions.

### Prompt

In [None]:
from langchain.prompts import ChatPromptTemplate

# Multi Query: Different Perspectives
template = """ You are an AI language model assistant. Your task is to generate five 
different versions of the given user question to retrieve relevant documents from a vector
database. By generating multiple perspectives on the user question, your goal is to help
the user overcome some of the limitations of the distance-based similarity search.
Provide these alternative questions separated by newlines. Original question: {question}"""
prompt_perspectives = ChatPromptTemplate.from_template(template)

from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

generate_queries = (
    prompt_perspectives
    | ChatOpenAI(temperature=0)
    | StrOutputParser()
    | (lambda x: x.split("\n"))
    
)
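One practical detail of splitting on newlines: if the model separates its questions with blank lines or adds trailing whitespace, a bare `split("\n")` yields empty or padded strings that would then be sent to the retriever. A small sketch of a more defensive parser follows; `parse_queries` is a hypothetical helper, and the `raw` string is an invented example of model output.

```python
def parse_queries(llm_output):
    """Split on newlines, strip whitespace, and drop empty lines."""
    return [line.strip() for line in llm_output.split("\n") if line.strip()]


# Invented example of raw model output with a blank line and stray spaces
raw = "How do agents split tasks?\n\n  What is task decomposition? \n"
queries = parse_queries(raw)
print(queries)
```

A function like this could replace the lambda in the final step of the chain, since LCEL accepts plain callables in a pipe.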

That is our query-generation chain, and below is the sample input question. We're going to take that list and simply apply each question to the retriever, so we do one retrieval per question. The function 'get_unique_union' takes the unique union of documents across all those retrievals. After running that, we get some set of documents back.

In [None]:
from langchain.load import dumps, loads

def get_unique_union(documents:list[list]):
    """Unique union of retrieved documents """
    # Flatten list of lists, and convert each Document to string
    flattened_docs = [dumps(doc) for sublist in documents for doc in sublist]
    #Get unique documents
    unique_docs = list(set(flattened_docs))
    #Return
    return [loads(doc) for doc in unique_docs]

#Retrieve
question = "What is task decomposition for LLM agents?"
retrieval_chain = generate_queries | retriever.map() | get_unique_union
docs = retrieval_chain.invoke({"question":question})
len(docs)
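Note that building a `set` does not preserve the order in which documents were retrieved. If order matters, one possible variant keeps the first occurrence of each document. The sketch below uses plain dicts and `json.dumps` as stand-ins for LangChain documents and `dumps`; `unique_union_ordered` is a hypothetical name.

```python
import json


def unique_union_ordered(result_lists):
    """Unique union of documents, keeping the order of first appearance."""
    seen = set()
    unique = []
    for docs in result_lists:
        for doc in docs:
            # Serialize to a string so the document can be used as a set member
            key = json.dumps(doc, sort_keys=True)
            if key not in seen:
                seen.add(key)
                unique.append(doc)
    return unique


# Plain dicts stand in for retrieved documents
retrievals = [
    [{"id": 1, "text": "planning"}, {"id": 2, "text": "memory"}],
    [{"id": 2, "text": "memory"}, {"id": 3, "text": "tools"}],
]
merged = unique_union_ordered(retrievals)
print([doc["id"] for doc in merged])
```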

If we go to LangSmith we can see what happened under the hood. Here we ran our initial chain to generate a set of reframed questions from our input. We can see a list of five retrievers, one for each of the following questions generated from the initial question "What is task decomposition for LLM agents?":

- How do LLM agents perform task decomposition?
- Can you explain the concept of task decomposition in LLM agents?
- What are the methods used by LLM agents for task decomposition?
- How does task decomposition work in the context of LLM agents?
- What is the role of task decomposition in LLM agent's functioning?

For each one of those questions we did an independent retrieval.

Now, let's come back to the code and show this working end to end. We are going to take that retrieval chain and pass it into context of our final RAG prompt, pass it to an LLM, and then parse the output:

In [None]:
from operator import itemgetter
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnablePassthrough

#RAG
template = """Answer the following question based on this context:

{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

llm = ChatOpenAI(temperature=0)

final_rag_chain = (
    {"context": retrieval_chain,
    "question": itemgetter("question")}
    | prompt
    | llm
    | StrOutputParser()
)

final_rag_chain.invoke({"question":question})

We can then go to LangSmith and see what happened under the hood. Our final chain consisted of taking our input question, breaking it out into 5 rephrased questions, doing a retrieval for each of those, and taking the unique union of documents. Then, we can see our final LLM prompt answering the question based on the context.

LangSmith can be used to investigate those intermediate questions that you generate in the question generation phase.

### <a id="query-translation-rag-fusion"></a>Query Translation (RAG Fusion)

In this first stage of an advanced RAG pipeline, we’re taking an input user question and translating it in some way in order to improve retrieval. We showed this general mapping of approaches previously. There’s rewriting, where you take a question and break it down into differently worded versions of the same question from different perspectives. There are sub-questions, where you take a question, break it down into smaller problems and solve each one independently. And there’s Step-Back, where you take a question and go more abstract, asking a higher-level question as a precondition to answering the user question. Here we’re going to dig into one particular approach to rewriting called RAG-Fusion.

<div style="text-align: center;">
<img src="RAG23.5.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 21</b>: General Approaches to Transform Queries - RAG-Fusion</p>
</div>

Now, this is really similar to what we just saw with multi-query, the difference being that we apply a clever ranking step to our retrieved documents, called Reciprocal Rank Fusion. That’s really the only difference: the input stage of taking a question, breaking it out into a few differently worded questions, and doing retrieval on each one is all the same, and we’re going to see that in code shortly.

<div style="text-align: center;">
<img src="RAG24.5 - RAG-Fusion - Query Translation.png" alt="Image" style="display: block; margin: 0 auto;">
<p style="text-align: center;"><b>Fig 22</b>: RAG-Fusion Diagram - Improve Search and Produce Consolidate Ranking</p>
</div>

### Part 6: RAG-Fusion

Docs:

* https://github.com/langchain-ai/langchain/blob/master/cookbook/rag_fusion.ipynb?ref=blog.langchain.dev

Blog / repo: 

* https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1

The first thing you'll note is what our prompt is, so it looks really similar to the prompt we just saw with Multi-Query.

### Prompt

In [None]:
from langchain.prompts import ChatPromptTemplate

# RAG-Fusion: Related
template = """You are a helpful assistant that generates multiple search queries based on a single input query. \n
Generate multiple search queries related to: {question} \n
Output (4 queries):"""
prompt_rag_fusion = ChatPromptTemplate.from_template(template)

Here is our query generation chain. Again, this looks like what we just saw: we take our prompt, plug it into an LLM, and then parse by newlines, which splits these questions out into a list.

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

generate_queries = (
    prompt_rag_fusion
    | ChatOpenAI(temperature=0)
    | StrOutputParser()
    | (lambda x: x.split("\n"))
)

Now, here's where the novelty comes in. Each time we do retrieval for one of those questions, we get back a list of documents from our retriever. Since our prompt asks for 4 queries, we do this 4 times and end up with a list of lists. Reciprocal Rank Fusion is really well suited to this exact problem, where we want to take these lists and build a single consolidated list. It looks at the rank of each document in each list and aggregates them into a final output ranking.

In [None]:
from langchain.load import dumps, loads

def reciprocal_rank_fusion(results: list[list], k=60):
    """ Reciprocal_rank_fusion that takes multiple lists of ranked documents
    and an optional parameter k used in the RRF formula """
    
    # Initialize a dictionary to hold fused scores for each unique document
    fused_scores = {}
    
    # Iterate through each list of ranked documents
    for docs in results:
        # Iterate through each document in the list, with its rank (position in the list)
        for rank, doc in enumerate(docs):
            # Convert the document to a string format to use as a key (assumes documents can be serialized to JSON)
            doc_str = dumps(doc)
            # If the document is not yet in the fused_scores dictionary, add it with an initial score of 0
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
            # Retrieve the current score of the document, if any
            previous_score = fused_scores[doc_str]
            # Update the score of the document using the RRF formula: 1 / (rank + k)
            fused_scores[doc_str] += 1 / (rank + k)
            
    # Sort the documents based on their fused scores in descending order to get the final reranked results
    reranked_results = [
        (loads(doc), score)
        for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    ]
    
    # Return the reranked results as a list of (document, score) tuples
    return reranked_results

retrieval_chain_rag_fusion = generate_queries | retriever.map() | reciprocal_rank_fusion
docs = retrieval_chain_rag_fusion.invoke({"question": question})
len(docs)
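A small worked example can help check the arithmetic. The sketch below applies the same scoring rule, 1 / (rank + k) with zero-based ranks, to three short rankings of plain string IDs. The IDs and the `rrf_scores` helper are hypothetical and used only for illustration; no retriever or serialization is involved.

```python
def rrf_scores(ranked_lists, k=60):
    """Fuse several rankings of document IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in ranked_lists:
        # rank is zero-based here, matching enumerate() in the cell above
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (rank + k)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Three hypothetical rankings, one per generated query
ranked_lists = [
    ["A", "B", "C"],
    ["B", "A", "D"],
    ["B", "C", "A"],
]
fused = rrf_scores(ranked_lists)
for doc_id, score in fused:
    print(doc_id, round(score, 4))
```

Here B is ranked first twice and second once (1/60 + 1/60 + 1/61), so it edges out A, which appears in all three lists but only once in first place (1/60 + 1/61 + 1/62). C and D follow because they appear in fewer lists.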

Now we can go to LangSmith and see what is happening in some detail. Here we can see our prompt ("You are a helpful assistant that generates multiple search queries based on a single input query") and also the search queries it generated:

- How does task decomposition work for LLM agents?
- Benefits of task decomposition in LLM agents?
- Examples of task decomposition techniques for LLM agents.
- Challenges in implementing task decomposition for LLM agents.

For each one of those queries we have a retrieval, 4 retrievals in total. Those retrievals simply go into the rank fusion function explained above, which produces a final list of six unique ranked documents. So, let's put all that together into a full RAG chain that runs retrieval, returns that final list of ranked documents and passes it in as our context, passes through our question, sends both to a RAG prompt, passes that to an LLM and parses the output. Let's run all that together and see it working.

In [None]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough

#RAG 
template = """Answer the following question based on this context: 

{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

final_rag_chain = (
    {"context": retrieval_chain_rag_fusion,
    "question": itemgetter("question")}
    | prompt
    | llm
    | StrOutputParser()
)

final_rag_chain.invoke({"question":question})

If we look in LangSmith, we can see those four questions, the retrieval for each question, and our final RAG prompt populated with the final list of ranked documents.

So, this can be really convenient, particularly if we're operating across different vector stores or we want to do retrieval across a large number of differently worded questions. This Reciprocal Rank Fusion step is really nice if, for example, we want to take only the top three documents. It builds a consolidated ranking across all these independent retrievals, which we then pass to an LLM for the final generation.
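Taking only the top few documents from the fused ranking could look like the sketch below. It assumes the fused results are (document, score) tuples sorted best-first, as the fusion function is described as returning. Plain strings stand in for documents, the scores are placeholder values, and `top_n_context` is a hypothetical helper.

```python
def top_n_context(fused_results, n=3, separator="\n\n"):
    """Keep the n best-ranked documents and join their text into one context string."""
    top_docs = [doc for doc, _score in fused_results[:n]]
    return separator.join(top_docs)


# Placeholder (text, score) pairs, already sorted by descending score
fused_results = [
    ("chunk one", 0.049),
    ("chunk two", 0.033),
    ("chunk three", 0.032),
    ("chunk four", 0.016),
]
context = top_n_context(fused_results, n=3)
print(context)
```

With real LangChain documents, the text would come from each document's `page_content` attribute rather than from the tuple element itself.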

### Codeium Detailed Explanation

#### Indexing

The code starts by loading a blog post from a URL using the WebBaseLoader class from langchain_community.document_loaders. The loader is configured to parse only the content, title, and header of the post.

The loaded blog post is then split into smaller chunks using the RecursiveCharacterTextSplitter class from langchain.text_splitter. The splitter is created with from_tiktoken_encoder, so sizes are measured in tokens rather than characters: chunks of up to 300 tokens with an overlap of 50 tokens.

The resulting chunks are then indexed using the Chroma class from langchain_community.vectorstores. The Chroma class is a vector store that allows for efficient similarity searches. The indexing process involves converting the chunks into embeddings using the OpenAIEmbeddings class, which is provided by the langchain_openai package.

#### RAG-Fusion

The code then defines a prompt template for generating multiple search queries based on a single input query. The prompt template is defined using the ChatPromptTemplate class from langchain.prompts.

The generate_queries function is defined as a pipeline that takes the prompt template, generates queries using the ChatOpenAI class from langchain_openai, and parses the output using the StrOutputParser class from langchain_core.output_parsers.

#### Reciprocal Rank Fusion

The reciprocal_rank_fusion function is defined as a function that takes multiple lists of ranked documents and an optional parameter k used in the RRF formula.

Here's a step-by-step explanation of the reciprocal_rank_fusion function:

1. Initialize an empty dictionary fused_scores to hold the fused scores for each unique document.
2. Iterate through each list of ranked documents in the results parameter.
3. For each document in the list, convert the document to a string format using the dumps function from langchain.load. This is done to use the document as a key in the fused_scores dictionary.
4. Check if the document is already in the fused_scores dictionary. If not, add it with an initial score of 0.
5. Retrieve the current score of the document from the fused_scores dictionary.
6. Update the score of the document using the RRF formula: 1 / (rank + k), where rank is the position of the document in the list and k is the optional parameter.
7. Repeat steps 3-6 for each document in each list.
8. Sort the documents based on their fused scores in descending order to get the final reranked results.
9. Return the reranked results as a list of tuples, where each tuple contains the document and its score.

#### Retrieval Chain

The retrieval_chain_rag_fusion chain is a pipeline that runs generate_queries, maps the retriever over the generated queries, and applies the reciprocal_rank_fusion function to the results.

The docs variable is assigned the result of invoking retrieval_chain_rag_fusion with a question as input.

#### RAG

The code then defines a prompt template for answering a question based on a given context. The prompt template is defined using the ChatPromptTemplate class from langchain.prompts.

The final_rag_chain is a pipeline that combines retrieval_chain_rag_fusion, the prompt template, the llm object (defined earlier, in the Multi-Query section), and a StrOutputParser.

The final_rag_chain is invoked with a question as input, and the result is not assigned to any variable.

#### Possible Changes or Additions

1. Modify the k parameter in the RRF formula to adjust the weight given to different ranks.
2. Change the web_paths argument of the WebBaseLoader class to load different web pages, or the bs_kwargs dictionary to specify different parsing rules.
3. Modify the chunk_size and chunk_overlap parameters in the RecursiveCharacterTextSplitter class to adjust the splitting of the text.
4. Change the embedding parameter in the Chroma.from_documents function to use different embeddings or pre-trained models.
5. Modify the template in the ChatPromptTemplate class to change the prompt format or add additional information.
6. Replace the llm function with a different language model or chatbot implementation.
7. Add additional processing steps to the retrieval_chain_rag_fusion pipeline, such as filtering or ranking the results.
8. Use a different algorithm for ranking the documents, such as BM25 or TF-IDF.
9. Experiment with different values for the temperature parameter in the ChatOpenAI class to adjust the level of randomness in the generated queries.
10. Use a different output parser, such as JsonOutputParser, to parse the output of the chat prompt into a different format.

#### Considerations

1. The reciprocal_rank_fusion function assumes that the documents can be serialized to JSON format using the dumps function from langchain.load. If this is not the case, you may need to modify the function to use a different serialization method.
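2. The k parameter in the RRF formula is an optional parameter that can be adjusted based on your specific needs. A larger value of k flattens the scores, so the difference between top-ranked and lower-ranked documents matters less; a smaller value of k gives more weight to documents at the top of each list.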
3. The code uses the ChatOpenAI class from langchain_openai to generate search queries based on a single input query. If you want to use a different language model or chatbot implementation, you will need to replace this class with a different one.
4. The llm object is not defined in the RAG-Fusion code cells; it is reused from the Multi-Query section. If you want to use a different language model or chatbot implementation, you will need to define it accordingly.
5. The code uses the StrOutputParser class from langchain_core.output_parsers to parse the output of the chat prompt into a string. If you want to parse the output into a different format, you will need to use a different output parser.
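The effect of k can be checked numerically. The sketch below compares how much a first-place hit (rank 0) contributes relative to a tenth-place hit (rank 9) for a small and a large k, using the same 1 / (rank + k) rule with zero-based ranks. The helper names are hypothetical.

```python
def rrf_contribution(rank, k):
    """Score that a single appearance at the given zero-based rank adds."""
    return 1 / (rank + k)


def top_vs_tenth_ratio(k):
    """How many times more a rank-0 hit contributes than a rank-9 hit."""
    return rrf_contribution(0, k) / rrf_contribution(9, k)


ratio_small_k = top_vs_tenth_ratio(k=1)   # (1/1) / (1/10)
ratio_large_k = top_vs_tenth_ratio(k=60)  # (1/60) / (1/69)
print(round(ratio_small_k, 2), round(ratio_large_k, 2))
```

With k=1 a first-place hit counts about ten times as much as a tenth-place hit, while with k=60 it counts only about 1.15 times as much, so larger values of k make the fused ranking depend more on how many lists a document appears in than on its exact position.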

### <a id="query-translation-decomposition"></a>Query Translation (Decomposition)
Your content here...

### <a id="query-translation-step-back"></a>Query Translation (Step Back)
Your content here...

### <a id="query-translation-hyde"></a>Query Translation (HyDE)
Your content here...

### <a id="routing"></a>Routing
Your content here...

### <a id="query-construction"></a>Query Construction
Your content here...

### <a id="indexing-multi-representation"></a>Indexing (Multi Representation)
Your content here...

### <a id="indexing-raptor"></a>Indexing (RAPTOR)
Your content here...



### <a id="indexing-colbert"></a>Indexing (ColBERT)
Your content here...



### <a id="crag"></a>CRAG
Your content here...



### <a id="adaptive-rag"></a>Adaptive RAG
Your content here...



### <a id="the-future-of-rag"></a>The Future of RAG
Your content here...