## RAG in depth

Today we will go over building a retrieval-augmented generation (RAG) chatbot in detail.

The two main activities are:

- Index Creation
- Question Answering

Most of the steps under each activity are optional, but may improve answer quality.

# Index creation steps

1. Load the documents
2. Enhance document quality
3. Split the documents into chunks
4. Add metadata to the chunks
5. Generate embeddings for the chunks
6. Index the chunks

## Load the documents

The documents have to be converted to plain text or Markdown. This is more difficult than it seems: documents may be in PDF, PowerPoint, or MS Word format, or you may need to fetch data from a database or Slack.

LlamaIndex has loaders for many different sources and file formats:
- https://docs.llamaindex.ai/en/stable/module_guides/loading/

I have found that Unstructured.io has high-quality loaders.
- API: https://docs.unstructured.io/api-reference/api-services/overview
- Open source: https://github.com/Unstructured-IO/unstructured
- Integration with LlamaIndex: https://docs.unstructured.io/open-source/integrations#integration-with-llamaindex

## Enhance document quality

Some documents contain abbreviations or terms that can be understood only in the context in which they appear (e.g., does PO stand for post office or purchase order?). Others contain tables or images that need special processing.

This step generally involves studying your documents and writing custom code.

## Split the documents into chunks

Each document must be split into small chunks of text (500-2000 characters), and each chunk is added to the index as a separate object. Later, when we query the index, the most relevant chunks are returned. The goal is to create chunks that are more or less self-contained: they contain enough information to answer a question, but not too much. Splitting is so important that we will have a separate class on splitting.

There are many kinds of splitters (you could even create your own that combines ideas from the ones below):
- Sliding window: create a chunk every N characters, with some overlap between neighboring chunks
- Structure: take the HTML or markdown structure into account to create chunks based upon headings and paragraphs.
- Semantic: combine sentences with similar vectors into chunks.
- Tree: create chunks at different levels of granularity; parent chunks may contain summaries.

LlamaIndex has many types of splitters: https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/
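
To make the sliding-window idea concrete, here is a minimal sketch in plain Python. It is not LlamaIndex's splitter; the function name `sliding_window_chunks` and the default sizes are my own assumptions, chosen to fall inside the 500-2000 character range mentioned above.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Each chunk starts (chunk_size - overlap) characters after the previous
    one, so neighboring chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

A real splitter would usually also avoid cutting in the middle of a word or sentence; this sketch ignores that on purpose.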

## Add metadata to the chunks

Each chunk includes text, but it can also be useful to add additional information to each chunk to provide context for answering questions:
- file name and markdown/html section headers
- pointers to the previous, next, and parent chunks
  - or maybe summaries of the previous + next chunks, or a summary of the parent chunk
- entities related to the chunk (author, product name, etc.)

This step generally involves studying your chunks and writing custom code.
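
As a rough illustration of that custom code, the sketch below attaches a file name, a section header, and previous/next pointers to each chunk. The `Chunk` class, the `attach_metadata` function, and the `file#n` id scheme are hypothetical names I made up for this example; LlamaIndex has its own node classes.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    chunk_id: str
    text: str
    metadata: dict = field(default_factory=dict)


def attach_metadata(texts: list[str], file_name: str, section: str) -> list[Chunk]:
    """Wrap chunk texts in Chunk objects and link each one to its neighbors."""
    chunks = [Chunk(chunk_id=f"{file_name}#{i}", text=t) for i, t in enumerate(texts)]
    for i, chunk in enumerate(chunks):
        chunk.metadata = {
            "file_name": file_name,
            "section": section,
            # Pointers let us fetch surrounding context at query time.
            "prev_id": chunks[i - 1].chunk_id if i > 0 else None,
            "next_id": chunks[i + 1].chunk_id if i < len(chunks) - 1 else None,
        }
    return chunks
```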

## Generate embeddings for the chunks

We need to generate an embedding (vector) for each chunk in order to index the chunk. The better the embedding captures the "semantic meaning" of the chunk, the more likely it is that your retrieved chunks will be relevant to the question. Popular embedding providers include OpenAI, Voyage AI, and Cohere.

When generating the embeddings, a question to ask is: Should I generate the embedding based upon the chunk text only, or should I also include some or all of the metadata?

LlamaIndex supports many embeddings: https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings/#list-of-supported-embeddings
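
One way to experiment with the "text only vs. text plus metadata" question is to build the string you send to the embedding model in a single place. The sketch below is an assumption about how you might do that; `embedding_input` and its parameters are hypothetical, and it does not call any embedding API.

```python
def embedding_input(text: str, metadata: dict[str, str], include_keys: tuple[str, ...] = ()) -> str:
    """Build the string to embed: selected metadata lines first, then the chunk text."""
    header = [f"{key}: {metadata[key]}" for key in include_keys if key in metadata]
    return "\n".join(header + [text])
```

With `include_keys=()` you embed the chunk text only; with `include_keys=("section",)` the section header becomes part of what the embedding captures. Comparing retrieval quality for both settings is a cheap experiment.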

## Index the chunks

In addition to the embedding described above (which is called a "dense" embedding), some indexes also support sparse embeddings, which are typically based on weighted keywords rather than overall meaning. It's possible to query both dense and sparse embeddings to improve the relevance of the retrieved chunks. This is called "hybrid" search.

Several indexes support hybrid search. We will cover indexing and hybrid search in a separate class.
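
If you are curious how two result lists can be combined, one common approach is reciprocal rank fusion (RRF). It is only one option, and the index you choose may fuse results differently. The sketch below assumes you already have a dense ranking and a sparse ranking as lists of chunk ids; `k=60` is a conventional default, not something tuned for your data.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into one ranking.

    Each chunk scores 1 / (k + rank) for every list it appears in,
    so chunks ranked highly by both searches rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))
```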

LlamaIndex supports many indexes: https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores/

You will notice that LlamaIndex makes a (confusing) distinction between vector stores, document stores, index stores, and other types of stores. We will cover this in the class on indexing.

# Question-answering steps

1. Transform the question 
2. Route the question to different indexes
3. Query the index to retrieve chunks that are relevant to the question
4. Post-process the chunks
5. Generate a prompt and send it to the LLM
6. Analyze the answer and possibly repeat these steps with a follow-on question

## Transform the question 

When we query the index, we generate an embedding for the question and compare it to the embeddings we generated for the chunks. But what if the question and the relevant chunks aren't that semantically similar? In this case we might want to augment or replace the question with one that contains more words/concepts that are likely to be found in the relevant chunks.

For example, the question: "why are there two priesthoods?" doesn't contain the words Aaronic or Melchizedek. But the relevant chunks would likely contain both of those words.

One way to do this, called HyDE (Hypothetical Document Embeddings), is to ask the LLM to guess an answer without looking at anything in the database. Its answer might contain hallucinations, but that's OK as long as it contains words/concepts that are likely to be found in the relevant chunks. We use the original question and the guessed answer when generating the embedding for the question.

LlamaIndex supports HyDE and other query transformations:
- https://docs.llamaindex.ai/en/stable/examples/query_transformations/HyDEQueryTransformDemo/?h=hyde
- https://docs.llamaindex.ai/en/stable/examples/query_transformations/query_transform_cookbook/
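
Here is a minimal sketch of the HyDE idea described above, independent of LlamaIndex. The `generate` parameter is a placeholder for whatever function sends a prompt to your LLM and returns its text; the prompt wording is also just an assumption.

```python
from typing import Callable


def hyde_query_text(question: str, generate: Callable[[str], str]) -> str:
    """Return the text to embed for retrieval: the question plus a guessed answer."""
    prompt = (
        "Write a short passage that answers the question below. "
        "It is fine to guess.\n\n"
        f"Question: {question}\nPassage:"
    )
    guessed_answer = generate(prompt)  # may contain hallucinations; used only for retrieval
    return f"{question}\n{guessed_answer}"
```

The returned text is what you embed and send to the index. The guessed answer is never shown to the user.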

## Route the question to different indexes

Suppose you are building a customer-support chatbot. Some of your documents come from the product catalog and describe products; others come from the customer-support knowledge base and explain shipping and how to return items for a refund. Rather than putting all documents into a single index, you may be better off creating one index for the product catalog and a separate index for the customer-support knowledge base. Then there's less chance that you'll retrieve a chunk from the product catalog when a customer asks how to return the "moto g" phone they just bought.

If you have multiple indexes, the next step is to determine which index to query for an incoming question. This is called routing.

LlamaIndex supports several kinds of routers: https://docs.llamaindex.ai/en/stable/module_guides/querying/router/
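
To show the shape of a router, here is a small sketch that asks an LLM to pick an index by name. This is not the LlamaIndex router API; `route_question`, the prompt, and the `generate` callable (a stand-in for your LLM call) are all assumptions for illustration.

```python
from typing import Callable


def route_question(
    question: str,
    index_descriptions: dict[str, str],
    generate: Callable[[str], str],
    default: str,
) -> str:
    """Ask the LLM which index fits the question; fall back to `default` if unsure."""
    options = "\n".join(f"- {name}: {desc}" for name, desc in index_descriptions.items())
    prompt = (
        "Choose the single best index for the question. "
        "Reply with the index name only.\n\n"
        f"Indexes:\n{options}\n\nQuestion: {question}\nIndex:"
    )
    # Match case-insensitively, and never trust a name that isn't a known index.
    names = {name.lower(): name for name in index_descriptions}
    choice = generate(prompt).strip().lower()
    return names.get(choice, default)
```

The fallback matters: LLMs sometimes reply with extra words or an index that doesn't exist, and you still need to query something.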

## Query the index to retrieve chunks that are relevant to the question

This step involves generating an embedding for the question and using that embedding to query an index to find chunks with similar embeddings. 

One thing you can ask yourself at this point is: Can I extract metadata from the question and use it to filter the chunks returned? For example, if the question is "tell me what Elder Holland said about adversity" and you've extracted the author in each chunk's metadata, you could extract Elder Holland from the question and pass that to the index as a filter.

LlamaIndex retriever support (the default usually works fine):
- https://docs.llamaindex.ai/en/stable/module_guides/querying/retriever/
- https://docs.llamaindex.ai/en/stable/module_guides/querying/retriever/retrievers/
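
Under the hood, retrieval with a metadata filter looks roughly like the sketch below: drop chunks that fail the filter, then rank the rest by similarity to the question embedding. This is a brute-force illustration with made-up names (`retrieve`, and chunks stored as plain dicts); a real index does the same job far more efficiently.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_embedding: list[float],
    chunks: list[dict],
    top_k: int = 3,
    filters: Optional[dict] = None,
) -> list[tuple[float, dict]]:
    """Return the top_k (score, chunk) pairs among chunks whose metadata matches `filters`."""
    filters = filters or {}
    candidates = [
        chunk for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in filters.items())
    ]
    scored = [(cosine_similarity(query_embedding, c["embedding"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

For the Elder Holland example, you would extract the author from the question and pass something like `filters={"author": "Holland"}`, assuming that is how the author was stored in the metadata.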

## Post-process the chunks

You may find that the chunk texts don't contain enough context, and that the LLM needs additional context to answer the question. The idea is that the chunk text you generated the embedding from doesn't have to be the same as the chunk text you send to the LLM. For example, you may find it helpful to include one of the following when sending the chunk to the LLM:
- the text of the previous and/or next chunks (or portions/summaries of those texts)
- summary of the parent section

Here are some examples of augmenting nodes with previous and next chunks, or with parents:
- https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/MetadataReplacementDemo/
- https://docs.llamaindex.ai/en/stable/examples/retrievers/auto_merging_retriever/
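
A bare-bones version of the "include the neighbors" idea might look like this. It assumes each chunk is a dict with `text`, `prev_id`, and `next_id` keys and that all chunks are available in a lookup table; those names are hypothetical, not LlamaIndex's.

```python
def expand_with_neighbors(chunk_id: str, chunks_by_id: dict[str, dict]) -> str:
    """Return the text to send to the LLM: previous chunk + retrieved chunk + next chunk."""
    chunk = chunks_by_id[chunk_id]
    prev_chunk = chunks_by_id.get(chunk.get("prev_id"))  # None at the start of a document
    next_chunk = chunks_by_id.get(chunk.get("next_id"))  # None at the end of a document
    parts = [c["text"] for c in (prev_chunk, chunk, next_chunk) if c is not None]
    return "\n".join(parts)
```

The embedding was generated from the small chunk, so retrieval stays precise, while the LLM gets the wider passage.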

## Post-process the chunks (continued)

In addition to augmenting the chunk text, you may find it helpful to:
- re-rank the chunks using a more-expensive ranking model to ensure that the most-relevant chunks get inserted into the prompt first
- remove chunks whose similarity score lies below a certain threshold 
- remove sentences from chunks that aren't relevant to the question
- summarize chunks to retain only information that is relevant to the question

LlamaIndex supports a wide variety of post-processors, especially re-rankers:
- https://docs.llamaindex.ai/en/stable/module_guides/querying/node_postprocessors/
- https://docs.llamaindex.ai/en/stable/module_guides/querying/node_postprocessors/node_postprocessors/
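
The simplest of these post-processors, the similarity threshold, fits in a few lines. The sketch below assumes retrieval gave you `(score, text)` pairs; the function name and the `0.75` cutoff are arbitrary placeholders, and a sensible threshold depends on your embedding model and data.

```python
def filter_by_score(
    scored_chunks: list[tuple[float, str]],
    min_score: float = 0.75,
    max_chunks: int = 5,
) -> list[tuple[float, str]]:
    """Keep chunks scoring at least min_score, best first, capped at max_chunks."""
    kept = [(score, text) for score, text in scored_chunks if score >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:max_chunks]
```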



## Generate a prompt and send it to the LLM

This step involves generating a prompt that includes instructions and the chunk texts and sending the prompt to the LLM to generate an answer to the question.

There are four things to think about at this step:
1. What should my instructions say?
2. Should I include examples of "question + chunks -> ideal answer" in my prompt to help the LLM understand how I want it to answer the question, and if so, which examples should I use? A few examples often help.
3. Should I include questions + answers from the chat history in the prompt so if the new question references something from a previous question or answer, the LLM will be able to understand the reference?
4. Should I include all of the chunks at once, or should I include them iteratively?
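
To make the first three questions concrete, here is a sketch of how the pieces might be assembled into one prompt. The layout, labels, and the `build_prompt` function are my own assumptions; LlamaIndex builds its prompts from its own templates.

```python
def build_prompt(
    question: str,
    chunks: list[str],
    instructions: str,
    examples: tuple = (),   # (question, context, ideal answer) triples
    history: tuple = (),    # (earlier question, earlier answer) pairs
) -> str:
    """Assemble instructions, few-shot examples, chat history, and chunks into one prompt."""
    sections = [instructions]
    for ex_question, ex_context, ex_answer in examples:
        sections.append(
            f"Example question: {ex_question}\n"
            f"Example context: {ex_context}\n"
            f"Example answer: {ex_answer}"
        )
    for past_question, past_answer in history:
        sections.append(f"Previous question: {past_question}\nPrevious answer: {past_answer}")
    # Number the chunks so the instructions can ask the LLM to cite them.
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    sections.append(f"Context:\n{context}")
    sections.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(sections)
```

This version includes all chunks at once, which is the simple answer to the fourth question.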

## Generate a prompt and send it to the LLM (continued)

DSPy can help you come up with the best instructions and examples. Once you have created your index, you can use DSPy to optimize the prompt, then use the optimized prompt with LlamaIndex.

LlamaIndex has code to:
- customize your prompt: https://docs.llamaindex.ai/en/stable/examples/prompts/prompts_rag/
- include chat history: https://docs.llamaindex.ai/en/stable/module_guides/deploying/chat_engines/
- determine which chunks to include (you don't need to worry about this usually - including all chunks at once works fine most of the time): https://docs.llamaindex.ai/en/stable/module_guides/querying/response_synthesizers/

## Analyze the answer and possibly repeat these steps with a follow-on question

Traditional question-answering chatbots return the answer to the user immediately. But what if, before returning the answer, you gave the question and answer to the LLM and asked whether more work was needed to answer the question properly? Maybe the LLM determines that it needs to retrieve additional information from the index before it can answer completely. Or maybe you decide to ask the question from multiple points of view and have a final LLM call combine the answers into one comprehensive answer.

For lack of a better word, these kinds of workflows are called "Agentic". This is an advanced concept.

LlamaIndex has support for agents: https://docs.llamaindex.ai/en/stable/module_guides/deploying/agents/
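
Stripped of any framework, the "analyze and repeat" loop can be sketched as below. Both callables are placeholders you would implement with your own RAG pipeline and LLM: `answer_once(question, notes)` runs retrieval plus generation, and `suggest_follow_up(question, answer)` returns a follow-on question, or `None` when the answer looks complete. The names and the `max_rounds` cap are assumptions.

```python
from typing import Callable, Optional


def answer_with_follow_ups(
    question: str,
    answer_once: Callable[[str, list], str],
    suggest_follow_up: Callable[[str, str], Optional[str]],
    max_rounds: int = 3,
) -> str:
    """Answer, let a reviewer ask for more information, and re-answer with the extra notes."""
    notes: list[str] = []
    answer = answer_once(question, notes)
    for _ in range(max_rounds - 1):
        follow_up = suggest_follow_up(question, answer)
        if follow_up is None:
            break  # the reviewer is satisfied
        # Answer the follow-on question, then retry the original with that extra context.
        notes.append(answer_once(follow_up, []))
        answer = answer_once(question, notes)
    return answer
```

The cap on rounds is important: without it, a reviewer that is never satisfied would loop forever and run up your LLM bill.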

# More documentation links

LlamaIndex has a lot of documentation. It's not always up to date, and it doesn't explain some things very well, but that's pretty common with open-source software. If you run into issues or have any questions, please let me know.

Here are a few additional links you may find useful:

- https://docs.llamaindex.ai/en/stable/use_cases/q_and_a/
- https://docs.llamaindex.ai/en/stable/examples/
- https://docs.llamaindex.ai/en/stable/understanding/putting_it_all_together/q_and_a/