# Retrieval Augmented Generation based Question Answering pipeline

In this notebook, I show how I created a Retrieval Augmented Generation (RAG) pipeline for question answering over publicly available PDF and HTML data from PwC's website. The whole notebook relies on Llama Index, an LLM application library. It consists of two main chapters:

1. Data Loading Pipeline: Assembling and running the pipeline that will load, transform and store our input data

2. Question Answering Pipeline: Assembling and testing the pipeline that will generate answers for the posed questions


Before we start, let's install the necessary libraries and initialize some variables that we will use throughout the notebook.


## Installation of required packages

Run the cell below. It only needs to be run the first time you open this notebook; after that, you can skip it.


In [None]:
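# Note: the PyPI package named `fitz` is unrelated to PyMuPDF and can break `import fitz`; `pymupdf` alone provides that module.
!pip install --quiet llama-index llama-index-embeddings-ollama llama-index-llms-ollama llama-index-vector-stores-chroma llama-index-readers-file pymupdf spacy nest-asyncio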

## Initialize shared resources

You need to run this cell every time you restart your kernel.


In [1]:
from llama_index.core.ingestion import IngestionPipeline, DocstoreStrategy, IngestionCache
from llama_index.core import SimpleDirectoryReader
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.readers.file import PyMuPDFReader, HTMLTagReader
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.llms.ollama import Ollama
import chromadb

# Set configuration for Ollama
ollama_config = {
    # Include the URL scheme so HTTP clients can resolve the address
    "base_url": "http://127.0.0.1:11434",
    "embedding_model_name": "nomic-embed-text",
    "llm_name": "llama3.2"
}

# Set project data paths
project_data_paths = {
    "input_data_dir_path": "../data",
    "evaluation_results_dir_path": "../results",
    "vector_db_data_dir_path": "../vector_db_data",
    "pipeline_cache_dir_path": "../pipeline_cache",
}

# Define common resources
embedding_model = OllamaEmbedding(base_url=ollama_config["base_url"], model_name=ollama_config["embedding_model_name"])
llm = Ollama(model=ollama_config["llm_name"], base_url=ollama_config["base_url"])
chroma_client = chromadb.PersistentClient(project_data_paths["vector_db_data_dir_path"])
chroma_collection = chroma_client.get_or_create_collection("pwc_data")
chroma_vector_store = ChromaVectorStore(chroma_collection=chroma_collection)


## Data Loading Pipeline

In this phase, we load the PDF and HTML input documents.

### What the Code Does
- Instantiates and configures the necessary objects to assemble the ingestion pipeline.
- Instantiates and configures the data ingestion pipeline.

### Loaders
- **PyMuPDFReader:**  
  Used to load PDF documents. This loader relies on the well-known PyMuPDF PDF parsing library, which is capable of identifying tables and other non-text-based objects in the files. While another good choice could have been the SmartPDFLoader, it requires the `llmsherpa` backend service to be hosted, which would further complicate the project. Based on my prior experience and the current performance of this loader, I found PyMuPDFReader sufficient for now.
  
- **HTMLTagReader:**  
  Used to load HTML documents. This loader utilizes the widely used BeautifulSoup library to extract text data from specified HTML tags, filtering out unused elements such as JavaScript scripts.

### SimpleDirectoryReader
This class is responsible for loading files from a directory and using the specified readers to parse them. It supports external file systems as well. In the code, I configured it to use PyMuPDFReader for `.pdf` files and HTMLTagReader for `.html` files. This setup creates a unified loader, ensuring that both PDF and HTML documents are treated equally. Of course, it is possible to treat them separately, but this is unnecessary in our use case, as the data in the two file types can be considered to belong to the same domain in terms of content.
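
The extension-based routing described above can be pictured with a small standalone sketch. It does not use Llama Index at all: `read_pdf`, `read_html`, and `load_directory` are hypothetical stand-ins, and unlike the real `SimpleDirectoryReader` (which, as far as I know, falls back to a default reader), this simplified version just skips unknown file types.

```python
from pathlib import Path
from typing import Callable, Dict, List


# Hypothetical stand-ins for the real readers: each takes a path and returns text items.
def read_pdf(path: Path) -> List[str]:
    return [f"pdf-text:{path.name}"]


def read_html(path: Path) -> List[str]:
    return [f"html-text:{path.name}"]


def load_directory(
    paths: List[Path], extractors: Dict[str, Callable[[Path], List[str]]]
) -> List[str]:
    """Route each file to the reader registered for its extension."""
    documents: List[str] = []
    for path in sorted(paths):
        reader = extractors.get(path.suffix.lower())
        if reader is None:
            # Simplification: files without a registered reader are skipped.
            continue
        documents.extend(reader(path))
    return documents


# Same shape as the `file_extractor` mapping used in the notebook.
file_extractor = {".pdf": read_pdf, ".html": read_html}
sample_paths = [Path("report.pdf"), Path("index.html"), Path("notes.txt")]
loaded = load_directory(sample_paths, file_extractor)
print(loaded)
```

Because both readers feed the same output list, everything downstream sees one uniform stream of documents regardless of the original file type.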

### IngestionPipeline
The ingestion pipeline is responsible for:
1. Transforming documents into **Node** objects.
2. Generating embeddings for these nodes.
3. Storing Node-Embedding pairs in a vector database.  

Additionally, it manages a document store and a cache. If the pipeline is run again with the same data, it should reuse the cached values instead of performing a full reload (though in my runs this reuse did not take effect, and I have not yet identified the cause).
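
To make the intended reuse behaviour concrete, here is a minimal sketch of hash-based upserting in plain Python. It is not Llama Index's implementation: `content_hash` and `upsert` are hypothetical names, and the "docstore" is just a dictionary from document id to content hash. The last call also illustrates one possible (unconfirmed) explanation for the missing reuse: this notebook builds a fresh `SimpleDocumentStore()` and `IngestionCache()` on every kernel restart and never loads the persisted state back, so a rerun may simply start from an empty store. If that is the cause, loading the persisted state with `IngestionPipeline.load(persist_dir=...)` before calling `run` would be the first thing to try.

```python
import hashlib
from typing import Dict, List, Tuple


def content_hash(text: str) -> str:
    """Fingerprint a document's text so changes can be detected."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def upsert(docstore: Dict[str, str], documents: List[Tuple[str, str]]) -> List[str]:
    """Return the ids of new or changed documents and record their hashes."""
    to_process: List[str] = []
    for doc_id, text in documents:
        digest = content_hash(text)
        if docstore.get(doc_id) != digest:
            docstore[doc_id] = digest
            to_process.append(doc_id)
    return to_process


docstore: Dict[str, str] = {}

# First run: everything is new, so everything is processed.
first_run = upsert(docstore, [("a.pdf", "alpha"), ("b.html", "beta")])

# Second run with the same store: only the changed document is processed.
second_run = upsert(docstore, [("a.pdf", "alpha"), ("b.html", "beta v2")])

# Run against an empty store (state never loaded back): full reprocessing again.
fresh_store_run = upsert({}, [("a.pdf", "alpha"), ("b.html", "beta")])

print(first_run, second_run, fresh_store_run)
```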

### SemanticSplitterNodeParser
The **SemanticSplitterNodeParser** is a node parser (or chunking method). I chose it because it avoids using a static window size. Parsers with static window sizes are unaware of internal document topics and often combine text from different sections, which weakens the embeddings. These mixed embeddings can lead to uninformative chunks, as the content comes from very different topics. In the worst case, such chunks might never be utilized in the RAG pipeline because no query will be similar enough to match them.

In contrast, **SemanticSplitterNodeParser** creates chunks based on their semantic meaning. It breaks the document into sentences, generates embeddings for each sentence, and compares the embeddings of neighboring sentences to determine their similarity. If the sentences are similar enough, they form part of the same node (chunk). If not, a new node is created, and the sentence becomes the first item in that new node.

This approach dynamically sizes chunks to encapsulate cohesive topics within the document. By doing so, it avoids combining unrelated sections and ensures that each chunk remains focused and meaningful, increasing its relevance and usability within the RAG pipeline.
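
The neighbour-comparison idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the parser's actual algorithm: `toy_embed` is a hypothetical keyword-count embedding standing in for a real model, and the fixed `threshold` is an assumption made for readability (to my knowledge, the real parser compares buffered sentence groups and derives its breakpoints from a percentile of the observed distances).

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_embed(sentence: str) -> List[float]:
    """Hypothetical embedding: counts of a few topic keywords."""
    words = sentence.lower().replace(".", "").split()
    vocabulary = ["tax", "audit", "employment", "youth"]
    return [float(words.count(term)) for term in vocabulary]


def semantic_chunks(
    sentences: List[str],
    embed: Callable[[str], List[float]],
    threshold: float = 0.5,
) -> List[List[str]]:
    """Start a new chunk whenever neighbouring sentences are too dissimilar."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    previous = embed(sentences[0])
    for sentence in sentences[1:]:
        current = embed(sentence)
        if cosine(previous, current) >= threshold:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
        else:
            chunks.append([sentence])  # topic shift: open a new chunk
        previous = current
    return chunks


sentences = [
    "Tax rules changed this year.",
    "The tax audit covers all filings.",
    "Youth employment rose sharply.",
    "Employment among youth is tracked yearly.",
]
chunks = semantic_chunks(sentences, toy_embed)
for chunk in chunks:
    print(chunk)
```

With these toy inputs the two tax sentences end up in one chunk and the two employment sentences in another, which is the behaviour a fixed-size window cannot guarantee.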




In [2]:
# Define unified directory file loader
pdf_reader = PyMuPDFReader()
html_reader = HTMLTagReader(tag="section", ignore_no_id=True)
file_extractor = {".pdf": pdf_reader, ".html": html_reader}
document_reader = SimpleDirectoryReader(
    input_dir=project_data_paths["input_data_dir_path"], file_extractor=file_extractor
)

# Define unified document processing pipeline
pwc_document_processing_pipeline = IngestionPipeline(
    name="PWC document ingestion pipeline",
    project_name="PWC example project",
    docstore=SimpleDocumentStore(),
    docstore_strategy=DocstoreStrategy.UPSERTS,
    transformations=[SemanticSplitterNodeParser(embed_model=embedding_model), embedding_model],
    vector_store=chroma_vector_store,
    cache=IngestionCache()
)

Lastly, we run the loader and then the transformation and storage pipeline.

In [3]:
documents = document_reader.load_data(show_progress=True, num_workers=10)
pwc_document_processing_pipeline.run(documents=documents, num_workers=10, cache_collection="pwc_cache")
pwc_document_processing_pipeline.persist(persist_dir=project_data_paths["pipeline_cache_dir_path"])

## Question Answering Pipeline

In this phase, we create the pipeline that will generate answers to the questions.

### Answer Generation Architecture

Here, we first create a **VectorStoreIndex**. The VectorStoreIndex is an object that provides access to the previously indexed and stored data inside the vector database. This is then used in a **Query Engine** component, which is responsible for orchestrating the response generation.

When the pipeline is invoked:
1. The Query Engine component first invokes the embedding model to embed the user's query.
2. This embedding is passed to the **VectorStoreIndex** component to retrieve the top 5 most similar chunks to the query embedding.
3. After receiving the top 5 chunks, the Query Engine invokes the **ResponseSynthesizer** module (which is by default part of the Query Engine component). It concatenates these 5 chunks and inserts them into a prompt, which it sends to the LLM for response generation.
4. After receiving the answer, the Query Engine component returns the response, which is the answer to the question.

This pipeline ensures that all the necessary chunks for response generation are directly injected into the response generation prompt and used to augment the LLM appropriately.
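
The retrieve-then-synthesize flow above can be sketched without any framework. Everything here is illustrative: the three-dimensional vectors are made-up embeddings, `retrieve_top_k` and `build_prompt` are hypothetical helpers, and the prompt wording is my own rather than Llama Index's default template. The example uses `k=2` only to keep the toy data small.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(
    query_vector: Sequence[float],
    store: List[Tuple[str, List[float]]],
    k: int,
) -> List[str]:
    """Return the texts of the k chunks most similar to the query embedding."""
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, chunks: List[str]) -> str:
    """Concatenate the retrieved chunks and wrap them in a question-answering prompt."""
    context = "\n---\n".join(chunks)
    return (
        "Context information is below.\n"
        f"{context}\n"
        "Using only this context, answer the question.\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up (text, embedding) pairs standing in for the vector database contents.
store = [
    ("Chunk about youth employment rankings.", [0.9, 0.1, 0.0]),
    ("Chunk about corporate tax policy.", [0.1, 0.9, 0.0]),
    ("Chunk about employment trends.", [0.8, 0.2, 0.1]),
]
query_vector = [1.0, 0.0, 0.0]  # pretend embedding of the user's question

top_chunks = retrieve_top_k(query_vector, store, k=2)
prompt = build_prompt("Which country led in youth employment?", top_chunks)
print(prompt)  # this string is what would be sent to the LLM
```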



In [2]:
from llama_index.core import VectorStoreIndex

pwc_vector_store_index = VectorStoreIndex.from_vector_store(
    chroma_vector_store,
    embed_model=embedding_model,
)
pwc_query_engine = pwc_vector_store_index.as_query_engine(
    llm=llm,
    similarity_top_k=5,  # retrieve the 5 most similar chunks
)

Next, we invoke the pipeline with a question the system should know the answer to.

In [None]:
result = pwc_query_engine.query("Which country was the best in youth employment in 2024?")

In [16]:
print(result)

The Netherlands.


## Final Thoughts

### Why Llama Index?
Llama Index is a versatile and efficient framework for building RAG applications. It is specifically designed for advanced indexing and knowledge management, whereas other LLM libraries have their strengths elsewhere. For example:
- **LangChain** is, in my experience, better suited to creating and managing general LLM applications, agentic behavior, and chatbots.

As demonstrated throughout the notebook, once its core abstractions are understood, Llama Index is relatively easy to use and requires significantly less boilerplate code than implementing such functionality from scratch. Additionally, it integrates well with other LLM frameworks, such as **LangChain** and **LiteLLM**.

This application did not utilize all the advanced techniques the framework offers due to certain limitations, such as relying on **Ollama** because of my unsupported AMD GPU. Ollama does not support rerankers, which limited the framework's full potential. This brings us to the next question:

### What Could Be Improved?

1. **Auto Evaluation**  
   Auto evaluation is the first thing I would implement. I could not get it to work effectively in this project. The **Llama 3.2 3b** model also struggled to create a robust test set, which is crucial for meaningful auto evaluation. I would have used metrics such as **faithfulness** and **correctness**. Additionally, I would have experimented with the **Ragas** framework for agent evaluation, but its integration currently suffers from a bug that has yet to be fixed.

2. **Customization of Prompts**  
   While the performance was acceptable, I am confident that with some prompt engineering, even the **Llama 3.2 3b** model could deliver more consistent responses.
