# Retrieval Augmented Generation based Question Answering pipeline
In this notebook, I show how I built a Retrieval Augmented Generation (RAG) pipeline for question answering over publicly available PDF and HTML data from PWC's website. I rely on LlamaIndex, an LLM application library, throughout. The notebook consists of 2 main chapters:
1. Data Loading Pipeline: Assembling and running the pipeline that will load, transform and store our input data
2. Question Answering Pipeline: Assembling and testing the pipeline that will generate answers for the posed questions

Before we start, let's install the necessary libraries and initialize some variables that we will use throughout the notebook.

## Installation of required packages
Run the cell below. You only need to run it the first time you open this notebook; after that, you can skip it.

In [16]:
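# The PyPI package "fitz" is unrelated to PyMuPDF and can conflict with it, so only pymupdf is installed
!pip install llama-index llama-index-embeddings-ollama llama-index-llms-ollama llama-index-vector-stores-chroma llama-index-readers-file pymupdf spacy nest-asyncio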

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain 0.3.10 requires numpy<2,>=1.22.4; python_version < "3.12", but you have numpy 2.0.2 which is incompatible.
langchain-community 0.3.10 requires numpy<2,>=1.22.4; python_version < "3.12", but you have numpy 2.0.2 which is incompatible.


## Initialize shared resources
You need to run this cell every time you restart your kernel.

In [1]:
from llama_index.core.ingestion import IngestionPipeline, DocstoreStrategy, IngestionCache
from llama_index.core import SimpleDirectoryReader
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.readers.file import PyMuPDFReader, HTMLTagReader
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.llms.ollama import Ollama
import chromadb
import nest_asyncio

# Set configuration for Ollama
ollama_config = {
    "base_url": "127.0.0.1:11434",
    "embedding_model_name": "nomic-embed-text",
    "llm_name": "llama3.2"
}

# Set project data paths
project_data_paths = {
    "input_data_dir_path": "../data",
    "evaluation_results_dir_path": "../results",
    "vector_db_data_dir_path": "../vector_db_data",
    "pipeline_cache_dir_path": "../pipeline_cache",
}

# Define common resources
embedding_model = OllamaEmbedding(base_url=ollama_config["base_url"], model_name=ollama_config["embedding_model_name"])
llm = Ollama(model=ollama_config["llm_name"], base_url=ollama_config["base_url"])
chroma_client = chromadb.PersistentClient(project_data_paths["vector_db_data_dir_path"])
chroma_collection = chroma_client.get_or_create_collection("pwc_data")
chroma_vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
nest_asyncio.apply()


## Data Loading Pipeline
In this phase, we are going to load the PDF and HTML input documents.

### What the code does
- Instantiates and configures necessary objects to assemble the ingestion pipeline
- Instantiates and configures the data ingestion pipeline

### Loaders
- PyMuPDFReader: Used to load PDF documents. I chose this loader because it relies on the well-known PyMuPDF PDF parsing library, which can identify tables and other objects in the files that are not necessarily text based. Another good choice would have been the SmartPDFLoader, but it requires a hosted llmsherpa backend service, which would further complicate the project. Based on my previous experience and the loader's current performance, I found PyMuPDFReader sufficient for now.
- HTMLTagReader: Used to load HTML documents. It relies on the well-known BeautifulSoup library and extracts text only from the specified HTML tags, which filters out unused data such as JavaScript.
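
To make the tag-based filtering concrete, here is a minimal sketch that uses only Python's standard `html.parser` module, not BeautifulSoup or the actual HTMLTagReader. The class name `SectionTextExtractor` and the sample HTML are hypothetical; the sketch only illustrates the idea of keeping text inside a chosen tag while dropping script content and everything outside that tag.

```python
from html.parser import HTMLParser


class SectionTextExtractor(HTMLParser):
    """Collects text found inside a chosen tag, skipping <script>/<style> content."""

    def __init__(self, tag="section"):
        super().__init__()
        self.tag = tag
        self.depth = 0        # number of target tags we are currently inside
        self.skip_depth = 0   # number of <script>/<style> tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when inside the target tag and outside script/style
        if text and self.depth > 0 and self.skip_depth == 0:
            self.chunks.append(text)


# Hypothetical sample page: only the <section> text should survive
sample_html = """
<html><body>
  <script>console.log("tracking");</script>
  <nav>Home | About</nav>
  <section id="intro"><h2>Section title</h2><p>Section body.</p>
    <script>var x = 1;</script>
  </section>
</body></html>
"""

extractor = SectionTextExtractor(tag="section")
extractor.feed(sample_html)
section_text = " ".join(extractor.chunks)
print(section_text)
```

The navigation text and both scripts are discarded, which is the same kind of noise reduction the reader is used for in this notebook.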

### SimpleDirectoryReader
This class is responsible for loading files from a directory and parsing them with the specified readers. It supports external file systems too. In the code, I configured it to use the PyMuPDFReader for .pdf files and the HTMLTagReader for .html files. This gives a unified loader, so the documents from both PDF and HTML files are treated equally. Of course, it is possible to treat them separately, but that is not necessary in our use case, since the content of the two file types belongs to the same data domain.
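
The extension-based dispatch can be illustrated with a small, library-free sketch. The function `load_directory` and the two stub readers are hypothetical stand-ins, not LlamaIndex code, and unlike the real class this simplified version just skips files whose extension has no registered reader.

```python
import tempfile
from pathlib import Path


def read_pdf_stub(path):
    # Stand-in for a real PDF reader
    return f"[pdf] {path.name}"


def read_html_stub(path):
    # Stand-in for a real HTML reader
    return f"[html] {path.name}"


def load_directory(input_dir, file_extractor):
    """Route each file to the reader registered for its extension."""
    documents = []
    for path in sorted(Path(input_dir).iterdir()):
        reader = file_extractor.get(path.suffix.lower())
        if reader is None:
            continue  # simplification: ignore unregistered file types
        documents.append(reader(path))
    return documents


file_extractor = {".pdf": read_pdf_stub, ".html": read_html_stub}

# Build a throwaway directory with three placeholder files
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("a.pdf", "b.html", "c.txt"):
        (Path(tmp_dir) / name).write_text("placeholder")
    loaded = load_directory(tmp_dir, file_extractor)

print(loaded)
```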

### IngestionPipeline
The ingestion pipeline is responsible for transforming the documents into Node objects, generating embeddings for them, and storing the Node-embedding pairs in a vector database. Additionally, it manages a document store and a cache, so that when it is run again on the same data, it can reuse the cached values instead of performing a full load. (In this notebook, the caching currently does not work; I have not yet found the reason.)
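
The idea behind skipping unchanged documents on a re-run can be sketched as follows. This is a simplified illustration of hash-based upserting, not LlamaIndex's actual implementation; the function names and the plain dictionary used as a document store are assumptions made for the example.

```python
import hashlib


def content_hash(text):
    """Fingerprint a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_documents_to_process(documents, docstore):
    """Return the ids of new or changed documents and record their hashes.

    documents: dict mapping doc_id -> text
    docstore:  dict mapping doc_id -> content hash (updated in place)
    """
    to_process = []
    for doc_id, text in documents.items():
        new_hash = content_hash(text)
        if docstore.get(doc_id) != new_hash:
            docstore[doc_id] = new_hash
            to_process.append(doc_id)
    return to_process


docstore = {}
# First run: everything is new, so everything is processed
first_run = select_documents_to_process({"a.pdf": "alpha", "b.html": "beta"}, docstore)
# Second run: only the document whose content changed is processed again
second_run = select_documents_to_process({"a.pdf": "alpha", "b.html": "beta v2"}, docstore)
print(first_run, second_run)
```

Note that in this sketch the skipping only works because the same `docstore` object is reused between the two runs.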

### SemanticSplitterNodeParser
SemanticSplitterNodeParser is a Node parser, that is, a chunking method. I used it because it doesn't use a static window size. Node parsers with static window sizes are unaware of the topics inside a document, so they can easily combine text from two very different chapters. The problem with this is that the chunk's embedding will be weak and uninformative, because it mixes two unrelated topics. In the worst case, such a chunk might never be used in the RAG pipeline, because no query will ever be similar enough to it.

In contrast, SemanticSplitterNodeParser creates chunks by taking their meaning into account. It breaks the documents into sentences, generates an embedding for each of them, and compares the embeddings of neighbouring sentences. If they are similar enough, the sentences become part of the same node/chunk; if not, a new node/chunk is created with the sentence as its first item. Thanks to this approach, it produces dynamically sized chunks that each encapsulate one topic, so the document is chunked without blending separate topics together.
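
The neighbour-comparison idea described above can be sketched in a few lines. This is a toy illustration, not the library's implementation: the embeddings are hand-made 2D vectors instead of model outputs, and the fixed `threshold` of 0.8 is an assumption chosen for the example (the function name `semantic_split` is hypothetical too).

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_split(sentences, embeddings, threshold=0.8):
    """Group consecutive sentences while neighbouring embeddings stay similar."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine_similarity(embeddings[i - 1], embeddings[i]) >= threshold:
            chunks[-1].append(sentences[i])   # same topic: extend current chunk
        else:
            chunks.append([sentences[i]])     # topic shift: start a new chunk
    return [" ".join(chunk) for chunk in chunks]


# Toy sentences with hand-made embeddings (first two and last two are close)
sentences = [
    "Cats purr.",
    "Cats sleep a lot.",
    "Taxes are due in spring.",
    "Tax forms are long.",
]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]

chunks = semantic_split(sentences, embeddings)
for chunk in chunks:
    print(chunk)
```

The split lands between the second and third sentence, where the similarity of neighbouring embeddings drops, so the chunk sizes follow the topics rather than a fixed window.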




In [2]:
# Define unified directory file loader
pdf_reader = PyMuPDFReader()
html_reader = HTMLTagReader(tag="section", ignore_no_id=True)
file_extractor = {".pdf": pdf_reader, ".html": html_reader}
document_reader = SimpleDirectoryReader(
    input_dir=project_data_paths["input_data_dir_path"], file_extractor=file_extractor
)

# Define unified document processing pipeline
pwc_document_processing_pipeline = IngestionPipeline(
    name="PWC document ingestion pipeline",
    project_name="PWC example project",
    docstore=SimpleDocumentStore(),
    docstore_strategy=DocstoreStrategy.UPSERTS,
    transformations=[SemanticSplitterNodeParser(embed_model=embedding_model), embedding_model],
    vector_store=chroma_vector_store,
    cache=IngestionCache()
)

Lastly, we run the loading pipeline and then the transforming/storing pipeline:

In [3]:
documents = document_reader.load_data(show_progress=True, num_workers=10)
pwc_document_processing_pipeline.run(documents=documents, num_workers=10, cache_collection="pwc_cache")
pwc_document_processing_pipeline.persist(persist_dir=project_data_paths["pipeline_cache_dir_path"])

## Question Answering Pipeline
In this phase, we create the pipeline that will generate the answers to the questions.

### Answer generation architecture
Here, we first create a VectorStoreIndex, an object that provides access to the previously indexed data stored in the vector database. It is then used in a Query Engine component, which is responsible for orchestrating the response generation. When the pipeline is invoked, the Query Engine first calls the embedding model to embed the user's query. This embedding is passed to the VectorStoreIndex to retrieve the 5 chunks most similar to the question embedding. After receiving these chunks, the Query Engine invokes the ResponseSynthesizer module (part of the Query Engine by default), which concatenates the 5 chunks and inserts them into a prompt that is sent to the LLM for response generation. Finally, the Query Engine returns the LLM's response, which is the answer to our question.
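
The retrieve-then-synthesize flow can be illustrated with a small, library-free sketch. Everything here is a toy assumption: the 2D embeddings are hand-made, `retrieve_top_k` and `build_prompt` are hypothetical helpers, and the prompt template is not LlamaIndex's actual default. The LLM call itself is left out.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, chunk_embeddings, k):
    """Return the indices of the k chunks most similar to the query."""
    ranked = sorted(
        range(len(chunk_embeddings)),
        key=lambda i: cosine_similarity(query_embedding, chunk_embeddings[i]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Concatenate the retrieved chunks and wrap them in a question prompt."""
    context = "\n\n".join(context_chunks)
    return (
        "Context information is below.\n"
        f"{context}\n\n"
        f"Using only the context, answer the question: {question}"
    )


# Toy stored chunks with hand-made embeddings
chunks = ["Chunk about employment.", "Chunk about taxes.", "Chunk about both topics."]
chunk_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

question = "What does the report say about employment?"
query_embedding = [0.9, 0.1]  # stands in for the embedded question

top_indices = retrieve_top_k(query_embedding, chunk_embeddings, k=2)
prompt = build_prompt(question, [chunks[i] for i in top_indices])
print(prompt)
```

In the real pipeline the prompt would then be sent to the LLM; here it is only printed so the retrieved context is visible.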



In [14]:
from llama_index.core import VectorStoreIndex

pwc_vector_store_index = VectorStoreIndex.from_vector_store(
    chroma_vector_store,
    embed_model=embedding_model,
)
pwc_query_engine = pwc_vector_store_index.as_query_engine(
    llm=llm,
    similarity_top_k=5  # correct keyword; a misspelled name would be silently ignored
)

Invoking the pipeline with a question the system should know the answer to:

In [15]:
result = pwc_query_engine.query("Which country was the best in youth employment in 2024?")

INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/embeddings "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


In [16]:
print(result)

The Netherlands.
