# Data ingestion for vector store
Research topics:
* PDF document loaders 
* Chunk strategies

## Notes on experiments
- For the images, OCR does not seem to add much value: the pictures are difficult and have an uncommon structure.
--> An LLM with multimodal capabilities may be better suited.


### Notes on comparison OCR vs LLM for image parsing
- RapidOCR (~2 min) is much quicker than o4-mini (14 min 03 s).
- RapidOCR is much cheaper (free) than o4-mini (51K tokens: 38K completion and 13K prompt = $0.1815).
- The total number of requests is 69 (approximately the 61 pages + 7 figures) --> only 7 figures may be of interest; the additional requests are caused by the logo at the top left.
- The Pension Federation at the top of the page may also use up considerable resources.
- Image descriptions by PyMuPDF are not always inserted at the position of the image in the original document; they may also end up at the end of the page.

- PyMuPDF4LLM (9 min 47 s with o4-mini)
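
The o4-mini cost in the list above can be reproduced with simple arithmetic. The sketch below assumes prices of $1.10 per 1M prompt tokens and $4.40 per 1M completion tokens. These prices are an assumption (they are not stated in the notes) and were chosen because they reproduce the reported figure; the token counts are the rounded values from the notes.

```python
# Assumed prices in USD per 1M tokens (not stated in the notes above).
PROMPT_PRICE_PER_M = 1.10
COMPLETION_PRICE_PER_M = 4.40


def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Return the estimated cost in USD for a batch of requests."""
    prompt_cost = prompt_tokens / 1_000_000 * PROMPT_PRICE_PER_M
    completion_cost = completion_tokens / 1_000_000 * COMPLETION_PRICE_PER_M
    return prompt_cost + completion_cost


# Rounded token counts from the o4-mini run described above.
cost = estimate_cost(prompt_tokens=13_000, completion_tokens=38_000)
print(f"Estimated cost: ${cost:.4f}")
```

Under these assumptions the completion tokens account for most of the cost, so shorter image descriptions would likely reduce it more than fewer prompt tokens.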

### PyPDF with standard config

In [1]:
import os

FILE_PATH_TO_EMBED = os.path.join("data/raw", "pension-martijn-files", "Kader Datakwaliteit - wet toekomst pensioenen.pdf.pdf")
DIRECTORY_TO_EMBED = os.path.join("data", "raw", "pension-martijn-files")


In [None]:
import warnings

# Load a single PDF:
# from langchain_community.document_loaders import PyPDFLoader
# loader = PyPDFLoader(FILE_PATH_TO_EMBED)

# Load a directory of PDFs
from langchain_community.document_loaders import PyPDFDirectoryLoader
loader = PyPDFDirectoryLoader(DIRECTORY_TO_EMBED)

# Top-level `async for` works in a notebook cell, not in a plain script
pages = []
async for page in loader.alazy_load():
    pages.append(page)

# Threshold above which embedding costs may become significant
MAX_PAGES_WITHOUT_WARNING = 50

print(f"LOADED DOCUMENT WITH {len(pages)} PAGES")
if len(pages) > MAX_PAGES_WITHOUT_WARNING:
    warnings.warn(f"MORE THAN {MAX_PAGES_WITHOUT_WARNING} PAGES LOADED, MAY INCUR SIGNIFICANT COSTS")

# Warn when a page has no extracted text (None or empty string)
for page in pages:
    if not page.page_content:
        warnings.warn("FOUND PAGES IN DOCUMENT WITHOUT PAGE_CONTENT")

LOADED DOCUMENT WITH 130 PAGES
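
The cell above warns that empty pages exist, but not which ones. A minimal sketch of reporting them is shown below. `Page` is a hypothetical stand-in that only mimics the `page_content` and `metadata` attributes of the loaded pages, `find_empty_pages` is an illustrative helper, and the sample data is made up. Unlike the check above, it also treats whitespace-only pages as empty.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page (mimics `page_content` and `metadata`)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def find_empty_pages(pages):
    """Return (source, page_index) for every page without usable text."""
    empty = []
    for page in pages:
        # Treat None, "" and whitespace-only content as empty.
        if not (page.page_content or "").strip():
            empty.append((page.metadata.get("source"), page.metadata.get("page")))
    return empty


# Made-up example pages; real pages come from the loader above.
sample_pages = [
    Page("Some text", {"source": "a.pdf", "page": 0}),
    Page("", {"source": "a.pdf", "page": 1}),
    Page("   \n", {"source": "b.pdf", "page": 0}),
]
empty_pages = find_empty_pages(sample_pages)
print(empty_pages)
```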




Identified issues:
1. Footnotes
2. Figures are not included
3. Tables lose their structure
4. Links are not included
5. Loss of layout and structure, which matters for titles, articles, etc.
6. The text is split into pages, but a page break is not always a logical end; splitting per section may be better.

### PyMuPDF with image parser and extract tables

In [7]:
import os

from dotenv import load_dotenv
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_openai import AzureChatOpenAI
from langchain_community.document_loaders.parsers.images import TesseractBlobParser, RapidOCRBlobParser, LLMImageBlobParser


load_dotenv()
# Tesseract is not available yet: the executable must be installed by a system admin
# images_parser = TesseractBlobParser(langs=('nld',))
# images_parser = RapidOCRBlobParser()
# model = ChatGoogleGenerativeAI(model="model/gemini-2.5-pro-exp-03-25")  # not tested yet
model = AzureChatOpenAI(
    model="o4-mini",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT_SWEDEN"],
    api_version="2025-01-01-preview",
    api_key=os.environ["AZURE_OPENAI_API_KEY_SWEDEN"],
)
print(model)
# Use the LLM to describe the images found in the PDF
images_parser = LLMImageBlobParser(model=model)

client=<openai.resources.chat.completions.completions.Completions object at 0x0000020A7FE54980> async_client=<openai.resources.chat.completions.completions.AsyncCompletions object at 0x0000020A7FE69160> root_client=<openai.lib.azure.AzureOpenAI object at 0x0000020A7FC7A3C0> root_async_client=<openai.lib.azure.AsyncAzureOpenAI object at 0x0000020A7FE54AD0> model_name='o4-mini' model_kwargs={} openai_api_key=SecretStr('**********') disabled_params={'parallel_tool_calls': None} azure_endpoint='https://openai-playground-bjorn-sweden.openai.azure.com' openai_api_version='2025-01-01-preview' openai_api_type='azure'


In [None]:

from langchain_community.document_loaders import PyMuPDFLoader

# Test the loader on a single file
loader = PyMuPDFLoader(
    FILE_PATH_TO_EMBED,
    extract_images=True,
    images_parser=images_parser,
    extract_tables="markdown",
)

pages = []
async for page in loader.alazy_load():
    pages.append(page)

client=<openai.resources.chat.completions.completions.Completions object at 0x0000027927C76AD0> async_client=<openai.resources.chat.completions.completions.AsyncCompletions object at 0x0000027927C7AE90> root_client=<openai.lib.azure.AzureOpenAI object at 0x00000279279EA490> root_async_client=<openai.lib.azure.AsyncAzureOpenAI object at 0x0000027927C76C10> model_name='o4-mini' model_kwargs={} openai_api_key=SecretStr('**********') disabled_params={'parallel_tool_calls': None} azure_endpoint='https://openai-playground-bjorn-sweden.openai.azure.com' openai_api_version='2025-01-01-preview' openai_api_type='azure'


Issues identified earlier with PyPDF, now partly resolved with PyMuPDF:
1. Footnotes --> unresolved
2. Figures are not included --> improved with an LLM for image parsing
3. Tables lose their structure --> improved with PyMuPDF
4. Links are not included --> unresolved (may not be as important)
5. Loss of layout and structure, which matters for titles, articles, etc. --> unresolved
6. The text is split into pages, but a page break is not always a logical end; splitting per section may be better. --> unresolved (may be a LangChain issue, not investigated yet)

### PyMuPDF4LLM with image_parser and extract tables AND returning markdown
In single mode and in page mode, loading takes ~25 s without image parsing.


In [8]:
from langchain_pymupdf4llm import PyMuPDF4LLMLoader

# https://pymupdf.readthedocs.io/en/latest/page.html#Page.find_tables
# Request a table detection strategy. Valid values are “lines”, “lines_strict” and “text”.
# Default is “lines” which uses all vector graphics on the page to detect grid lines.
# Strategy “lines_strict” ignores borderless rectangle vector graphics. Sometimes single text pieces have background colors which may lead to false columns or lines. This strategy ignores them and can thus increase detection precision.
# If “text” is specified, text positions are used to generate “virtual” column and / or row boundaries. Use min_words_* to request the number of words for considering their coordinates.
# Use parameters vertical_strategy and horizontal_strategy instead for a more fine-grained treatment of the dimensions.

loader = PyMuPDF4LLMLoader(
    FILE_PATH_TO_EMBED,
    mode="single",
    extract_images=True,
    images_parser=images_parser,
    table_strategy="lines_strict",
)
# Without image parsing: PyMuPDF4LLMLoader(FILE_PATH_TO_EMBED, mode="single")

pages = []
async for page in loader.alazy_load():
    pages.append(page)

Issues identified earlier with PyPDF, partly resolved with PyMuPDF and PyMuPDF4LLM:
1. Footnotes are included, but at the place where they appear on the page rather than at the footnote reference (so an LLM will not understand them) --> unresolved
2. Figures are not included --> [PyMuPDF] improved with an LLM for image parsing
3. Tables lose their structure --> [PyMuPDF] improved with PyMuPDF
4. Links are not included --> [PyMuPDF4LLM] resolved with Markdown formatting
5. Loss of layout and structure, which matters for titles, articles, etc. --> [PyMuPDF4LLM] improved with Markdown formatting
6. The text is split into pages, but a page break is not always a logical end; splitting per section may be better. --> [PyMuPDF] mode='single'
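
With `mode="single"` the whole document comes back as one Markdown string, so splitting per section (issue 6) becomes a matter of splitting on headings. The sketch below is a simplified, standard-library illustration and has not been tested on the full document. `split_markdown_by_heading` is a hypothetical helper, the sample text is made up to resemble the output in this notebook, and `#` lines inside fenced code blocks are not handled. LangChain also ships a `MarkdownHeaderTextSplitter` in `langchain_text_splitters`, which may be the better choice in the pipeline.

```python
import re

# Matches ATX headings such as "# Title" or "## Section".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown_by_heading(markdown: str):
    """Split Markdown into (heading, body) pairs; text before the first heading gets heading None."""
    sections = []
    current_heading = None
    current_lines = []

    def flush():
        # Store the section collected so far, skipping an empty preamble.
        if current_heading is not None or any(line.strip() for line in current_lines):
            sections.append((current_heading, "\n".join(current_lines).strip()))

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            current_heading = match.group(2)
            current_lines = []
        else:
            current_lines.append(line)
    flush()
    return sections


# Made-up sample that resembles the Markdown output of the loader.
sample = """KENMERK: D/2022/718/OH
# Kader Datakwaliteit
Intro text.
## Inhoudsopgave
1. INLEIDING KADER
"""
sections = split_markdown_by_heading(sample)
for heading, body in sections:
    print(repr(heading), "->", repr(body))
```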

Test conclusion:
The tests were performed on one PDF, namely Kader Datakwaliteit - wet toekomst pensioenen.pdf.
The experiments identified several issues with converting PDF to raw text. All identified issues were addressed except footnotes, which are not the most important issue to focus on because they likely add little value.

Future direction:
The following parsers may be interesting because they also handle other document types; the pipeline currently handles only PDFs.
* https://www.llamaindex.ai/llamaparse
* https://unstructured.io/

In [9]:
with open(os.path.join("./experiment-output", "pypdf4llm-on-data-quality-output-SINGLE.md"), 'w', encoding='utf-8') as f:
    f.write(pages[0].page_content)

In [12]:
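import ntpath

def print_document_page(page):
    # Page numbers in the metadata are zero-based; the offset and the available
    # metadata keys can differ per PDF loader.
    correction_index = 1
    # ntpath.basename accepts both "/" and "\\" separators, as found in the source paths.
    source = ntpath.basename(page.metadata["source"])
    print(f"--- source {source} ---")
    print(f"--- page {page.metadata['page'] + correction_index} / {page.metadata['total_pages']} ---")
    print(page)
    print("---end of page")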
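
The directory loader returns the pages of all PDFs in one flat list, and the separators in the `source` path differ per loader. A minimal sketch of grouping pages per source file follows, so that each document can be printed without hard-coded slice offsets. `Page` is a hypothetical stand-in for the loaded pages, `group_pages_by_source` is an illustrative helper, and the sample paths are made up.

```python
from collections import defaultdict
from dataclasses import dataclass, field
import ntpath


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page (mimics `page_content` and `metadata`)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def group_pages_by_source(pages):
    """Map each source file name to its pages, keeping the load order."""
    grouped = defaultdict(list)
    for page in pages:
        # ntpath.basename handles both "/" and "\\" separators on any OS.
        name = ntpath.basename(page.metadata.get("source", "unknown"))
        grouped[name].append(page)
    return dict(grouped)


# Made-up example pages with mixed path separators.
sample_pages = [
    Page("p1", {"source": "data/raw\\files\\a.pdf", "page": 0}),
    Page("p2", {"source": "data/raw\\files\\a.pdf", "page": 1}),
    Page("p1", {"source": "data\\raw\\files\\b.pdf", "page": 0}),
]
grouped = group_pages_by_source(sample_pages)
for name, docs in grouped.items():
    print(f"{name}: {len(docs)} page(s)")
```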

In [29]:
for page in pages[:61]:
    print_document_page(page)

--- source Kader Datakwaliteit - wet toekomst pensioenen.pdf.pdf ---
--- page 1 / 61 ---
page_content='DATUM: 11 oktober 2022

ONDERWERP: Kader Datakwaliteit

KENMERK: D/2022/718/OH
# Kader Datakwaliteit –  Wet Toekomst Pensioenen

' metadata={'producer': 'Microsoft® Word voor Microsoft 365', 'creator': 'Microsoft® Word voor Microsoft 365', 'creationdate': '2022-10-23T18:44:56+02:00', 'source': 'data/raw\\pension-martijn-files\\Kader Datakwaliteit - wet toekomst pensioenen.pdf.pdf', 'file_path': 'data/raw\\pension-martijn-files\\Kader Datakwaliteit - wet toekomst pensioenen.pdf.pdf', 'total_pages': 61, 'format': 'PDF 1.7', 'title': '', 'author': 'Otto Hulst', 'subject': '', 'keywords': '', 'moddate': '2022-10-23T18:44:56+02:00', 'trapped': '', 'modDate': "D:20221023184456+02'00'", 'creationDate': "D:20221023184456+02'00'", 'page': 0}
---end of page
--- source Kader Datakwaliteit - wet toekomst pensioenen.pdf.pdf ---
--- page 2 / 61 ---
page_content='## Inhoudsopgave

1. INLEIDING KADER

In [None]:
for page in pages[61:105]:
    print_document_page(page)

--- source pensioenregelement Stichting Pensioenfonds van De Nederlandsche Bank NV.pdf ---
--- page 1 / 44 ---
page_content='Pensioenregeling van 
 
Stichting Pensioenfonds van 
 
De Nederlandsche Bank NV 
 
(PENSIOENREGLEMENT 2021) 
 
 
(Uitgave januari 2025) 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Stichting Pensioenfonds van 
De Nederlandsche Bank NV' metadata={'producer': 'Microsoft® Word voor Microsoft 365', 'creator': 'Microsoft® Word voor Microsoft 365', 'creationdate': '2025-02-13T08:20:46+01:00', 'title': 'Pensioenregeling van', 'author': 'Jullens', 'moddate': '2025-02-13T08:20:46+01:00', 'source': 'data\\raw\\pension-martijn-files\\pensioenregelement Stichting Pensioenfonds van De Nederlandsche Bank NV.pdf', 'total_pages': 44, 'page': 0, 'page_label': '1'}
--- source pensioenregelement Stichting Pensioenfonds van De Nederlandsche Bank NV.pdf ---
--- page 2 / 44 ---
page_content='2 
Dit document bevat de tekst met de bijlagen I t/m VI van het  
Pensioenreglement 2021 van

In [36]:
for page in pages[105:]:
    print_document_page(page)

--- source PENSIOENREGLEMENT 2018 (per 01.01.2025) Van Stichting Pensioenfonds Rockwool, voor de pensioenregeling van Rockwool B.V.pdf ---
--- page 1 / 25 ---
page_content='PENSIOENREGLEMENT 2018 (per 01.01.2025) 
 
 
Van Stichting Pensioenfonds Rockwool, 
voor de pensioenregeling van Rockwool B.V.' metadata={'producer': 'Microsoft® Word voor Microsoft 365', 'creator': 'Microsoft® Word voor Microsoft 365', 'creationdate': '2025-01-20T16:52:33+01:00', 'author': 'Frank Reuling', 'moddate': '2025-01-20T16:52:33+01:00', 'source': 'data\\raw\\pension-martijn-files\\PENSIOENREGLEMENT 2018 (per 01.01.2025) Van Stichting Pensioenfonds Rockwool, voor de pensioenregeling van Rockwool B.V.pdf', 'total_pages': 25, 'page': 0, 'page_label': '1'}
---end of page
--- source PENSIOENREGLEMENT 2018 (per 01.01.2025) Van Stichting Pensioenfonds Rockwool, voor de pensioenregeling van Rockwool B.V.pdf ---
--- page 2 / 25 ---
page_content='Pensioenreglement 2018 
Stichting Pensioenfonds Rockwool 
        
Toe

# Text splitters


In [None]:
# For brevity we will drop a few pages from Solvency II to test the embedding cost for a cheap model
from langchain_text_splitters import RecursiveCharacterTextSplitter

print(f"DOCUMENT WITH {len(pages)} PAGES")
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,   # Each chunk will be at most 1000 characters
    chunk_overlap=100  # Up to 100 characters of overlap between consecutive chunks
)

split_pages = text_splitter.transform_documents(pages)

print(f"NUMBER OF CHUNKS: {len(split_pages)}")
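
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below cuts a text into fixed windows that share a fixed number of characters. This is a deliberate simplification and not what `RecursiveCharacterTextSplitter` does: that splitter prefers to break on separators such as paragraph and line breaks, so its chunk counts will differ. `sliding_window_chunks` is a hypothetical helper and the sample text is made up.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 100):
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Made-up text of 2500 characters.
sample_text = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample_text)
print([len(chunk) for chunk in chunks])
```

With a step of 900 characters (1000 minus 100), each additional 900 characters of text adds roughly one chunk, which gives a rough lower bound on the number of embedding calls.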