# RAG - LLM - Parsing - Chunking

Parsing is the process of extracting raw text from documents such as PDFs, .docx files, YouTube videos, and so on. The right approach depends on the type of data you want to parse.

For this project, only PDFs will be parsed.

## Load libraries

In [1]:
import pymupdf
import pymupdf4llm
from datetime import datetime
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
import json
import sys

sys.path.append("..")

from gcp_utils.gcs import get_file, upload_file_from_memory
from rag_llm_energy_expert.config import GCP_CONFIG, LLM_CONFIG
from rag_llm_energy_expert.load_document import parse_file


## Initialize config classes

In [2]:
gcp_config = GCP_CONFIG()
llm_config = LLM_CONFIG()

## Parsing PDFs

Many libraries can extract data from PDFs. This notebook uses [*PyMuPDF4LLM*](https://pypi.org/project/pymupdf4llm/), which is built on top of PyMuPDF, because it:

- Detects standard text and tables.
- Identifies header lines via the font size and prefixes them appropriately with one or more '#' tags.
- Detects bold, italic, mono-spaced text and code blocks and formats them accordingly.
- Processes all document pages by default.
- Supports pages with multiple text columns.
- Supports images and vector graphics on the page, which are stored as images.
- ***Supports page chunks***. Instead of returning one large string for the whole document, it can return a list of dictionaries, one per page.

*All the data parsed here comes from GCP*

Reading the PDF into memory is typically faster than downloading it to the file system and then reading the file from there. It is also useful when you do not have persistent storage or when you want to work directly with the file contents.

### Extracting data from PDF

First, set the path to the PDF stored in GCS

In [3]:
file_to_read = "documents/summaries/resumen_reforma_energetica.pdf"

Next, obtain some metadata from the file path, such as the title and the storage path, which will be attached to each chunk. Any other data can be added to the metadata as well.

In [32]:
# Get the name / title of the file
title = file_to_read.split("/")[-1].split(".")[0]

# Build the gcs storage path
storage_path = f"gs://{gcp_config.BUCKET_NAME}/{file_to_read}"
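
The same title and path can also be derived with `pathlib`, which only drops the final extension, whereas `split(".")[0]` cuts a name such as `report.v2.pdf` down to `report`. This is a minimal sketch: the helper name `describe_blob` is hypothetical, and the `bucket_name` argument is a stand-in for `gcp_config.BUCKET_NAME`.

```python
from pathlib import PurePosixPath


def describe_blob(blob_path: str, bucket_name: str) -> dict:
    """Return the title and gs:// URI for an object path inside a bucket."""
    path = PurePosixPath(blob_path)
    return {
        # .stem drops only the final extension: "report.v2.pdf" -> "report.v2"
        "title": path.stem,
        "storage_path": f"gs://{bucket_name}/{path}",
    }


info = describe_blob(
    "documents/summaries/resumen_reforma_energetica.pdf",
    bucket_name="rag_llm_energy_expert",  # stand-in for gcp_config.BUCKET_NAME
)
print(info["title"])
print(info["storage_path"])
```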


Download the PDF into memory as bytes

In [None]:
# Load a PDF stored in GCS into memory
pdf_bytes = get_file(file_to_read)

Read the PDF using pymupdf, which creates a Document; each Document is made of Pages.

Then, using pymupdf4llm, convert the PDF text into Markdown, which returns a single string with the converted text.

In [None]:
# Create a Document object; it can be constructed from a file or from memory.
# pymupdf.open() is an alias of pymupdf.Document().
doc = pymupdf.Document(stream=pdf_bytes, filetype="pdf")

# Convert the PDF to Markdown. Returns one string with all the content,
# or a list of per-page dictionaries when page_chunks=True.
md_text = pymupdf4llm.to_markdown(
    doc,
    # page_chunks=True,    # Return a list with one dictionary per page
    # extract_words=True,  # Add the words found on each page to its dictionary
    show_progress=False,
)

md_text

'# Resumen Ejecutivo\n\n\n-----\n\n-----\n\n## I. Introducción\n\nLa Reforma Energética es un paso decidido rumbo a la modernización del sector energético de\nnuestro país, sin privatizar las empresas públicas dedicadas a la producción y al aprovechamiento de los hidrocarburos y de la electricidad. La Reforma Energética, tanto constitucional como a\nnivel legistlación secundarias, surge del estudio y valoración de las distintas iniciativas presentadas por los partidos políticos representados en el Congreso.\n\n### La Reforma Energética tiene los siguientes objetivos y premisas fundamentales:\n\n1. Mantener la propiedad de la Nación sobre los hidrocarburos que se encuentran en el subsuelo.\n2. Modernizar y fortalecer, sin privatizar, a Petróleos Mexicanos (Pemex) y a la Comisión Federal de Electricidad (CFE) como Empresas Productivas del Estado, 100% públicas y 100%\nmexicanas.\n3. Reducir la exposición del país a los riesgos financieros, geológicos y ambientales en las actividades de e

### Chunking the PDF text

In this case, we will use the text splitters from [langchain](https://python.langchain.com/api_reference/text_splitters/index.html). Mainly, we will use the [MarkdownHeaderTextSplitter](https://python.langchain.com/docs/how_to/markdown_header_metadata_splitter/) and the [RecursiveCharacterTextSplitter](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/recursive_text_splitter/).

The [MarkdownHeaderTextSplitter](https://python.langchain.com/docs/how_to/markdown_header_metadata_splitter/) splits the text on the chosen Markdown headers, such as '#' or '##'. It returns a list of Documents, where each Document is one split of the data and contains the split text along with its metadata.

To split each Document created by the Markdown splitter, we will use the [RecursiveCharacterTextSplitter](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/recursive_text_splitter/). It splits each Document into smaller chunks based on the parameters chunk_size and chunk_overlap, whose values depend on the embedding model to be used, and returns a list of smaller Documents.

By default, MarkdownHeaderTextSplitter strips the headers being split on from the output chunk's content. This can be disabled by setting *strip_headers = False*. It also strips white space and new lines. To preserve the original formatting of your Markdown documents, check out [ExperimentalMarkdownSyntaxTextSplitter](https://python.langchain.com/api_reference/text_splitters/markdown/langchain_text_splitters.markdown.ExperimentalMarkdownSyntaxTextSplitter.html).
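
To make the idea of header-based splitting concrete, the sketch below reimplements it in plain Python. It is a simplified illustration, not LangChain's implementation: the function name `split_by_headers` and its `max_level` parameter are hypothetical, and it does not handle '#' characters inside fenced code blocks. Header lines are kept in the text, which mirrors the effect of *strip_headers = False*.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown: str, max_level: int = 3) -> list[dict]:
    """Split Markdown into sections and attach the active headers as metadata."""
    sections: list[dict] = []
    active: dict[int, str] = {}  # header level -> current header text
    buffer: list[str] = []

    def flush() -> None:
        text = "\n".join(buffer).strip()
        if text:
            metadata = {f"Header {lvl}": active[lvl] for lvl in sorted(active)}
            sections.append({"metadata": metadata, "page_content": text})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()  # close the previous section before the headers change
            level = len(match.group(1))
            # A new header invalidates every header at the same or a deeper level
            for stale in [lvl for lvl in active if lvl >= level]:
                del active[stale]
            active[level] = match.group(2).strip()
        buffer.append(line)  # header lines stay in the section text
    flush()
    return sections


md = "# Title\n\nIntro text\n\n## Section A\n\nBody A\n\n## Section B\n\nBody B"
sections = split_by_headers(md)
for section in sections:
    print(section["metadata"], repr(section["page_content"]))
```

Each section carries the full header path that was active when its text was read, which is what later allows a retrieved chunk to be traced back to its place in the document.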

In [6]:
# Choose the headers to split on
headers_to_split_on=[("#", 'Header 1'), ("##", "Header 2"), ("###", "Header 3"), ("####", "Header 4"), ("#####", "Header 5")]

# Initialize a MarkdownHeaderTextSplitter object
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on, strip_headers=False)

# Split the text
md_header_splits = markdown_splitter.split_text(md_text)

print(len(md_header_splits))
md_header_splits

39


[Document(metadata={'Header 1': 'Resumen Ejecutivo'}, page_content='# Resumen Ejecutivo  \n-----  \n-----'),
 Document(metadata={'Header 1': 'Resumen Ejecutivo', 'Header 2': 'I. Introducción'}, page_content='## I. Introducción  \nLa Reforma Energética es un paso decidido rumbo a la modernización del sector energético de\nnuestro país, sin privatizar las empresas públicas dedicadas a la producción y al aprovechamiento de los hidrocarburos y de la electricidad. La Reforma Energética, tanto constitucional como a\nnivel legistlación secundarias, surge del estudio y valoración de las distintas iniciativas presentadas por los partidos políticos representados en el Congreso.'),
 Document(metadata={'Header 1': 'Resumen Ejecutivo', 'Header 2': 'I. Introducción', 'Header 3': 'La Reforma Energética tiene los siguientes objetivos y premisas fundamentales:'}, page_content='### La Reforma Energética tiene los siguientes objetivos y premisas fundamentales:  \n1. Mantener la propiedad de la Nación sob

Once the data has been split into header-based chunks, we can split them further to fit a specific chunk size and set the overlap between consecutive chunks. Any text splitter can be applied for this, such as RecursiveCharacterTextSplitter.

- **The chunk size depends on how many tokens the embedding model supports**
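
The interaction between chunk_size and chunk_overlap can be illustrated with a fixed-window splitter. This is a deliberately simplified sketch with a hypothetical function name (`chunk_with_overlap`): RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraphs and lines before falling back to smaller units. Like that splitter's default length function (`len`), the sketch counts characters, not tokens.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(text, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Because the sizes are measured in characters, a chunk of 256 characters usually contains fewer than 256 tokens, so it stays below a 256-token model limit with some margin.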

There's a [Hugging Face leaderboard](https://huggingface.co/spaces/mteb/leaderboard) that compares the performance of different embedding models. Some metrics to focus on are:

- Number of Parameters: 

    A higher value means the model requires more CPU/GPU memory to run

- Embedding Dimension:

    The dimension of the vectors produced

- Max tokens:

    How many tokens the model can process in one input. Text beyond this limit is truncated, so a higher value allows larger chunks.


This time, we'll be using the [*sentence-transformers/all-MiniLM-L6-v2*](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model:

- Number of Parameters: 22.7M (small)
- Embedding Dimension: 384
- Max tokens: 256

Another model that will be used is [*sentence-transformers/all-mpnet-base-v2*](https://huggingface.co/sentence-transformers/all-mpnet-base-v2):

- Number of Parameters: 109M (medium)
- Embedding Dimension: 768
- Max tokens: 384

In [None]:
model_name1 = 'sentence-transformers/all-MiniLM-L6-v2'
model_name2 = "sentence-transformers/all-mpnet-base-v2"

Comparing the chunks generated for each embedding model

In [None]:
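# Splitting the initial chunks based on the embedding model 1
# Note: chunk_size and chunk_overlap are measured in characters (the default
# length function is len), not in tokens. The default separators are used.

chunk_size1 = 256
chunk_overlap1 = 20

text_splitter = RecursiveCharacterTextSplitter(chunk_size = chunk_size1, chunk_overlap = chunk_overlap1)

chunks_model1 = text_splitter.split_documents(md_header_splits)

len(chunks_model1)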

309

In [None]:
# Splitting the initial chunks based on the embedding model 2
# As above, the sizes are measured in characters and the default separators are used.

chunk_size2 = 384
chunk_overlap2 = 0

text_splitter = RecursiveCharacterTextSplitter(chunk_size = chunk_size2, chunk_overlap = chunk_overlap2)

chunks_model2 = text_splitter.split_documents(md_header_splits)

len(chunks_model2)

187

In [33]:
print(f"Number of chunks generated with the embedding model {model_name1}: {len(chunks_model1)}\n"
      f"Number of chunks generated with the embedding model {model_name2}: {len(chunks_model2)}")

Number of chunks generated with the embedding model sentence-transformers/all-MiniLM-L6-v2: 309
Number of chunks generated with the embedding model sentence-transformers/all-mpnet-base-v2: 187


Comparing the chunks produced with the chunk size of each embedding model

In [39]:
print(
    f"Chunk generated with the Markdown splitter: \n\n",
    md_header_splits[1].page_content,
    "\n\n\n\n",
    f"Chunk generated with the embedding model {model_name1} (chunk_size = {chunk_size1}):\n\n",
    chunks_model1[1].page_content, 
    "\n\n\n\n",
    f"Chunk generated with the embedding model {model_name2} (chunk_size = {chunk_size2}):\n\n",
    chunks_model2[1].page_content, 
      )

Chunk generated with the Markdown splitter: 

 ## I. Introducción  
La Reforma Energética es un paso decidido rumbo a la modernización del sector energético de
nuestro país, sin privatizar las empresas públicas dedicadas a la producción y al aprovechamiento de los hidrocarburos y de la electricidad. La Reforma Energética, tanto constitucional como a
nivel legistlación secundarias, surge del estudio y valoración de las distintas iniciativas presentadas por los partidos políticos representados en el Congreso. 



 Chunk generated with the embedding model sentence-transformers/all-MiniLM-L6-v2 (chunk_size = 256):

 ## I. Introducción  
La Reforma Energética es un paso decidido rumbo a la modernización del sector energético de 



 Chunk generated with the embedding model sentence-transformers/all-mpnet-base-v2 (chunk_size = 384):

 ## I. Introducción  
La Reforma Energética es un paso decidido rumbo a la modernización del sector energético de
nuestro país, sin privatizar las empresas públ

Now the data is split based on the chunk size chosen for each embedding model. The chunks will be stored in a vector DB, which is what enables semantic search. To do so, each chunk must be embedded into a vector using the embedding model that matches its chunk size.

Three key elements define an entry in a vector DB:

- ID
- Vector (its dimension is fixed by the embedding model)
- Payload

The payload contains all the metadata and the text that was embedded. This is because once the text has been encoded, the original text cannot be recovered from the vector, so it must be stored alongside it as metadata.
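
The role of the three elements can be sketched with a tiny in-memory store. Everything here is illustrative: the `Point` class and the `make_point` and `search` helpers are hypothetical and not the API of any vector DB, the 3-dimensional vectors stand in for real embeddings, and deriving the ID from the text with `uuid5` is just one possible choice.

```python
import math
import uuid
from dataclasses import dataclass, field


@dataclass
class Point:
    id: str
    vector: list[float]
    payload: dict = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of the same dimension."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def make_point(text: str, vector: list[float], metadata: dict) -> Point:
    # Deterministic ID: the same text always maps to the same UUID
    point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, text))
    # The original text travels in the payload because the vector cannot be decoded
    return Point(id=point_id, vector=vector, payload={**metadata, "data": text})


def search(points: list[Point], query_vector: list[float], top_k: int = 1) -> list[Point]:
    """Return the top_k points most similar to the query vector."""
    ranked = sorted(points, key=lambda p: cosine(p.vector, query_vector), reverse=True)
    return ranked[:top_k]


# Toy 3-dimensional vectors standing in for real embeddings
points = [
    make_point("Pemex and CFE remain public", [0.9, 0.1, 0.0], {"title": "demo"}),
    make_point("Natural gas production targets", [0.0, 0.2, 0.9], {"title": "demo"}),
]
best = search(points, query_vector=[0.1, 0.1, 0.8])[0]
print(best.payload["data"])
```

The search only compares vectors; the readable answer comes from the payload of the best match.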

In the next sections, we'll be adding metadata to the chunks created.

In [40]:
# Adding info such as the title and the date of chunking
extra_metadata = {
    "upload_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    "title": title,
    "storage_path": storage_path,
    }

# Creating a list of metadata for each chunk of each embedding model
text_metadata1 = [doc.metadata for doc in chunks_model1]
text_metadata2 = [doc.metadata for doc in chunks_model2]

Adding the extra metadata to each chunk

In [None]:
# Update each chunk's metadata in place. dict.update() returns None, so a plain
# loop is used here instead of a list comprehension.
for chunk, chunk_metadata in zip(chunks_model1, text_metadata1):

    # Add the page content to the metadata
    chunk_metadata["data"] = chunk.page_content

    # Add extra metadata
    chunk_metadata.update(extra_metadata)

for chunk, chunk_metadata in zip(chunks_model2, text_metadata2):

    # Add the page content to the metadata
    chunk_metadata["data"] = chunk.page_content

    # Add extra metadata
    chunk_metadata.update(extra_metadata)
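
Since each metadata list holds references to the Documents' own dictionaries, updating them in place also changes the Documents. If that side effect is unwanted, the merge can build new dictionaries instead. The sketch below is one way to do it; `build_payloads` is a hypothetical helper, and plain `(text, metadata)` tuples stand in for LangChain Document objects.

```python
from datetime import datetime


def build_payloads(chunks: list[tuple[str, dict]], extra_metadata: dict) -> list[dict]:
    """Merge per-chunk metadata, the chunk text and shared metadata into new dicts."""
    return [
        {**metadata, "data": text, **extra_metadata}
        for text, metadata in chunks
    ]


# (text, header metadata) pairs standing in for LangChain Document objects
demo_chunks = [
    ("## I. Introducción", {"Header 1": "Resumen Ejecutivo", "Header 2": "I. Introducción"}),
    ("# Resumen Ejecutivo", {"Header 1": "Resumen Ejecutivo"}),
]
extra = {
    # A fixed datetime keeps the example reproducible
    "upload_date": datetime(2025, 3, 27, 1, 2, 14).strftime("%Y-%m-%d %H:%M:%S"),
    "title": "resumen_reforma_energetica",
}
payloads = build_payloads(demo_chunks, extra)
print(payloads[1])
```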


Example of the metadata of a single chunk

In [14]:
text_metadata2[8]

{'Header 1': 'Resumen Ejecutivo',
 'Header 2': 'I. Introducción',
 'Header 3': 'La Reforma Energética tiene los siguientes objetivos y premisas fundamentales:',
 'data': 'la producción de gas natural de los 5 mil 700 millones de pies cúbicos diarios producidos\nactualmente a 8 mil millones en 2018 y a 10 mil 400 millones en 2025.\n4. Generar cerca de un punto porcentual más de crecimiento económico en 2018 y aproximadamente 2 puntos porcentuales más para 2025.',
 'upload_date': '2025-0327 01:02:14',
 'title': 'resumen_reforma_energetica',
 'storage_path': 'gs://rag_llm_energy_expert/documents/summaries/resumen_reforma_energetica.pdf'}

The metadata generated now contains all the necessary information to create the embeddings and store them in the vector DB. At this point, it is recommended to store these chunks. We will store them as txt files in GCS.

Each text file will contain all the chunks of one PDF, and the text inside the file will be in JSON format for simplicity.

In [41]:
# Build a dictionary that holds all the chunks of the PDF, keyed by chunk number
json_file = {f"chunk{x}": metadata for x, metadata in enumerate(text_metadata2)}

# Serialize the dictionary to a JSON string so it can be uploaded to GCS directly from memory
# instead of storing it in the local system and then uploading it to GCS
json_file_plain = json.dumps(json_file, ensure_ascii=False)


print(json_file_plain)

{"chunk0": {"Header 1": "Resumen Ejecutivo", "data": "# Resumen Ejecutivo  \n-----  \n-----", "upload_date": "2025-0328 00:29:05", "title": "resumen_reforma_energetica", "storage_path": "gs://rag_llm_energy_expert/documents/summaries/resumen_reforma_energetica.pdf"}, "chunk1": {"Header 1": "Resumen Ejecutivo", "Header 2": "I. Introducción", "data": "## I. Introducción  \nLa Reforma Energética es un paso decidido rumbo a la modernización del sector energético de\nnuestro país, sin privatizar las empresas públicas dedicadas a la producción y al aprovechamiento de los hidrocarburos y de la electricidad. La Reforma Energética, tanto constitucional como a", "upload_date": "2025-0328 00:29:05", "title": "resumen_reforma_energetica", "storage_path": "gs://rag_llm_energy_expert/documents/summaries/resumen_reforma_energetica.pdf"}, "chunk2": {"Header 1": "Resumen Ejecutivo", "Header 2": "I. Introducción", "data": "nivel legistlación secundarias, surge del estudio y valoración de las distintas i
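
A quick way to gain confidence in this format is a round trip: serialize, encode to bytes as an object store would hold them, and load the result back. The sketch below uses a small hand-written dictionary as a stand-in for the real chunks.

```python
import json

chunks_by_id = {
    "chunk0": {"Header 1": "Resumen Ejecutivo", "data": "# Resumen Ejecutivo"},
    "chunk1": {"Header 2": "I. Introducción", "data": "## I. Introducción"},
}

# ensure_ascii=False keeps "ó" readable instead of escaping it as \u00f3
as_text = json.dumps(chunks_by_id, ensure_ascii=False)
as_bytes = as_text.encode("utf-8")  # what an object store would hold

restored = json.loads(as_bytes.decode("utf-8"))
print(restored == chunks_by_id)
```

With the default `ensure_ascii=True` the data would still round-trip correctly, but the stored file would be harder to read for Spanish text.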

Now that we have a string containing all the chunks of the PDF, we will store it as a txt file inside a Cloud Storage bucket.

In [18]:
chunks_storage_path = f"chunks/{model_name2}/"
chunk_file_name = title + ".txt"
full_chunks_path = chunks_storage_path + chunk_file_name

upload_file_from_memory(blob_name = full_chunks_path, string_data = json_file_plain)

2025-03-27 01:02:58.270 | INFO     | gcp_utils.gcs:upload_file_from_memory:188 - In-memory data successfully stored in GCS bucket


At this point, we have:

- Parsed the text of a PDF
- Chunked the parsed text based on the embedding model to be used later
- Stored all the chunks in a single txt file in GCS

Next, we will read these chunks from GCS and embed them into a vector DB.

All these steps can be wrapped in a single function called parse_file()
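
The implementation of parse_file() is not shown in this notebook. As a rough idea of how such a wrapper could be organised, the hypothetical `parse_file_sketch` below chains load, parse, chunk and store, with each step passed in as a callable. The stand-in lambdas replace GCS and the PDF parser so the example runs on its own; the `chunks/` prefix is an assumption.

```python
import json
from pathlib import PurePosixPath
from typing import Callable


def parse_file_sketch(
    gcs_file_path: str,
    load: Callable[[str], bytes],
    to_markdown: Callable[[bytes], str],
    split: Callable[[str], list[dict]],
    upload: Callable[[str, str], None],
    prefix: str = "chunks",  # assumed destination folder
) -> str:
    """Run load -> parse -> chunk -> store and return the destination blob name."""
    title = PurePosixPath(gcs_file_path).stem
    markdown = to_markdown(load(gcs_file_path))
    chunks = {f"chunk{i}": chunk for i, chunk in enumerate(split(markdown))}
    blob_name = f"{prefix}/{title}.txt"
    upload(blob_name, json.dumps(chunks, ensure_ascii=False))
    return blob_name


# Stand-ins so the sketch runs without GCS or a real PDF
uploaded: dict[str, str] = {}
blob = parse_file_sketch(
    "documents/summaries/demo.pdf",
    load=lambda path: b"# Title\n\nBody",
    to_markdown=lambda raw: raw.decode("utf-8"),
    split=lambda text: [{"data": part} for part in text.split("\n\n")],
    upload=uploaded.__setitem__,
)
print(blob, json.loads(uploaded[blob]))
```

Passing the steps as arguments keeps the orchestration testable without cloud access; the real function may well hard-wire them instead.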

In [4]:
parse_file(gcs_file_path = file_to_read)

2025-04-02 23:51:14.859 | INFO     | rag_llm_energy_expert.load_document:parse_file:37 - Parsing file...
2025-04-02 23:51:15.882 | INFO     | rag_llm_energy_expert.parsers_auxiliars:extract_pdf_content:40 - Loading file from GCS...
2025-04-02 23:51:16.516 | INFO     | rag_llm_energy_expert.parsers_auxiliars:extract_pdf_content:45 - Extracting PDF content...
2025-04-02 23:51:16.516 | INFO     | rag_llm_energy_expert.parsers_auxiliars:extract_pdf_content:49 - Converting to markdown format...
2025-04-02 23:51:35.426 | INFO     | rag_llm_energy_expert.parsers_auxiliars:extract_pdf_content:63 - PDF content successfully extracted
2025-04-02 23:51:35.426 | INFO     | rag_llm_energy_expert.parsers_auxiliars:

## Embeddings

Once the PDF text has been parsed, chunked, and stored in GCS, those chunks are embedded and stored in a vector database. I will be using [*Qdrant DB*](https://qdrant.tech/).

In [None]:
# Load the chunks stored earlier in GCS; json.loads accepts the raw bytes
chunks = json.loads(get_file(full_chunks_path))

In [57]:
text_to_embed1 = [{"vector": doc.page_content, "metadata": text_metadata1[i]} for i, doc in enumerate(chunks_model1)]
text_to_embed2 = [{"vector": doc.page_content, "metadata": text_metadata2[i]} for i, doc in enumerate(chunks_model2)]

In [58]:
text_to_embed1[8]

{'vector': '5. Atraer mayor inversión al sector energético mexicano para impulsar el desarrollo del país.\n6. Contar con un mayor abasto de energéticos a mejores precios.\n7. Garantizar estándares internacionales de eficiencia, calidad y confiabilidad de suministro',
 'metadata': {'Header 1': 'Resumen Ejecutivo',
  'Header 2': 'I. Introducción',
  'Header 3': 'La Reforma Energética tiene los siguientes objetivos y premisas fundamentales:',
  'data': '5. Atraer mayor inversión al sector energético mexicano para impulsar el desarrollo del país.\n6. Contar con un mayor abasto de energéticos a mejores precios.\n7. Garantizar estándares internacionales de eficiencia, calidad y confiabilidad de suministro',
  'upload_date': '2025-0327 00:04:17',
  'title': 'resumen_reforma_energetica',
  'storage_path': 'gs://rag_llm_energy_expert/documents/summaries/resumen_reforma_energetica.pdf'}}

In [59]:
text_to_embed2[8]

{'vector': 'la producción de gas natural de los 5 mil 700 millones de pies cúbicos diarios producidos\nactualmente a 8 mil millones en 2018 y a 10 mil 400 millones en 2025.\n4. Generar cerca de un punto porcentual más de crecimiento económico en 2018 y aproximadamente 2 puntos porcentuales más para 2025.',
 'metadata': {'Header 1': 'Resumen Ejecutivo',
  'Header 2': 'I. Introducción',
  'Header 3': 'La Reforma Energética tiene los siguientes objetivos y premisas fundamentales:',
  'data': 'la producción de gas natural de los 5 mil 700 millones de pies cúbicos diarios producidos\nactualmente a 8 mil millones en 2018 y a 10 mil 400 millones en 2025.\n4. Generar cerca de un punto porcentual más de crecimiento económico en 2018 y aproximadamente 2 puntos porcentuales más para 2025.',
  'upload_date': '2025-0327 00:04:17',
  'title': 'resumen_reforma_energetica',
  'storage_path': 'gs://rag_llm_energy_expert/documents/summaries/resumen_reforma_energetica.pdf'}}

In [60]:
model_name1 = 'sentence-transformers/all-MiniLM-L6-v2'
model_name2 = "sentence-transformers/all-mpnet-base-v2"

model1 = SentenceTransformer(model_name1)
model2 = SentenceTransformer(model_name2)

In [61]:
for chunk in text_to_embed1:
    chunk["vector"] = model1.encode(chunk["vector"])


for chunk in text_to_embed2:
    chunk["vector"] = model2.encode(chunk["vector"])

In [62]:
print(f"Vector dimension of model 1: {len(text_to_embed1[0]['vector'])}\nVector dimension of model 2: {len(text_to_embed2[0]['vector'])}")

Vector dimension of model 1: 384
Vector dimension of model 2: 768


In [67]:
text_to_embed2[0]

{'vector': array([-4.50719111e-02,  2.45857611e-02, -1.83197726e-02, -1.08759720e-02,
         3.57616059e-02, -1.00919586e-02, -8.56327266e-02,  2.42449772e-02,
        -1.79689508e-02,  1.07474606e-02,  4.62834090e-02,  7.23759532e-02,
        -3.49415164e-03, -5.37851918e-03, -3.33054066e-02, -4.98561887e-03,
        -1.91971455e-02,  2.18404345e-02, -1.91437509e-02, -3.13658416e-02,
        -1.96142457e-02,  4.73062210e-02,  2.05316441e-03,  4.13896777e-02,
         4.60003316e-02,  2.87143271e-02,  3.63672450e-02, -2.38249078e-02,
        -2.58519575e-02, -9.85185057e-03, -4.75573307e-03,  1.26582887e-02,
         1.15413424e-02, -3.64385694e-02,  1.82938800e-06, -1.97155681e-03,
        -1.24175558e-02, -9.08809155e-03, -2.71627568e-02, -3.19136307e-02,
        -2.44377013e-02, -1.02641284e-01,  2.00936142e-02, -1.01476759e-02,
        -1.96881853e-02,  3.15234326e-02,  3.10123432e-02, -8.28425288e-02,
         4.07538284e-03, -5.63355535e-03, -1.83333410e-03, -4.03550863e-02,
  

In [None]:
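# Restore the original text of the first chunk in place of its embedding
text_to_embed2[0]["vector"] = text_to_embed2[0]["metadata"]["data"]

# Key every chunk of model 1 by its position
files_to_store = {f"chunk{i}": chunk for i, chunk in enumerate(text_to_embed1)}

# Pretty-print the metadata of the first chunk
a = json.dumps(text_to_embed2[0]["metadata"], indent = 4)
print(a)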

{
    "Header 1": "Resumen Ejecutivo",
    "data": "# Resumen Ejecutivo  \n-----  \n-----",
    "upload_date": "2025-0327 00:04:17",
    "title": "resumen_reforma_energetica",
    "storage_path": "gs://rag_llm_energy_expert/documents/summaries/resumen_reforma_energetica.pdf"
}


In [None]:
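for chunk in text_to_embed2:
    # NumPy arrays are not JSON serializable; default= converts them to lists
    chunk_json = json.dumps(chunk, default=lambda obj: obj.tolist())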

{'vector': array([-8.76180008e-02,  7.05834106e-02,  1.35683194e-02, -7.78957643e-03,
        -3.25322412e-02,  6.38664141e-02,  2.92839911e-02,  7.83082992e-02,
        -2.15472933e-02, -4.08983268e-02,  4.60199900e-02, -1.40162259e-02,
        -1.10061623e-01, -2.49743294e-02, -5.09077013e-02, -5.50034456e-02,
        -4.99976687e-02,  3.19697075e-02,  2.60699894e-02,  4.11323532e-02,
         6.48808479e-02,  2.59540398e-02,  6.19154284e-03,  9.08319897e-04,
        -2.98400559e-02, -1.79804917e-02,  2.11581029e-02, -7.03359954e-03,
         1.13623263e-02, -7.43363351e-02,  3.64518836e-02, -1.02943201e-02,
         4.71359342e-02, -1.36588328e-02,  7.50755658e-03,  6.61125109e-02,
        -5.64336888e-02,  8.60800222e-03,  7.55290361e-03,  2.79205069e-02,
        -8.37179720e-02, -1.47666976e-01, -8.52674395e-02, -6.19034693e-02,
         1.21373180e-02, -7.65783712e-02, -5.16218171e-02,  6.28516376e-02,
        -4.22569253e-02,  1.71450805e-02, -9.01678950e-02, -9.44447070e-02,
  