# RAG LLM: Parsing - Chunking - Embedding - Indexing files

This notebook shows how to upload a PDF document into a vector DB. The process consists of 4 main steps:

- **Parsing** is the process of extracting raw text from sources such as PDFs, .docx files, YouTube videos and so on. The approach depends on the type of data you want to parse. For this RAG LLM, only PDFs are parsed.

- **Chunking** is the process of splitting the parsed text into small chunks, which are then embedded.

- **Embedding** is the process of converting each text chunk into a dense vector.

- **Indexing** is the process of inserting the embeddings into a vector database collection.

## Load libraries

In [18]:
import pymupdf
from langchain_text_splitters import TokenTextSplitter 
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer
from qdrant_client.models import PointStruct
import numpy as np
import uuid
from datetime import datetime

import sys

sys.path.append("..")

from utils.gcp.gcs import get_file
from utils.vector_db.qdrant import create_collection, update_points
from rag_llm_energy_expert.config import GCP_CONFIG, QDRANT_CONFIG

## Initialize config classes

In [19]:
gcp_config = GCP_CONFIG()
qdrant_config = QDRANT_CONFIG()

## Parsing PDFs

There are many libraries for extracting data from PDFs. Among them, [*PyMuPDF4LLM*](https://pypi.org/project/pymupdf4llm/), which builds on PyMuPDF, is one of the best options because:

- Detects standard text and tables
- Header lines are identified via the font size and appropriately prefixed with one or more '#' tags.
- Bold, italic, mono-spaced text and code blocks are detected and formatted accordingly.
- By default, all document pages are processed.
- Support for pages with multiple text columns.
- Support for images and vector graphics on a page, which are stored as images.
- ***Support for page chunks***. Instead of returning one large string for the whole document, a list of dictionaries can be generated. One for each page.

*All the data parsed here comes from GCP*

Reading the PDF directly into memory is faster than downloading it to the file system and then reading it from there. It is also useful when no persistent storage is available or when you want to work directly with the file's bytes.
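As a minimal sketch of working with in-memory bytes, the snippet below checks that a buffer starts with the `%PDF-` header and wraps it in `io.BytesIO` so it can be read like a file. The helper name `looks_like_pdf` and the sample bytes are hypothetical and are not part of this project.

```python
import io


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the buffer begins with the '%PDF-' header."""
    # Tolerate leading whitespace, but only inspect the start of the buffer
    return data[:1024].lstrip().startswith(b"%PDF-")


# Stand-in bytes; in the notebook these would be the bytes downloaded from GCS
fake_pdf_bytes = b"%PDF-1.7\n% the rest of the file would follow here"

# BytesIO exposes in-memory bytes through the usual file interface,
# so nothing has to be written to disk
buffer = io.BytesIO(fake_pdf_bytes)
header = buffer.read(8)

print(looks_like_pdf(fake_pdf_bytes))
print(header)
```

A check like this can catch an HTML error page or an empty download before the parser is invoked.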

### Extracting data from PDF

First, set the path to the PDF stored in GCS

In [20]:
file_to_read = f"gs://{gcp_config.BUCKET_NAME}/documents/summaries/resumen_reforma_energetica.pdf"

#local path
#file_to_read = "../data/Mexico_energy_profile.pdf"

Next, derive some values from the path, such as the title and the storage path, which will be used later as metadata for each chunk. Any other data can be added to the metadata as well.

In [21]:
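# Get the name / title of the file (the file name without its extension)
title = file_to_read.split("/")[-1].split(".")[0]

# Drop the leading "gs://" prefix
useful_gcs_path = file_to_read[5:]

# Split into bucket name and blob name at the first "/"
pdf_path_parts = useful_gcs_path.split("/", maxsplit=1)

bucket_name = pdf_path_parts[0]
blob_name = pdf_path_parts[1]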
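The same path handling can be wrapped in a small helper. This is only a sketch: `split_gcs_uri` and the bucket name `my-bucket` are hypothetical. Unlike `split(".")[0]`, `PurePosixPath.stem` drops only the final extension, so a name such as `report.v2.pdf` keeps its inner dot.

```python
from pathlib import PurePosixPath


def split_gcs_uri(uri: str) -> tuple[str, str, str]:
    """Split a gs:// URI into (bucket_name, blob_name, title)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not a GCS URI: {uri!r}")
    # Everything before the first "/" is the bucket, the rest is the blob name
    bucket_name, _, blob_name = uri[len(prefix):].partition("/")
    if not bucket_name or not blob_name:
        raise ValueError(f"Missing bucket or blob name: {uri!r}")
    # stem removes only the last extension ("report.v2.pdf" -> "report.v2")
    title = PurePosixPath(blob_name).stem
    return bucket_name, blob_name, title


# Hypothetical bucket name, used only for illustration
example = split_gcs_uri(
    "gs://my-bucket/documents/summaries/resumen_reforma_energetica.pdf"
)
print(example)
```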

Download the PDF into memory as bytes

In [22]:
#Loads in memory a pdf stored in GCS
pdf_bytes = get_file(gcs_file_path=blob_name, bucket_name = bucket_name)

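Read the PDF using pymupdf, which creates a Document. Each Document is made up of Pages.

Then extract the plain text of every page with `get_page_text` and join the pages with newlines, which returns a single string with the text of the whole document. Note that these cells use PyMuPDF's plain-text extraction rather than pymupdf4llm's Markdown conversion.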

In [23]:
# Create a Document object; it can be constructed from a file or from memory.
# pymupdf.Document() is exactly the same as pymupdf.open()
doc = pymupdf.Document(stream = pdf_bytes)

# When reading from a local file:
#doc = pymupdf.Document(file_to_read)

In [24]:
pdf_text = '\n'.join([doc.get_page_text(pno = page_num) for page_num in range(len(doc))])

pdf_text

' Resumen Ejecutivo\n\n\n3\nI. Introducción\nLa Reforma Energética es un paso decidido rumbo a la modernización del sector energético de \nnuestro país, sin privatizar las empresas públicas dedicadas a la producción y al aprovechamien-\nto de los hidrocarburos y de la electricidad. La Reforma Energética, tanto constitucional como a \nnivel legistlación secundarias, surge del estudio y valoración de las distintas iniciativas presenta-\ndas por los partidos políticos representados en el Congreso.\nLa Reforma Energética tiene los siguientes objetivos y premisas fundamentales:\n1.\t\nMantener la propiedad de la Nación sobre los hidrocarburos que se encuentran en el sub-\nsuelo.\n2.\t\nModernizar y fortalecer, sin privatizar, a Petróleos Mexicanos (Pemex) y a la Comisión Fe-\nderal de Electricidad (CFE) como Empresas Productivas del Estado, 100% públicas y 100% \nmexicanas.\n3.\t\nReducir la exposición del país a los riesgos financieros, geológicos y ambientales en las ac-\ntividades de exp
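The extracted text contains words split across lines by the PDF layout (for example `aprovechamien-\nto`). One possible cleanup, shown here only as a sketch, is to rejoin a hyphenated line break when the next line starts with a lowercase letter. The helper `join_hyphenated_breaks` is hypothetical and this step is not applied in this notebook. It would also join a genuine hyphenated compound that happens to break at its hyphen.

```python
import re

# A hyphen after a word character, then a line break, then a lowercase letter
# (including accented Spanish letters) on the next line
_HYPHEN_BREAK = re.compile(r"(?<=\w)-[ \t]*\n[ \t]*(?=[a-záéíóúüñ])")


def join_hyphenated_breaks(text: str) -> str:
    """Rejoin words that were hyphenated across a line break."""
    return _HYPHEN_BREAK.sub("", text)


sample = "la producción y al aprovechamien-\nto de los hidrocarburos"
cleaned = join_hyphenated_breaks(sample)
print(cleaned)
```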

## Chunking

**Chunking depends on how many tokens an embedding model supports**

There's a [Hugging Face leaderboard](https://huggingface.co/spaces/mteb/leaderboard) that compares the performance of different embedding models. Some metrics to focus on are:

- Number of Parameters: 

    A higher value means the model requires more CPU/GPU memory to run

- Embedding Dimension:

    The dimension of the vectors produced

- Max tokens:

    The maximum number of tokens the model can process per input. A higher limit allows larger chunks.
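To make the memory implications of these metrics concrete, here is a rough back-of-the-envelope sketch. It assumes 4 bytes per value (float32) and counts only the raw numbers. It ignores activations, runtime overhead and any index structures, so real usage will be higher. Both function names are hypothetical.

```python
def weights_size_bytes(n_parameters: int, bytes_per_parameter: int = 4) -> int:
    """Approximate size of the model weights alone."""
    return n_parameters * bytes_per_parameter


def vectors_size_bytes(n_vectors: int, dimension: int, bytes_per_value: int = 4) -> int:
    """Approximate size of the raw embeddings stored for a document."""
    return n_vectors * dimension * bytes_per_value


# A 100M-parameter model in float32 needs roughly 400 MB just for its weights
print(weights_size_bytes(100_000_000) / 1e6, "MB")

# 38 vectors of dimension 768 (figures that appear later in this notebook)
print(vectors_size_bytes(38, 768), "bytes")
```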


This notebook uses the model configured in `qdrant_config.EMBEDDING_MODEL`, which is [*nomic-ai/nomic-embed-text-v2-moe*](https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe) in the run shown here:

- Number of Parameters: 475M in total (about 305M active per input, since it is a mixture-of-experts model)
- Embedding Dimension: 768
- Max tokens: 512



[*LangChain*](https://python.langchain.com/api_reference/text_splitters/base/langchain_text_splitters.base.TextSplitter.html#textsplitter) contains many *text splitters*. Most of them measure the text in characters, whereas an *embedding model* limits its input in *tokens*. To split by tokens instead, we can use the LangChain [*TokenTextSplitter*](https://python.langchain.com/api_reference/text_splitters/base/langchain_text_splitters.base.TokenTextSplitter.html) class.


The *TokenTextSplitter* object splits the text by token count. We can pass a [*Hugging Face tokenizer*](https://huggingface.co/docs/transformers/en/main_classes/tokenizer) to it as an input parameter. A *tokenizer* is in charge of preparing the inputs for a model, and the `transformers` library can load the tokenizer that matches a given embedding model.
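The idea behind token-based splitting with overlap can be sketched in a few lines. This is an illustration only, not LangChain's implementation: it uses whitespace-separated words as stand-in "tokens", and `split_by_tokens` is a hypothetical helper.

```python
def split_by_tokens(
    tokens: list[str], chunk_size: int, chunk_overlap: int
) -> list[list[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the token list
        if start + chunk_size >= len(tokens):
            break
    return windows


# Toy tokenizer: whitespace split (a real model uses subword tokens)
words = "la reforma energetica es un paso decidido rumbo a la modernizacion".split()
windows = split_by_tokens(words, chunk_size=5, chunk_overlap=2)

# "Decode" each window back into text
toy_chunks = [" ".join(window) for window in windows]
for chunk in toy_chunks:
    print(chunk)
```

Each window starts with the last `chunk_overlap` tokens of the previous one, which preserves some context across chunk boundaries.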

In [25]:
qdrant_config.EMBEDDING_MODEL

'nomic-ai/nomic-embed-text-v2-moe'

In [26]:
# Initialize the model using the sentence-transformers library
model = SentenceTransformer(qdrant_config.EMBEDDING_MODEL, trust_remote_code=True)

In [27]:
# Initialize an AutoTokenizer instance; this loads the tokenizer that matches
# the embedding model used
tokenizer = AutoTokenizer.from_pretrained(qdrant_config.EMBEDDING_MODEL)

# Create a TokenTextSplitter instance; it tokenizes the text, splits it based
# on the max tokens supported by the embedding model, and then converts the
# tokens back into text, now split into chunks.
splitter = TokenTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    chunk_size=model.max_seq_length,
    chunk_overlap = qdrant_config.CHUNK_OVERLAP,
)

In [28]:
chunks = splitter.split_text(pdf_text)

In [29]:
chunks[:3]

[' Resumen Ejecutivo\n\n\n3\nI. Introducción\nLa Reforma Energética es un paso decidido rumbo a la modernización del sector energético de \nnuestro país, sin privatizar las empresas públicas dedicadas a la producción y al aprovechamien-\nto de los hidrocarburos y de la electricidad. La Reforma Energética, tanto constitucional como a \nnivel legistlación secundarias, surge del estudio y valoración de las distintas iniciativas presenta-\ndas por los partidos políticos representados en el Congreso.\nLa Reforma Energética tiene los siguientes objetivos y premisas fundamentales:\n1.\t\nMantener la propiedad de la Nación sobre los hidrocarburos que se encuentran en el sub-\nsuelo.\n2.\t\nModernizar y fortalecer, sin privatizar, a Petróleos Mexicanos (Pemex) y a la Comisión Fe-\nderal de Electricidad (CFE) como Empresas Productivas del Estado, 100% públicas y 100% \nmexicanas.\n3.\t\nReducir la exposición del país a los riesgos financieros, geológicos y ambientales en las ac-\ntividades de ex

In [30]:
chunks_sizes = [len(chunk) for chunk in chunks]

print(f"Number of chunks: {len(chunks)} \n"
      f"mean chunk size (in characters): {np.mean(chunks_sizes):.4f}\n"
      f"max chunk size (in characters): {max(chunks_sizes)}"
      )

Number of chunks: 38 
mean chunk size (in characters): 1436.8421
max chunk size (in characters): 1523
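The statistics above are measured in characters, while the model's limit is expressed in tokens. As a sketch, the same report can be computed in tokens by passing a counting function. `chunk_token_report` and `count_tokens` are hypothetical names. In this notebook one could pass, for example, `lambda text: len(tokenizer.encode(text))`. The demo below uses a whitespace counter as a stand-in.

```python
from statistics import mean
from typing import Callable


def chunk_token_report(
    chunks: list[str], count_tokens: Callable[[str], int], max_tokens: int
) -> dict:
    """Summarize chunk sizes in tokens and list chunks that exceed the limit."""
    counts = [count_tokens(chunk) for chunk in chunks]
    return {
        "n_chunks": len(counts),
        "mean_tokens": mean(counts) if counts else 0.0,
        "max_tokens_seen": max(counts, default=0),
        # Indices of chunks the model would have to truncate
        "over_limit": [i for i, n in enumerate(counts) if n > max_tokens],
    }


# Stand-in counter: whitespace-separated words, NOT the model's tokenizer
report = chunk_token_report(
    ["a b c", "a b c d e f"],
    count_tokens=lambda text: len(text.split()),
    max_tokens=4,
)
print(report)
```

An empty `over_limit` list indicates that every chunk fits within the model's input window when counted with that tokenizer.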


## Embedding

The text is now split according to the chunk size supported by the embedding model. A vector DB is where the chunks are stored to enable semantic search. To do so, each chunk must be embedded into a vector using the same embedding model that determined the chunk size.

There are 3 key elements that define a point in a vector DB:

- ID
- Vector (with a fixed number of dimensions)
- Payload

The payload contains all the metadata and the text that was embedded. Once the text has been encoded, the original text cannot be recovered from the vector, so it must be stored alongside the vector in the payload.

In the next cells, we'll add metadata to the chunks created.

In [31]:
# Adding info such as title, and the date of chunking
extra_metadata = {
    "upload_date": datetime.now().strftime(r"%Y-%m-%d"),
    "title": title,
    "storage_path": file_to_read,
    }

Adding the extra metadata to each chunk

In [32]:
# Build one dictionary per chunk, pairing the chunk text with the extra metadata
final_chunks = [{"text": chunk_text, "metadata": extra_metadata,} for chunk_text in chunks]

final_chunks[:2]

[{'text': ' Resumen Ejecutivo\n\n\n3\nI. Introducción\nLa Reforma Energética es un paso decidido rumbo a la modernización del sector energético de \nnuestro país, sin privatizar las empresas públicas dedicadas a la producción y al aprovechamien-\nto de los hidrocarburos y de la electricidad. La Reforma Energética, tanto constitucional como a \nnivel legistlación secundarias, surge del estudio y valoración de las distintas iniciativas presenta-\ndas por los partidos políticos representados en el Congreso.\nLa Reforma Energética tiene los siguientes objetivos y premisas fundamentales:\n1.\t\nMantener la propiedad de la Nación sobre los hidrocarburos que se encuentran en el sub-\nsuelo.\n2.\t\nModernizar y fortalecer, sin privatizar, a Petróleos Mexicanos (Pemex) y a la Comisión Fe-\nderal de Electricidad (CFE) como Empresas Productivas del Estado, 100% públicas y 100% \nmexicanas.\n3.\t\nReducir la exposición del país a los riesgos financieros, geológicos y ambientales en las ac-\ntivida
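In the cell above, every chunk points to the same `extra_metadata` dictionary, which is fine as long as it is never modified afterwards. A possible variation, sketched here under that caveat, gives each chunk its own copy and records its position in the document. The helper `build_payloads` and the `chunk_index` field are hypothetical additions, not something the project requires.

```python
def build_payloads(chunks: list[str], base_metadata: dict) -> list[dict]:
    """Pair each chunk with its own copy of the metadata plus its position."""
    return [
        # {**base_metadata, ...} creates a new dict for every chunk
        {"text": text, "metadata": {**base_metadata, "chunk_index": index}}
        for index, text in enumerate(chunks)
    ]


# Stand-in values for illustration
base = {"title": "resumen_reforma_energetica", "upload_date": "2025-04-09"}
payloads = build_payloads(["first chunk", "second chunk"], base)

print(payloads[1]["metadata"])
```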

Now we will use batch embedding to create the vectors for all chunks, because it is faster than creating them one by one.

In [33]:
# Creating a list of chunk's data to be embedded into vectors
chunk_text = [chunk_data["text"] for chunk_data in final_chunks]

vectors = model.encode(chunk_text)
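Here `model.encode` receives the whole list and handles batching internally (it accepts a `batch_size` argument). The sketch below only illustrates what batching means. `iter_batches`, `embed_in_batches` and the toy encoder are hypothetical, and the toy encoder is not a real embedding model.

```python
from typing import Callable, Iterator, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_in_batches(
    texts: Sequence[str],
    encode_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 32,
) -> list[list[float]]:
    """Encode texts batch by batch, keeping the original order."""
    vectors: list[list[float]] = []
    for batch in iter_batches(texts, batch_size):
        vectors.extend(encode_batch(batch))
    return vectors


def toy_encode_batch(batch: list[str]) -> list[list[float]]:
    # Stand-in encoder: [number of characters, number of spaces]
    return [[float(len(text)), float(text.count(" "))] for text in batch]


toy_vectors = embed_in_batches(
    ["uno", "dos tres", "cuatro cinco seis"], toy_encode_batch, batch_size=2
)
print(toy_vectors)
```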

Now we will create a PointStruct object for each chunk.

Each PointStruct has:

- id: UUID string or integer
- vector: the embedding data
- payload: all the metadata associated with the vector

In [34]:
final_chunks[0]

{'text': ' Resumen Ejecutivo\n\n\n3\nI. Introducción\nLa Reforma Energética es un paso decidido rumbo a la modernización del sector energético de \nnuestro país, sin privatizar las empresas públicas dedicadas a la producción y al aprovechamien-\nto de los hidrocarburos y de la electricidad. La Reforma Energética, tanto constitucional como a \nnivel legistlación secundarias, surge del estudio y valoración de las distintas iniciativas presenta-\ndas por los partidos políticos representados en el Congreso.\nLa Reforma Energética tiene los siguientes objetivos y premisas fundamentales:\n1.\t\nMantener la propiedad de la Nación sobre los hidrocarburos que se encuentran en el sub-\nsuelo.\n2.\t\nModernizar y fortalecer, sin privatizar, a Petróleos Mexicanos (Pemex) y a la Comisión Fe-\nderal de Electricidad (CFE) como Empresas Productivas del Estado, 100% públicas y 100% \nmexicanas.\n3.\t\nReducir la exposición del país a los riesgos financieros, geológicos y ambientales en las ac-\ntividad

In [35]:
points = [PointStruct(
            id = str(uuid.uuid4()),
            vector = vectors[x],
            payload = final_chunks[x],
        )
        for x in range(len(final_chunks))]
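The cell above assigns a random `uuid4` to each point, so the IDs change on every run. An alternative, shown only as a sketch, derives the ID from the document path and the chunk position with `uuid.uuid5`, so the same chunk always gets the same ID. The helper `chunk_point_id` and the ID naming scheme are assumptions, not part of this project.

```python
import uuid


def chunk_point_id(storage_path: str, chunk_index: int) -> str:
    """Deterministic UUID string for a chunk of a given document."""
    # uuid5 hashes the namespace and name, so equal inputs give equal IDs
    name = f"{storage_path}#chunk-{chunk_index}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, name))


# Hypothetical storage path, used only for illustration
path = "gs://my-bucket/documents/summaries/resumen_reforma_energetica.pdf"
first_id = chunk_point_id(path, 0)
second_id = chunk_point_id(path, 1)

print(first_id)
print(second_id)
```

With stable IDs, uploading the same document again overwrites the existing points instead of creating new ones, provided the chunking itself is unchanged.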

In [36]:
points[0].payload

{'text': ' Resumen Ejecutivo\n\n\n3\nI. Introducción\nLa Reforma Energética es un paso decidido rumbo a la modernización del sector energético de \nnuestro país, sin privatizar las empresas públicas dedicadas a la producción y al aprovechamien-\nto de los hidrocarburos y de la electricidad. La Reforma Energética, tanto constitucional como a \nnivel legistlación secundarias, surge del estudio y valoración de las distintas iniciativas presenta-\ndas por los partidos políticos representados en el Congreso.\nLa Reforma Energética tiene los siguientes objetivos y premisas fundamentales:\n1.\t\nMantener la propiedad de la Nación sobre los hidrocarburos que se encuentran en el sub-\nsuelo.\n2.\t\nModernizar y fortalecer, sin privatizar, a Petróleos Mexicanos (Pemex) y a la Comisión Fe-\nderal de Electricidad (CFE) como Empresas Productivas del Estado, 100% públicas y 100% \nmexicanas.\n3.\t\nReducir la exposición del país a los riesgos financieros, geológicos y ambientales en las ac-\ntividad

## Indexing

In this last step, all the generated embeddings are stored in a vector DB collection.
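To give an intuition of what a collection does with the stored points, here is a toy in-memory version that keeps an ID, a vector and a payload per point and ranks them by cosine similarity. It is a conceptual sketch only: `ToyCollection` is hypothetical and does not reflect how Qdrant or this project's helpers are implemented.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyCollection:
    """Minimal stand-in for a vector DB collection."""

    def __init__(self, vector_size: int) -> None:
        self.vector_size = vector_size
        self.points: dict[str, tuple[list[float], dict]] = {}

    def upsert(self, point_id: str, vector: list[float], payload: dict) -> None:
        # Every vector in a collection must have the same dimension
        if len(vector) != self.vector_size:
            raise ValueError("vector has the wrong dimension")
        # Reusing an ID overwrites the previous point
        self.points[point_id] = (list(vector), payload)

    def search(self, query: list[float], limit: int = 3) -> list[tuple[float, str, dict]]:
        scored = [
            (cosine_similarity(query, vector), point_id, payload)
            for point_id, (vector, payload) in self.points.items()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:limit]


collection = ToyCollection(vector_size=2)
collection.upsert("a", [1.0, 0.0], {"text": "chunk about oil"})
collection.upsert("b", [0.0, 1.0], {"text": "chunk about electricity"})

best = collection.search([0.9, 0.1], limit=1)[0]
print(best[1], best[2]["text"])
```

The payload is what makes the search result useful: the vector finds the closest point, and the stored text is what gets returned to the LLM.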

In [37]:
# Build the collection name from its base name and version
collection_name=qdrant_config.COLLECTION_NAME + qdrant_config.COLLECTION_VERSION

In [38]:
# Creating the collection is optional if it already exists
create_collection(
    collection_name=collection_name, 
    vector_size=model[1].word_embedding_dimension, # Length of the vector
    )

# update_points can both update an existing document and upload a new one
update_points(
    collection_name=collection_name,
    points=points,
)

2025-04-09 23:51:55.791 | INFO     | utils.vector_db.qdrant:create_collection:200 - Creating collection...
2025-04-09 23:51:56.035 | INFO     | utils.vector_db.qdrant:create_collection:212 - The collection already exists
2025-04-09 23:51:56.036 | INFO     | utils.vector_db.qdrant:update_points:160 - Updating points...
2025-04-09 23:51:56.198 | INFO     | utils.vector_db.qdrant:delete_document:85 - Deleting previous points in the Vector DB related to the document...
2025-04-09 23:51:56.443 | INFO     | utils.vector_db.qdrant:delete_document:105 - Document resumen_reforma_energetica deleted
2025-04-09 23:51:56.443 | INFO     | utils.vector_db.qdrant:upload_points:119 - Uploading points