# Embedding Chunks

With the chunks created and stored during the parsing step of the RAG pipeline, it is now time to embed each chunk into a vector using an embedding model and then store those vectors in a vector DB.

## Import libraries

In [1]:
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue
from sentence_transformers import SentenceTransformer
import json
import sys
import uuid

sys.path.append("..")

from utils.gcp.gcs import get_file
from rag_llm_energy_expert.config import GCP_CONFIG
from rag_llm_energy_expert.credentials import get_qdrant_config
from rag_llm_energy_expert.ingest_document import upload_document


## Embeddings

Once the PDF text has been parsed, chunked, and stored in GCS, I'll embed those chunks and store the resulting vectors in a vector database; I will be using [*Qdrant DB*](https://qdrant.tech/). The first step is to choose the embedding model and load the chunks stored in GCS.

This time, we'll use the [*sentence-transformers/all-MiniLM-L6-v2*](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model:

- Number of Parameters: 22.7M (small)
- Embedding Dimension: 384
- Max tokens: 256

In [2]:
# load the qdrant config values
qdrant_config = get_qdrant_config()
gcp_config = GCP_CONFIG()

In [6]:
model_name = "sentence-transformers/all-MiniLM-L6-v2"

# Name of the txt file where the chunks are stored
chunks_file = "chunks/resumen_reforma_energetica_256_2025-04-06.txt"

In [7]:
chunks = json.loads(get_file(chunks_file, bucket_name = gcp_config.BUCKET_NAME))

In [8]:
len(chunks)

309

Then, we will load the embedding model from sentence-transformers.

In [9]:
model = SentenceTransformer(model_name)

Create a vector for each chunk

In [10]:
for chunk_id, chunk_data in chunks.items():

    # Create the vector id
    vector_id = str(uuid.uuid4())

    chunk_data["vector_id"] = vector_id

    # Embed the data into a vector
    chunk_data["vector"] = model.encode(chunk_data["text"])
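
The loop above encodes one chunk at a time. `SentenceTransformer.encode` also accepts a list of strings (and batches them internally through its `batch_size` parameter), so the texts can be embedded in groups instead. Below is a minimal sketch of that idea, assuming `encode_fn` is any callable that maps a list of strings to one vector per string, such as `model.encode`. The names `embed_in_batches`, `encode_fn`, and `batch_size` are hypothetical and not part of the project.

```python
import uuid
from typing import Callable, Dict, List, Sequence


def embed_in_batches(
    chunks: Dict[str, dict],
    encode_fn: Callable[[List[str]], Sequence],
    batch_size: int = 32,
) -> None:
    """Attach a `vector_id` and a `vector` to every chunk, in place."""
    chunk_list = list(chunks.values())

    for start in range(0, len(chunk_list), batch_size):
        batch = chunk_list[start:start + batch_size]

        # One call to the encoder per batch instead of one per chunk
        vectors = encode_fn([chunk_data["text"] for chunk_data in batch])

        for chunk_data, vector in zip(batch, vectors):
            chunk_data["vector_id"] = str(uuid.uuid4())
            chunk_data["vector"] = vector
```

With the objects defined in this notebook, the call would look like `embed_in_batches(chunks, model.encode)`.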

Create a list with each chunk's data

In [11]:
chunks_data = [chunks[f"chunk{x}"] for x in range(len(chunks))]
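
The list comprehension above assumes the keys run from `chunk0` to `chunk{N-1}` with no gaps. If that is not guaranteed, the keys could be ordered by their numeric suffix instead. A small sketch, assuming keys follow the `chunk<N>` pattern used above; `ordered_chunks` is a hypothetical helper:

```python
import re


def ordered_chunks(chunks: dict) -> list:
    """Return the chunk values sorted by the number in their `chunk<N>` key."""

    def index_of(key: str) -> int:
        match = re.fullmatch(r"chunk(\d+)", key)
        if match is None:
            raise ValueError(f"Unexpected chunk key: {key!r}")
        return int(match.group(1))

    return [chunks[key] for key in sorted(chunks, key=index_of)]
```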

In [12]:
chunks_data[0]

{'Header 1': 'Resumen Ejecutivo',
 'data': '# Resumen Ejecutivo  \n-----  \n-----',
 'upload_date': '2025-04-06',
 'text': '# Resumen Ejecutivo\n\n\n-----\n\n-----\n\n## I. Introducción\n\nLa Reforma Energética es un paso decidido rumbo a la modernización del sector energético de\nnuestro país, sin privatizar las empresas públicas dedicadas a la producción y al aprovechamiento de los hidrocarburos y de la electricidad. La Reforma Energética, tanto constitucional como a\nnivel legistlación secundarias, surge del estudio y valoración de las distintas iniciativas presentadas por los partidos políticos representados en el Congreso.\n\n### La Reforma Energética tiene los siguientes objetivos y premisas fundamentales:\n\n1. Mantener la propiedad de la Nación sobre los hidrocarburos que se encuentran en el subsuelo.\n2. Modernizar y fortalecer, sin privatizar, a Petróleos Mexicanos (Pemex) y a la Comisión Federal de Electricidad (CFE) como Empresas Productivas del Estado, 100% públicas y 100%

In [13]:
vector_size = len(chunks_data[0]["vector"])

vector_size

384

Connect to the Qdrant Vector DB

In [14]:
qdrant_client = QdrantClient(url = qdrant_config.URL, api_key = qdrant_config.API_KEY.get_secret_value())

Create a new collection

In [15]:
collection_name = "energy_expert_v1"

In [16]:
if not qdrant_client.collection_exists(collection_name):
    qdrant_client.create_collection(
        collection_name = collection_name,
        vectors_config = VectorParams(size = vector_size, distance = Distance.DOT),
    )

else:
    print("Collection already exists")


Collection already exists
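
The collection uses `Distance.DOT`. For unit-length vectors the dot product and cosine similarity give the same score, so the choice only changes results when the embeddings are not normalized. The snippet below illustrates this in plain Python; the vectors are made-up numbers, not model output, and it assumes the embeddings are normalized before indexing (if they are not, `encode` accepts `normalize_embeddings=True`).

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


raw_a, raw_b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
a, b = normalize(raw_a), normalize(raw_b)

# For unit vectors both scores agree (up to floating-point error)
assert math.isclose(dot(a, b), cosine(a, b))

# Without normalization the dot product also reflects vector length
print(dot(raw_a, raw_b), cosine(raw_a, raw_b))
```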


Generate a list of PointStruct objects

In [17]:
points = [
    PointStruct(
        id = chunk_data["vector_id"],
        vector = chunk_data["vector"],
        # Everything except the id and the vector itself goes into the payload
        payload = {key: value for key, value in chunk_data.items() if key not in ["vector_id", "vector"]},
    )
    for chunk_data in chunks_data
]

len(points)

309

In [18]:
points_title = points[0].payload["title"]

In [28]:
# Get the document's title from the first point's payload
points_title = points[0].payload["title"]

# Check whether the document is already in the collection

# Create a filter to match payload
title_filter = Filter(
    must=[
        FieldCondition(
            key="title",
            match=MatchValue(value=points_title)
        )
    ]
)

# Scroll through all matching vectors
scroll_result = qdrant_client.scroll(
    collection_name=collection_name,
    scroll_filter=title_filter,
    limit=1 # In this case, I only need 1 vector to know if the document is already indexed
)

# Access the points
vectors = scroll_result[0]  # scroll returns (records, next_page_offset); keep the list of records

if len(vectors) > 0:
    print("The document you are trying to index is already in the collection.")

else:
    print("Indexing document")
    qdrant_client.upsert(
        collection_name = collection_name,
        wait = True,
        points = points,
    )

The document you are trying to index is already in the collection.
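
The check above scrolls by `title` before upserting. A complementary option, which is not what this notebook does, is to derive each point ID deterministically from the document title and the chunk key, so that re-running the ingestion overwrites the same points instead of adding duplicates (Qdrant accepts UUID strings as point IDs). A sketch using only the standard library; `make_point_id`, `NAMESPACE`, and the `title/chunk_key` naming scheme are assumptions for illustration:

```python
import uuid

# Hypothetical namespace for this project; any fixed UUID works
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "energy_expert_v1")


def make_point_id(title: str, chunk_key: str) -> str:
    """Return the same UUID string every time for the same title and chunk key."""
    return str(uuid.uuid5(NAMESPACE, f"{title}/{chunk_key}"))
```

With IDs built this way, `upsert` replaces existing points for a document that was already indexed, at the cost of recomputing its embeddings.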


## Indexing a PDF into a vector DB

This section shows all the steps needed to load a PDF into a vector DB. The PDF file is stored in GCS.

In [3]:
gcs_file_path = "documents/summaries/resumen_reforma_energetica.pdf"

In [None]:
# Previous execution time: 2 min; current time: 1 min 11 s, just by using batch embeddings
upload_document(
    gcs_document_path=gcs_file_path,
    )

2025-04-06 23:40:18.267 | INFO     | rag_llm_energy_expert.ingest_document:upload_document:190 - embedding model successfully initialized
2025-04-06 23:40:18.267 | INFO     | rag_llm_energy_expert.ingest_document:parse_file:43 - Parsing file...
2025-04-06 23:40:18.950 | INFO     | rag_llm_energy_expert.parsers_auxiliars:extract_pdf_content:42 - Loading file from GCS...
2025-04-06 23:40:19.381 | INFO     | rag_llm_energy_expert.parsers_auxiliars:extract_pdf_content:47 - Extracting PDF content...
2025-04-06 23:40:19.381 | INFO     | rag_llm_energy_expert.parsers_auxiliars:extract_pdf_content:51 - Converting to markdown format...
2025-04-06 23:40:40.303 | INFO     | rag_llm_energy_expert.parsers_auxiliars: