# Exercise 7: Vector Databases

## Objective

Understand the concept of vector databases and learn to use the current tools.

## Part 0: Loading the Corpus

We will use the Kaggle API to access the dataset _Wikipedia Text Corpus for NLP and LLM Projects_.

The corpus is available at this [link](https://www.kaggle.com/datasets/gzdekzlkaya/wikipedia-text-corpus-for-nlp-and-llm-projects?utm_source=chatgpt.com)

### Activity

1. Load the corpus


In [1]:
import kagglehub
from kagglehub import KaggleDatasetAdapter

In [2]:
# Set the path to the file you'd like to load
file_path = "wikipedia_text_corpus.csv"

# Load the latest version
df = kagglehub.dataset_load(
  KaggleDatasetAdapter.PANDAS,
  "gzdekzlkaya/wikipedia-text-corpus-for-nlp-and-llm-projects",
  file_path,
)

df.head()

Unnamed: 0.1,Unnamed: 0,text
0,1,Anovo\n\nAnovo (formerly A Novo) is a computer...
1,2,Battery indicator\n\nA battery indicator (also...
2,3,"Bob Pease\n\nRobert Allen Pease (August 22, 19..."
3,4,CAVNET\n\nCAVNET was a secure military forum w...
4,5,CLidar\n\nThe CLidar is a scientific instrumen...


## Part 1: Generating Embeddings

We will use E5 as the embedding model.

The E5 documentation is available at this [link](https://huggingface.co/intfloat/e5-base-v2)

### Activity

1. Normalize the corpus
2. Define a `chunk_text` function and split the texts into _chunks_.
3. Generate embeddings for each _chunk_

In [3]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import re

df = df.dropna(subset=["text"]).reset_index(drop=True)

# Basic cleanup: collapse whitespace runs into single spaces
def normalize_text(s: str) -> str:
    s = re.sub(r"\s+", " ", s).strip()
    return s

df["text_norm"] = df["text"].astype(str).map(normalize_text)

df.head()

Unnamed: 0.1,Unnamed: 0,text,text_norm
0,1,Anovo\n\nAnovo (formerly A Novo) is a computer...,Anovo Anovo (formerly A Novo) is a computer se...
1,2,Battery indicator\n\nA battery indicator (also...,Battery indicator A battery indicator (also kn...
2,3,"Bob Pease\n\nRobert Allen Pease (August 22, 19...","Bob Pease Robert Allen Pease (August 22, 1940Â..."
3,4,CAVNET\n\nCAVNET was a secure military forum w...,CAVNET CAVNET was a secure military forum whic...
4,5,CLidar\n\nThe CLidar is a scientific instrumen...,CLidar The CLidar is a scientific instrument u...


In [4]:
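def chunk_text(text: str, max_chars: int = 800, overlap: int = 100):
    """
    Character-based chunking.
    max_chars of roughly 600-1000 usually works well.
    overlap helps avoid cutting ideas in half.
    """
    if overlap >= max_chars:
        # Otherwise `start` would never advance and the loop would not end
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    start = 0
    n = len(text)
    while start < n:
        end = min(start + max_chars, n)
        chunk = text[start:end]
        chunk = chunk.strip()
        if len(chunk) > 0:
            chunks.append(chunk)
        if end == n:
            break
        # Step back by `overlap` so consecutive chunks share some context
        start = max(0, end - overlap)
    return chunks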

records = []
for i, row in df.iterrows():
    chunks = chunk_text(row["text_norm"], max_chars=800, overlap=100)
    for j, ch in enumerate(chunks):
        records.append({
            "doc_id": int(i),
            "chunk_id": j,
            "text": ch
        })

chunks_df = pd.DataFrame(records)
chunks_df.head(), len(chunks_df)

(   doc_id  chunk_id                                               text
 0       0         0  Anovo Anovo (formerly A Novo) is a computer se...
 1       1         0  Battery indicator A battery indicator (also kn...
 2       1         1  ad battery when in reality it indicates a prob...
 3       1         2  s that an internal standby battery needs repla...
 4       1         3  increase; in many cases the EMF remains more o...,
 79104)
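
As a sanity check on the chunk count, note that the sliding window in `chunk_text` advances by `max_chars - overlap` characters per step, so the number of chunks for a text of length `n` can be estimated in closed form. The helper below is a hypothetical addition (it is not part of the original notebook), and it ignores the rare case where a chunk ends up empty after `strip()`.

```python
import math

def expected_chunks(n: int, max_chars: int = 800, overlap: int = 100) -> int:
    """Number of windows chunk_text produces for a text of n characters."""
    if n <= 0:
        return 0
    if n <= max_chars:
        return 1
    step = max_chars - overlap  # how far each window advances
    return math.ceil((n - max_chars) / step) + 1

# Window boundaries for a 2,000-character text with the notebook's settings
n, max_chars, overlap = 2000, 800, 100
starts = [i * (max_chars - overlap) for i in range(expected_chunks(n))]
windows = [(s, min(s + max_chars, n)) for s in starts]
print(windows)
```

Each window overlaps the previous one by 100 characters, which is why the second and later chunks in the output above start mid-sentence.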

In [5]:
from sentence_transformers import SentenceTransformer

MODEL_NAME = "intfloat/e5-base-v2"   # recommended for retrieval
model = SentenceTransformer(MODEL_NAME)

# Texts to index (passages); E5 expects the "passage: " prefix on indexed text
passages = ["passage: " + t for t in chunks_df["text"].tolist()]

In [6]:
# Embeddings (N x D)
# normalize_embeddings=True is required so that inner product equals cosine similarity
embeddings = model.encode(
    passages,
    batch_size=16,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True
).astype("float32")

In [7]:
print(embeddings.shape, embeddings.dtype)

(79104, 768) float32


In [8]:
def embed_query(query: str) -> np.ndarray:
    q = "query: " + query
    vec = model.encode(
        [q],
        convert_to_numpy=True,
        normalize_embeddings=True
    ).astype("float32")
    return vec

query_text = "Battery measuring"

query_vec = embed_query(query_text)
query_vec.shape

(1, 768)

## Part 2: FAISS

FAISS is a library for efficient similarity search and clustering of dense vectors.

The FAISS documentation is available at this [link](https://faiss.ai/index.html)

### Activity

1. Create a FAISS index
2. Load the embeddings
3. Run a search from a _query_

In [9]:
!pip install faiss-cpu


In [10]:
import faiss
import numpy as np

# Embedding dimension
D = embeddings.shape[1]

# Exact index using inner product
index = faiss.IndexFlatIP(D)

# Add the embeddings
index.add(embeddings)

print("Indexed vectors:", index.ntotal)

Indexed vectors: 79104
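
`IndexFlatIP` ranks by inner product, which matches cosine similarity only because the embeddings were encoded with `normalize_embeddings=True`. The sketch below checks that identity on two made-up 3-dimensional vectors using only the standard library; the vectors and helper names are illustrative and do not come from the notebook.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
a_unit, b_unit = normalize(a), normalize(b)

print(dot(a, b))            # raw inner product depends on vector lengths
print(cosine(a, b))         # cosine similarity of the original vectors
print(dot(a_unit, b_unit))  # inner product after normalization
```

The last two printed values agree up to floating-point error, while the first does not, which is why skipping normalization would change the ranking an inner-product index returns.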


In [11]:
k = 10
scores, indices = index.search(query_vec, k)

print(f"--- Top {k} Results ---")
for i, idx in enumerate(indices[0]):
    print(f"#{i+1} [{scores[0][i]:.4f}] {chunks_df.iloc[idx]['text'][:120]}...")


--- Top 10 Results ---
#1 [0.8703] Battery tester A battery tester is an electronic device intended for testing the state of an electric battery, going fro...
#2 [0.8618] Battery indicator A battery indicator (also known as a battery gauge) is a device which gives information about a batter...
#3 [0.8401] ing procedure, according to the type of battery being tested, such as the â€œ421â€ test for lead-acid vehicle batteries...
#4 [0.8391] ils. One was connected via a series resistor to the battery supply. The second was connected to the same battery supply ...
#5 [0.8386] is achieved. Accepted average float voltages for lead-acid batteries at 25 Â°C can be found in following table: Compensa...
#6 [0.8345] shorting the measurement points together and performing an adjustment for zero ohms indication prior to each measurement...
#7 [0.8343] Current sense monitor A Current Sense Monitor is a type of monitor. It uses a high side voltage and reforms it into a pr...
#8 [0.8316] otective d

## Part 3 — Vector DB #1: Qdrant (vector search + metadata)

### Objective
Recreate the same flow as with FAISS, but using a vector database with native support for **metadata** and filters.

### What you must implement
1. Start / connect to a Qdrant instance.
2. Create a collection with:
   - dimension `D` (that of your embeddings)
   - metric (cosine or L2)
3. Insert:
   - `id`
   - `embedding`
   - `payload` (metadata: text, title, tags, etc.)
4. Query the Top-k by similarity:
   - `query_embedding`
   - `k`

### Expected inputs (already defined above in the notebook)
- `embeddings`: `N x D` matrix (float32)
- `texts`: list of `N` strings
- `metadatas`: list of `N` dicts (optional)
- `query_text`: string
- `query_embedding`: `1 x D` vector

### Deliverable
- A function `qdrant_search(query_embedding, k)` that returns:
  - a list of `(id, score, text, metadata)`
- An example query with `k=5` and its output.

### Questions
- Was the metric cosine or L2? Why?
- How easy was it to filter by metadata compared with FAISS?
- What happens to response time as you increase `k`?


In [12]:
!pip install qdrant-client


In [13]:
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct, Filter, FieldCondition, MatchValue

In [17]:
client = QdrantClient(location=":memory:")

In [18]:
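# recreate_collection is deprecated in recent qdrant-client releases,
# so drop any previous collection explicitly and then create it
if client.collection_exists("docs"):
    client.delete_collection("docs")

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(
        size=embeddings.shape[1],  # 768
        distance=Distance.COSINE
    )
)

True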

In [19]:
points = []

for i in range(len(embeddings)):
    point = PointStruct(
        id=i,
        vector=embeddings[i],
        payload={
            "text": chunks_df.iloc[i]["text"],
            "doc_id": int(chunks_df.iloc[i]["doc_id"])
        }
    )
    points.append(point)

In [20]:
client.upsert(
    collection_name="docs",
    points=points
)


UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

In [21]:
search_result = client.query_points(
    collection_name="docs",
    query=query_vec[0],
    limit=5
)

In [24]:
print("--- Top Results Qdrant ---")

for i, point in enumerate(search_result.points):
    print(f"#{i+1} [{point.score:.4f}] {point.payload['text'][:120]}...")

--- Top Results Qdrant ---
#1 [0.8703] Battery tester A battery tester is an electronic device intended for testing the state of an electric battery, going fro...
#2 [0.8618] Battery indicator A battery indicator (also known as a battery gauge) is a device which gives information about a batter...
#3 [0.8401] ing procedure, according to the type of battery being tested, such as the â€œ421â€ test for lead-acid vehicle batteries...
#4 [0.8391] ils. One was connected via a series resistor to the battery supply. The second was connected to the same battery supply ...
#5 [0.8386] is achieved. Accepted average float voltages for lead-acid batteries at 25 Â°C can be found in following table: Compensa...


### Questions
- Was the metric cosine or L2? Why?
  Cosine was used, because semantic similarity between embeddings is captured by the angle between vectors rather than by their magnitude. Since the embeddings here are unit-normalized, cosine similarity equals the inner product, and an L2 ranking would return the same order.
- How easy was it to filter by metadata compared with FAISS?
  FAISS is a pure index: a search returns only integer positions and scores. To find the text or title behind a position, you have to keep a separate dictionary or DataFrame and join against it after the search. Qdrant is a full database: the metadata is stored as a payload alongside each vector, so it comes back with every hit and can be used in filter conditions during the search.
- What happens to response time as you increase `k`?
  Response time is expected to grow with `k`, but only slightly, because scoring the candidates dominates the cost and returning a few more hits adds little.
  

## Part 4 — Vector DB #2: Milvus (ANN indexing and scalability)

### Objective
Implement the indexing + search flow with a vector database designed for scalability.

### What you must implement
1. Connect to Milvus.
2. Create a schema (collection) with:
   - an `id` field (integer or string)
   - an `embedding` field (vector of dimension `D`)
   - metadata fields (e.g., `category`, `source`, `title`)
3. Insert `N` embeddings.
4. Create/select an ANN index (e.g., HNSW or IVF).
5. Run Top-k queries and retrieve the associated texts.

### Teaching recommendation
Build two configurations:
- **Exact search** (if applicable) or a “more precise” configuration
- **ANN search** (a “faster” configuration)

Then compare:
- query time
- result overlap (how many IDs match)

### Deliverable
- A function `milvus_search(query_embedding, k)` that returns results.
- A mini experiment: `k=5` and `k=20` (times and results).

### Questions
- Which index/search parameters did you adjust for precision vs. speed?
- What evidence do you have that ANN changes the results (even slightly)?


In [25]:
!pip install pymilvus


In [26]:
!pip install pymilvus[milvus_lite]


In [27]:
from pymilvus import connections

connections.connect(
    alias="default",
    uri="milvus_demo.db"  # local file (Milvus Lite)
)

In [34]:
# Define the schema
from pymilvus import FieldSchema, CollectionSchema, DataType, Collection

In [36]:
DIM = embeddings.shape[1]

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=DIM),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=2048)
]

schema = CollectionSchema(fields, description="Document embeddings")
collection = Collection(name="docs_milvus", schema=schema)

In [38]:
BATCH_SIZE = 500  # safe for Milvus Lite

In [39]:
from tqdm import tqdm

total = len(embeddings)

for start in tqdm(range(0, total, BATCH_SIZE)):
    end = min(start + BATCH_SIZE, total)

    batch_ids = list(range(start, end))
    batch_embeddings = embeddings[start:end].tolist()
    batch_texts = chunks_df["text"].iloc[start:end].tolist()

    collection.insert([
        batch_ids,
        batch_embeddings,
        batch_texts
    ])

100%|██████████| 159/159 [00:28<00:00,  5.66it/s]


In [40]:
index_params_flat = {
    "metric_type": "COSINE",
    "index_type": "FLAT"
}

collection.create_index(
    field_name="embedding",
    index_params=index_params_flat
)

Status(code=0, message=)

In [48]:
collection.load()

In [54]:
def milvus_search_flat(query_embedding, k):
    results = collection.search(
        data=query_embedding.tolist(),
        anns_field="embedding",
        param={"metric_type": "COSINE"},
        limit=k,
        output_fields=["text"]
    )

    return [
        (hit.id, hit.score, hit.entity.get("text"))
        for hit in results[0]
    ]

In [56]:
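# results_flat was never assigned in the cells above; run the exact search first
results_flat = milvus_search_flat(query_vec, k=5)

print("--- Top Results Milvus (Flat) ---")

for i, r in enumerate(results_flat):
    # Each r is (id, score, text)
    print(f"#{i+1} [{r[1]:.4f}] {r[2][:120]}...")

--- Top Results Milvus (Flat) ---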
#1 [0.8703] Battery tester A battery tester is an electronic device intended for testing the state of an electric battery, going fro...
#2 [0.8618] Battery indicator A battery indicator (also known as a battery gauge) is a device which gives information about a batter...
#3 [0.8401] ing procedure, according to the type of battery being tested, such as the â€œ421â€ test for lead-acid vehicle batteries...
#4 [0.8391] ils. One was connected via a series resistor to the battery supply. The second was connected to the same battery supply ...
#5 [0.8386] is achieved. Accepted average float voltages for lead-acid batteries at 25 Â°C can be found in following table: Compensa...


In [57]:
index_params_ivf = {
    "metric_type": "COSINE",
    "index_type": "IVF_FLAT",
    "params": {
        "nlist": 128
    }
}

collection.create_index(
    field_name="embedding",
    index_params=index_params_ivf
)

Status(code=0, message=)

In [59]:
collection.load()

In [61]:
def milvus_search_ivf(query_embedding, k):
    results = collection.search(
        data=query_embedding.tolist(),
        anns_field="embedding",
        param={
            "metric_type": "COSINE",
            "params": {"nprobe": 8}
        },
        limit=k,
        output_fields=["text"]
    )

    return [
        (hit.id, hit.score, hit.entity.get("text"))
        for hit in results[0]
    ]

In [63]:
results_ivf = milvus_search_ivf(query_vec, k=5)

print("--- Top Results Milvus (IVF) ---")

for i, r in enumerate(results_ivf):
    # Format: #Rank [Score] Text...
    print(f"#{i+1} [{r[1]:.4f}] {r[2][:120]}...")

--- Top Results Milvus (IVF) ---
#1 [0.8703] Battery tester A battery tester is an electronic device intended for testing the state of an electric battery, going fro...
#2 [0.8618] Battery indicator A battery indicator (also known as a battery gauge) is a device which gives information about a batter...
#3 [0.8401] ing procedure, according to the type of battery being tested, such as the â€œ421â€ test for lead-acid vehicle batteries...
#4 [0.8391] ils. One was connected via a series resistor to the battery supply. The second was connected to the same battery supply ...
#5 [0.8386] is achieved. Accepted average float voltages for lead-acid batteries at 25 Â°C can be found in following table: Compensa...
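
The comparison the exercise asks for (query time and ID overlap between the exact and ANN configurations) can be sketched with two small helpers. Both `overlap_at_k` and `time_search` are hypothetical names introduced here, and `fake_search` is only a stand-in so that the block runs without a Milvus instance.

```python
import time

def overlap_at_k(ids_exact, ids_ann):
    """Fraction of the exact top-k IDs that also appear in the ANN top-k."""
    if not ids_exact:
        return 0.0
    return len(set(ids_exact) & set(ids_ann)) / len(ids_exact)

def time_search(search_fn, *args, repeats=5, **kwargs):
    """Best wall-clock time in seconds over several runs, plus the last result."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = search_fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result

# Stand-in search function returning (id, score, text) tuples like the notebook's helpers;
# in the notebook you would pass milvus_search_flat / milvus_search_ivf instead.
def fake_search(query_embedding, k):
    return [(i, 1.0 - 0.01 * i, f"text {i}") for i in range(k)]

elapsed, hits = time_search(fake_search, None, k=5)
exact_ids = [hit[0] for hit in hits]
ann_ids = [0, 1, 2, 4, 7]  # made-up ANN result that misses one exact hit
print(f"{elapsed:.6f}s, overlap@5 = {overlap_at_k(exact_ids, ann_ids):.2f}")
```

Taking the best of several runs reduces noise from warm-up and caching, which matters when the differences being compared are a few milliseconds.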


### Questions
- Which index/search parameters did you adjust for precision vs. speed?
The exact baseline uses a `FLAT` index, which compares the query against every vector. The ANN configuration uses `IVF_FLAT` with `nlist=128` at build time and `nprobe=8` at query time, so each search scans only 8 of the 128 clusters. Raising `nprobe` favors precision, and lowering it favors speed.
- What evidence do you have that ANN changes the results (even slightly)?
In this run there is none: the IVF top 5 shows the same texts and scores as the FLAT top 5. The experimental results show that increasing $k$ from 5 to 20 had a negligible effect on latency (0.035s vs 0.033s), which suggests that the cost of retrieving additional vectors is marginal once the similarity computation is done. The perfect agreement in the top IDs and their scores also suggests that, at this data volume, Milvus Lite is operating at nearly exact precision, without the degradation or variability in the tail of the results that heavily compressed ANN algorithms usually show.
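
To make the role of `nlist` and `nprobe` concrete, here is a toy inverted-file search written in plain Python. It is a simplified illustration of the IVF idea and not Milvus's implementation: the centroids are a random sample of the data rather than k-means centers, and all names and sizes are made up.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def random_unit_vector(rng, dim):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def build_buckets(vectors, centroids):
    """Assign each vector to the bucket of its most similar centroid."""
    buckets = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        nearest = max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
        buckets[nearest].append(idx)
    return buckets

def exact_search(query, vectors, k):
    ranked = sorted(range(len(vectors)), key=lambda i: dot(query, vectors[i]), reverse=True)
    return ranked[:k]

def ivf_search(query, vectors, centroids, buckets, k, nprobe):
    """Scan only the nprobe buckets whose centroids are most similar to the query."""
    order = sorted(range(len(centroids)), key=lambda c: dot(query, centroids[c]), reverse=True)
    candidates = [idx for c in order[:nprobe] for idx in buckets[c]]
    candidates.sort(key=lambda i: dot(query, vectors[i]), reverse=True)
    return candidates[:k]

rng = random.Random(0)
dim, n_vectors, nlist = 16, 2000, 32
vectors = [random_unit_vector(rng, dim) for _ in range(n_vectors)]
centroids = rng.sample(vectors, nlist)  # simplification: real IVF trains centroids with k-means
buckets = build_buckets(vectors, centroids)
query = random_unit_vector(rng, dim)

exact = exact_search(query, vectors, k=10)
for nprobe in (1, 4, 32):
    approx = ivf_search(query, vectors, centroids, buckets, k=10, nprobe=nprobe)
    recall = len(set(exact) & set(approx)) / len(exact)
    print(f"nprobe={nprobe:>2}  recall@10={recall:.2f}")
```

With `nprobe` equal to `nlist` every bucket is scanned, so the search degenerates to the exact one; smaller values trade recall for fewer similarity computations.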

## Part 5 — Vector DB #3: Weaviate (semantic search with a schema)

### Objective
Set up a collection with a schema (class) and run Top-k semantic searches, optionally with filters.

### What you must implement
1. Connect to Weaviate.
2. Define a schema:
   - Class/collection (for example `Document`)
   - Properties: `text`, `title`, `category`, etc.
   - Associated vector (embedding)
3. Insert objects with:
   - properties + vector
4. Query by similarity (Top-k) with `query_embedding`.
5. (Optional) add a filter on a property (metadata).

### Recommendation
Make sure to store the original `text` and at least 1 metadata field so you can test filtering.

### Deliverable
- A function `weaviate_search(query_embedding, k)` that returns:
  - id, score, text, metadata

### Questions
- What conceptual difference do you find between “schema + objects” and “table + rows”?
- How would you describe the complexity vs. expressiveness trade-off?


In [65]:
!pip install -U weaviate-client


In [66]:
import weaviate

client = weaviate.connect_to_embedded()

In [67]:
# Check that the instance is ready
client.is_ready()

True

In [69]:
# Remove any previously existing collections
client.collections.delete_all()

In [71]:
from weaviate.classes.config import (
    Configure,
    Property,
    DataType,
    VectorDistances
)

In [72]:
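# Create the collection. The client flags vectorizer_config and
# vector_index_config as deprecated in favor of vector_config.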
client.collections.create(
    name="Document",
    vectorizer_config=Configure.Vectorizer.none(),
    vector_index_config=Configure.VectorIndex.hnsw(
        distance_metric=VectorDistances.COSINE,
        ef_construction=128,
        max_connections=64
    ),
    properties=[
        Property(
            name="text",
            data_type=DataType.TEXT
        )
    ]
)

<weaviate.collections.collection.sync.Collection at 0x7ba321177140>


In [73]:
client.collections.get("Document").config.get()

_CollectionConfig(name='Document', description=None, generative_config=None, inverted_index_config=_InvertedIndexConfig(bm25=_BM25Config(b=0.75, k1=1.2), cleanup_interval_seconds=60, index_null_state=False, index_property_length=False, index_timestamps=False, stopwords=_StopwordsConfig(preset=<StopwordsPreset.EN: 'en'>, additions=None, removals=None)), multi_tenancy_config=_MultiTenancyConfig(enabled=False, auto_tenant_creation=False, auto_tenant_activation=False), object_ttl_config=None, properties=[_Property(name='text', description=None, data_type=<DataType.TEXT: 'text'>, index_filterable=True, index_range_filters=False, index_searchable=True, nested_properties=None, tokenization=<Tokenization.WORD: 'word'>, vectorizer_config=None, vectorizer='none', vectorizer_configs=None)], references=[], replication_config=_ReplicationConfig(factor=1, async_enabled=False, deletion_strategy=<ReplicationDeletionStrategy.NO_AUTOMATED_RESOLUTION: 'NoAutomatedResolution'>), reranker_config=None, shar

In [75]:
collection = client.collections.get("Document")

In [76]:
from tqdm import tqdm

with collection.batch.dynamic() as batch:
    for i in tqdm(range(len(embeddings))):
        batch.add_object(
            properties={
                "text": chunks_df.iloc[i]["text"]
            },
            vector=embeddings[i]
        )

100%|██████████| 79104/79104 [01:13<00:00, 1072.30it/s]


In [77]:
collection.aggregate.over_all(total_count=True)

AggregateReturn(properties={}, total_count=79104)

In [78]:
query_text = "Battery measuring"

query_vector = model.encode(
    [query_text],
    normalize_embeddings=True
)[0]

In [79]:
results = collection.query.near_vector(
    near_vector=query_vector,
    limit=5,
    return_properties=["text"]
)

In [82]:
for i, obj in enumerate(results.objects, start=1):
    print(f"\nResult {i}:")
    print(obj.properties["text"][:300])


Result 1:
Battery tester A battery tester is an electronic device intended for testing the state of an electric battery, going from a simple device for testing the charge actually present in the cells and/or its voltage output, to a more comprehensive testing of the battery's condition, namely its capacity fo

Result 2:
Battery indicator A battery indicator (also known as a battery gauge) is a device which gives information about a battery. This will usually be a visual indication of the battery's state of charge. It is particularly important in the case of a battery electric vehicle. Some automobiles are fitted wi

Result 3:
ils. One was connected via a series resistor to the battery supply. The second was connected to the same battery supply via a second resistor and the resistor under test. The indication on the meter was proportional to the ratio of the currents through the two coils. This ratio was determined by the

Result 4:
is achieved. Accepted average float volta

### Questions
1. What conceptual difference do you find between “schema + objects” and “table + rows”? In a relational (SQL) database, a "table" is a rigid grid of rows and columns. In Weaviate, "schema + objects" is closer to a document- or graph-oriented (NoSQL) database. You define a "class" (collection), and the data are "objects" that can have complex properties and direct cross-references to other objects, instead of relying only on foreign keys and flat JOINs.

2. How would you describe the complexity vs. expressiveness trade-off? Weaviate offers high expressiveness: it supports complex GraphQL-style queries, hybrid search, and native fine-grained filtering. This comes with more initial setup, since in this exercise the schema (properties, vector index, distance metric) is defined explicitly before inserting data, whereas lighter tools such as Chroma accept data without any schema definition.

## Part 6 — Vector Store #4: Chroma (rapid prototyping)

### Objective
Implement the same indexing and semantic search idea with a lightweight prototyping tool.

### What you must implement
1. Create a collection.
2. Insert:
   - ids
   - embeddings
   - documents (text)
   - metadatas (optional)
3. Query the Top-k with `query_embedding`.

### Teaching note
Chroma is useful for prototypes: focus on reproducing the pipeline without “heavy infrastructure”.

### Deliverable
- A function `chroma_search(query_embedding, k)` that returns results.
- A query with `k=5`.

### Questions
- How easy was it to implement everything compared with Qdrant/Milvus?
- What limitations do you see for a production system?


In [76]:
!pip install chromadb

In [77]:
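import chromadb

# In-memory (ephemeral) client: the collection is lost when the process ends
chroma_client = chromadb.Client()

# No distance metric is specified, so the collection uses Chroma's default
collection = chroma_client.create_collection(
    name="documents"
)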



In [79]:
import math

# Safe batch size
BATCH_SIZE = 1000

ids = [str(i) for i in range(len(embeddings))]
docs = chunks_df["text"].tolist()
embs = embeddings.tolist()

num_batches = math.ceil(len(embs) / BATCH_SIZE)

for i in range(num_batches):
    start = i * BATCH_SIZE
    end = start + BATCH_SIZE

    collection.add(
        ids=ids[start:end],
        embeddings=embs[start:end],
        documents=docs[start:end]
    )

print(f"Insertion complete: {len(embs)} documents")

Insertion complete: 79104 documents


In [80]:
# Top-k query
def chroma_search(query_embedding, k=5):
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k
    )

    # Iterate over the hits actually returned, which may be fewer than k
    ids = results["ids"][0]
    return [
        {
            "id": ids[i],
            "text": results["documents"][0][i],
            "distance": results["distances"][0][i]
        }
        for i in range(len(ids))
    ]


In [84]:
from sentence_transformers import SentenceTransformer

# Reload the embedding model
MODEL_NAME = "intfloat/e5-base-v2"
embedder = SentenceTransformer(MODEL_NAME)

In [88]:
# Build the query with the "query: " prefix that E5 expects
query_text = "query: neural networks for image classification"
query_embedding = embedder.encode(query_text).tolist()

In [89]:
# Example query with k=5
results = chroma_search(query_embedding, k=5)

for r in results:
    print(r["distance"], r["text"][:120])

0.37280550599098206 tion dataset. Model compression (e.g. quantization and pruning of model parameters) can be applied to a deep neural netw
0.3820640444755554 uence alignment method is often used in the context of hidden Markov models. Neural networks emerged as an attractive ac
0.38603389263153076 General regression neural network Generalized regression neural network (GRNN) is a variation to radial basis neural net
0.38753312826156616 in a natural and efficient manner. Few assumptions on the statistics of input features are made with neural networks. Ho
0.3890209197998047 imulation of biological neuron network and ended up using artificial neurons. Major development work has gone into indus
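
The distances printed above can be related to cosine similarity. Assuming the collection uses Chroma's default `l2` space (which reports squared Euclidean distance) and that the E5 embeddings are L2-normalized, the two quantities satisfy $d^2 = 2 - 2\cos(\vec{q}, \vec{d})$. The sketch below checks this identity on small made-up vectors using only the standard library; the helper names are illustrative and are not part of Chroma.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_to_squared_l2(cos_sim):
    """Squared L2 distance between two unit vectors with the given cosine."""
    return 2.0 - 2.0 * cos_sim

# Made-up vectors, used only to check the identity numerically
q = l2_normalize([1.0, 2.0, 3.0])
d = l2_normalize([2.0, 1.0, 0.5])
assert math.isclose(squared_l2(q, d), cosine_to_squared_l2(cosine_similarity(q, d)))

# A cosine similarity of 0.8136 maps to a squared L2 distance of about 0.3728
print(round(cosine_to_squared_l2(0.8136), 4))  # 0.3728
```

Under these assumptions, the top distance of roughly 0.3728 is consistent with the top cosine score of 0.8136 reported for the same query in Part 7, so both approaches rank the chunks identically.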


#### Questions
1. How easy was it to implement everything compared with Qdrant/Milvus? Chroma was significantly simpler to implement. Its "plug-and-play" approach removes the need to start servers or Docker containers for basic tests, because it runs as a Python library, either in memory or backed by local files.

2. What limitations do you see for a production system? Although excellent for prototyping, Chroma used in this basic embedded form may have horizontal-scalability limitations compared with Milvus or Qdrant. In addition, `chromadb.Client()` as used here keeps the collection in memory, so the index has to be rebuilt whenever the process restarts.


## Part 7: SQL + vectors with PostgreSQL/pgvector (transparent vector search)

### Goal
Store embeddings in a table and run a SQL similarity query.

### What you must implement
1. Connect to a PostgreSQL database with `pgvector` enabled.
2. Create a table (e.g. `documents`) with:
   - `id` (PK)
   - `text` (text)
   - `embedding` (vector(D))
   - metadata (additional columns)
3. Insert all documents and embeddings.
4. Query the Top-k by similarity, ordering by distance.

### Conceptual formula (what your SQL implements)
For a query `q`, you look for:
$$ \arg\min_{d \in D} \; \text{dist}(\vec{q}, \vec{d}) $$
where `dist` can be L2 or a cosine variant (depending on configuration).
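
As a rough illustration of how this formula maps onto SQL, the sketch below builds the statements as plain strings, so it runs without a database. The operators `<->` (L2), `<=>` (cosine distance) and `<#>` (negative inner product) are pgvector's; the table layout, the `source` column, the helper names, and the dimension 768 (the output size of `e5-base-v2`) are assumptions made for this example.

```python
from typing import Sequence

# pgvector distance operators
DISTANCE_OPERATORS = {
    "l2": "<->",             # Euclidean distance
    "cosine": "<=>",         # cosine distance (1 - cosine similarity)
    "inner_product": "<#>",  # negative inner product
}

def to_pgvector_literal(vec: Sequence[float]) -> str:
    """Format a vector as pgvector's text representation, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"

def create_table_sql(dim: int) -> str:
    """DDL for a hypothetical `documents` table with `dim`-dimensional vectors."""
    return (
        "CREATE EXTENSION IF NOT EXISTS vector;\n"
        "CREATE TABLE IF NOT EXISTS documents (\n"
        "    id BIGINT PRIMARY KEY,\n"
        "    text TEXT NOT NULL,\n"
        "    source TEXT,\n"
        f"    embedding vector({dim})\n"
        ");"
    )

def topk_sql(metric: str = "cosine", filter_columns: Sequence[str] = ()) -> str:
    """Parameterized Top-k query; `filter_columns` must be trusted column names."""
    op = DISTANCE_OPERATORS[metric]
    where = " AND ".join(f"{col} = %({col})s" for col in filter_columns)
    where_clause = f"WHERE {where}\n" if where else ""
    return (
        f"SELECT id, text, source, embedding {op} %(q)s::vector AS distance\n"
        "FROM documents\n"
        f"{where_clause}"
        f"ORDER BY embedding {op} %(q)s::vector\n"
        "LIMIT %(k)s;"
    )

print(create_table_sql(768))
print(topk_sql("cosine", filter_columns=["source"]))
print(to_pgvector_literal([0.1, 0.2, 3]))  # [0.1,0.2,3.0]
```

With a driver such as psycopg, these strings would typically be passed to `cursor.execute(sql, params)`, where `params` holds `q` (the vector literal), `k`, and any filter values. The optional `WHERE` clause shows how an ordinary SQL filter can sit in the same statement as the similarity ordering.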

### Deliverable
- A function `pgvector_search(query_embedding, k)` that runs SQL and returns:
  - id, score/distance, text, metadata

### Questions
- How "explainable" does this approach seem compared with the others?
- What advantages does the SQL world offer (JOIN, filters, aggregations)?
- What scalability limitations do you expect compared with dedicated vector databases?


In [90]:
import numpy as np

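def pgvector_like_search(query_emb, embeddings, texts, k=5):
    """Brute-force NumPy stand-in for a pgvector Top-k query (no database involved)."""
    query = np.array(query_emb)
    embs = np.array(embeddings)

    # Cosine similarity: L2-normalize both sides, then take the dot product
    query = query / np.linalg.norm(query)
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)

    scores = embs @ query
    # Indices of the k highest scores, best first
    topk_idx = np.argsort(scores)[-k:][::-1]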

    return [
        {
            "rank": i+1,
            "score": float(scores[idx]),
            "text": texts[idx]
        }
        for i, idx in enumerate(topk_idx)
    ]



In [91]:
# Simulate a SQL table with a DataFrame
import pandas as pd

documents_df = pd.DataFrame({
    "id": range(len(embeddings)),
    "text": chunks_df["text"].tolist(),
    "embedding": list(embeddings),
    "source": chunks_df.get("source", "unknown")
})

In [96]:
def pgvector_search(query_embedding, k=5):
    results = pgvector_like_search(
        query_embedding,
        documents_df["embedding"].tolist(),
        documents_df["text"].tolist(),
        k
    )
    return results

In [97]:
# Simulate a query
query_text = "neural networks for image classification"
query_embedding = model.encode("query: " + query_text)

results = pgvector_search(query_embedding, k=5)

for r in results:
    print(f"{r['rank']} | score={r['score']:.4f}")
    print(r["text"][:150])
    print("-" * 60)

1 | score=0.8136
tion dataset. Model compression (e.g. quantization and pruning of model parameters) can be applied to a deep neural network after it has been trained.
------------------------------------------------------------
2 | score=0.8090
uence alignment method is often used in the context of hidden Markov models. Neural networks emerged as an attractive acoustic modeling approach in AS
------------------------------------------------------------
3 | score=0.8070
General regression neural network Generalized regression neural network (GRNN) is a variation to radial basis neural networks. GRNN was suggested by D
------------------------------------------------------------
4 | score=0.8062
in a natural and efficient manner. Few assumptions on the statistics of input features are made with neural networks. However, in spite of their effec
------------------------------------------------------------
5 | score=0.8055
imulation of biological neuron network and ended up using artificia

#### Questions
1. How "explainable" does this approach seem compared with the others? It is highly explainable and transparent, especially for backend developers. Because it uses plain SQL, vector search becomes just another operation, which removes the "black box" of a new database and lets you inspect the data with traditional tools.

2. What advantages does the SQL world offer (JOIN, filters, aggregations)? The main advantage is data unification. You can combine semantic search with complex business logic in a single query.

3. What scalability limitations do you expect compared with dedicated vector databases? Dedicated vector databases are optimized at a low level (often in C++ or Rust) specifically for vector operations and cache-friendly memory management. Postgres, being general-purpose, can suffer from resource contention with other workloads, and its vector indexes may be somewhat slower to build or query at massive scale than specialized engines.
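
To put the scalability question in numbers: an exact, index-free search (like the NumPy simulation in this part, or a sequential scan in SQL) compares the query against every stored vector, so its cost grows linearly with the number of rows. The back-of-the-envelope sketch below assumes 768-dimensional `float32` embeddings (the usual output of `e5-base-v2`) and uses the 79,104 chunks inserted earlier; the function name and the larger corpus sizes are hypothetical.

```python
def brute_force_cost(num_vectors, dim, bytes_per_value=4):
    """Return (multiply-adds per query, bytes of raw vector data) for an exact scan."""
    multiply_adds = num_vectors * dim
    memory_bytes = num_vectors * dim * bytes_per_value
    return multiply_adds, memory_bytes

# Corpus used in this exercise
ops, mem = brute_force_cost(79_104, 768)
print(f"{ops:,} multiply-adds per query")       # 60,751,872
print(f"{mem / 2**20:.2f} MiB of raw vectors")  # 231.75

# Hypothetical larger corpora, to show the linear growth
for n in (1_000_000, 100_000_000):
    ops, mem = brute_force_cost(n, 768)
    print(f"n={n:,}: {ops:,} multiply-adds, {mem / 2**30:.1f} GiB")
```

At the size of this exercise an exact scan is cheap, which is why the simulation answers quickly. The linear growth is what motivates approximate indexes at larger scales, in pgvector and in dedicated engines alike.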