# Exercise 7: Vector Databases

## Michael Perugachi

## Objective of the exercise

Understand the concept of vector databases and learn to use the current tools.

## Part 0: Loading the Corpus

We will use the Kaggle API to access the _Wikipedia Text Corpus for NLP and LLM Projects_ dataset.

The corpus is available at this [link](https://www.kaggle.com/datasets/gzdekzlkaya/wikipedia-text-corpus-for-nlp-and-llm-projects?utm_source=chatgpt.com)

### Activity

1. Load the corpus


In [3]:
import kagglehub
from kagglehub import KaggleDatasetAdapter

In [4]:
# Set the path to the file you'd like to load
file_path = "wikipedia_text_corpus.csv"

# Load the latest version
df = kagglehub.dataset_load(
  KaggleDatasetAdapter.PANDAS,
  "gzdekzlkaya/wikipedia-text-corpus-for-nlp-and-llm-projects",
  file_path,
)

df.head()

Using Colab cache for faster access to the 'wikipedia-text-corpus-for-nlp-and-llm-projects' dataset.


Unnamed: 0.1,Unnamed: 0,text
0,1,Anovo\n\nAnovo (formerly A Novo) is a computer...
1,2,Battery indicator\n\nA battery indicator (also...
2,3,"Bob Pease\n\nRobert Allen Pease (August 22, 19..."
3,4,CAVNET\n\nCAVNET was a secure military forum w...
4,5,CLidar\n\nThe CLidar is a scientific instrumen...


## Part 1: Generating Embeddings

We will use E5 as the embedding model.

The E5 documentation is available at this [link](https://huggingface.co/intfloat/e5-base-v2)

### Activity

1. Normalize the corpus
2. Define a `chunk_text` function and split the texts into _chunks_.
3. Generate embeddings for each _chunk_

In [7]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import re

df = df.dropna(subset=["text"]).reset_index(drop=True)

# Basic cleanup: collapse runs of whitespace into single spaces
def normalize_text(s: str) -> str:
    s = re.sub(r"\s+", " ", s).strip()
    return s

df["text_norm"] = df["text"].astype(str).map(normalize_text)

df.head()

Unnamed: 0.1,Unnamed: 0,text,text_norm
0,1,Anovo\n\nAnovo (formerly A Novo) is a computer...,Anovo Anovo (formerly A Novo) is a computer se...
1,2,Battery indicator\n\nA battery indicator (also...,Battery indicator A battery indicator (also kn...
2,3,"Bob Pease\n\nRobert Allen Pease (August 22, 19...","Bob Pease Robert Allen Pease (August 22, 1940√Ç..."
3,4,CAVNET\n\nCAVNET was a secure military forum w...,CAVNET CAVNET was a secure military forum whic...
4,5,CLidar\n\nThe CLidar is a scientific instrumen...,CLidar The CLidar is a scientific instrument u...


In [6]:
def chunk_text(text: str, max_chars: int = 800, overlap: int = 100):
    """
    Character-based chunking.
    max_chars of roughly 600-1000 usually works well.
    overlap helps avoid cutting ideas in half.
    """
    # Guard against a window that never advances (the loop would not terminate)
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    chunks = []
    start = 0
    n = len(text)
    while start < n:
        end = min(start + max_chars, n)
        chunk = text[start:end]
        chunk = chunk.strip()
        if len(chunk) > 0:
            chunks.append(chunk)
        if end == n:
            break
        start = max(0, end - overlap)
    return chunks

records = []
for i, row in df.iterrows():
    chunks = chunk_text(row["text_norm"], max_chars=800, overlap=100)
    for j, ch in enumerate(chunks):
        records.append({
            "doc_id": int(i),
            "chunk_id": j,
            "text": ch
        })

chunks_df = pd.DataFrame(records)
chunks_df.head(), len(chunks_df)

(   doc_id  chunk_id                                               text
 0       0         0  Anovo Anovo (formerly A Novo) is a computer se...
 1       1         0  Battery indicator A battery indicator (also kn...
 2       1         1  ad battery when in reality it indicates a prob...
 3       1         2  s that an internal standby battery needs repla...
 4       1         3  increase; in many cases the EMF remains more o...,
 79104)
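
To make the sliding window concrete, here is a minimal sketch that reproduces only the offset arithmetic of `chunk_text` on a hypothetical 2000-character text. `chunk_spans` is an illustrative helper, not part of the notebook, and it assumes `overlap` is smaller than `max_chars`.

```python
def chunk_spans(n: int, max_chars: int = 800, overlap: int = 100) -> list[tuple[int, int]]:
    """Return the (start, end) character offsets a sliding window would cut."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    spans = []
    start = 0
    while start < n:
        end = min(start + max_chars, n)
        spans.append((start, end))
        if end == n:
            break
        start = end - overlap  # each window begins max_chars - overlap later
    return spans


spans = chunk_spans(2000, max_chars=800, overlap=100)
print(spans)  # [(0, 800), (700, 1500), (1400, 2000)]
```

Each window starts `max_chars - overlap` characters after the previous one, regardless of word boundaries, which is why a chunk can begin mid-word (as in the `ad battery when...` chunk above).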

In [8]:
from sentence_transformers import SentenceTransformer

MODEL_NAME = "intfloat/e5-base-v2"   # recommended for retrieval
model = SentenceTransformer(MODEL_NAME)

# Texts to index (passages); E5 expects the "passage: " prefix on indexed text
passages = ["passage: " + t for t in chunks_df["text"].tolist()]

In [9]:
# Embeddings (N x D)
# normalize_embeddings=True yields unit vectors, so the inner product equals cosine similarity
embeddings = model.encode(
    passages,
    batch_size=16,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True
).astype("float32")

In [10]:
print(embeddings.shape, embeddings.dtype)

(79104, 768) float32
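
The reason `normalize_embeddings=True` matters can be checked on toy vectors: once vectors are scaled to unit length, their inner product equals their cosine similarity. The sketch below uses plain Python lists and hypothetical helper names (`l2_normalize`, `dot`, `cosine`); the real embeddings are 768-dimensional NumPy arrays.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two arbitrary (non-normalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

# Inner product of the normalized vectors vs. cosine of the raw vectors
ip_normalized = dot(l2_normalize(a), l2_normalize(b))
print(round(cosine(a, b), 6), round(ip_normalized, 6))  # 0.733333 0.733333
```

This is what allows an inner-product index to stand in for cosine similarity, provided every stored vector and every query is normalized the same way.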


In [11]:
def embed_query(query: str) -> np.ndarray:
    q = "query: " + query
    vec = model.encode(
        [q],
        convert_to_numpy=True,
        normalize_embeddings=True
    ).astype("float32")
    return vec

query_text = "Battery measuring"

query_vec = embed_query(query_text)
query_vec.shape

(1, 768)

## Part 2: FAISS

FAISS is a library for efficient similarity search and clustering of dense vectors.

The FAISS documentation is available at this [link](https://faiss.ai/index.html)

### Activity

1. Create a FAISS index
2. Load the embeddings
3. Run a search from a _query_

In [12]:
!pip install faiss-cpu
import faiss

# Embedding dimension
dim = embeddings.shape[1]

# Create a FAISS index (exact inner-product search)
index = faiss.IndexFlatIP(dim)


In [13]:
# Add the embeddings to the index
index.add(embeddings)

# Check how many vectors were indexed
print("Indexed vectors:", index.ntotal)


Indexed vectors: 79104


In [14]:
# Number of results to retrieve
k = 5

# Search the index
distances, indices = index.search(query_vec, k)
for i, idx in enumerate(indices[0]):
    # FAISS pads with -1 when fewer than k results exist, so check the bounds
    if 0 <= idx < len(passages):
        print(f"\nResult {i + 1}")
        print(f"Score (similarity): {distances[0][i]:.4f}")
        print(f"Text:\n{passages[idx][:400]}...")
    else:
        print(f"\nResult {i + 1} (invalid index: {idx})")
        print(f"Score (similarity): {distances[0][i]:.4f}")
        print("Text: <unavailable because the index is out of range>")


Result 1
Score (similarity): 0.8703
Text:
passage: Battery tester A battery tester is an electronic device intended for testing the state of an electric battery, going from a simple device for testing the charge actually present in the cells and/or its voltage output, to a more comprehensive testing of the battery's condition, namely its capacity for accumulating charge and any possible flaws affecting the battery's performance and securi...

Result 2
Score (similarity): 0.8618
Text:
passage: Battery indicator A battery indicator (also known as a battery gauge) is a device which gives information about a battery. This will usually be a visual indication of the battery's state of charge. It is particularly important in the case of a battery electric vehicle. Some automobiles are fitted with a battery condition meter to monitor the starter battery. This meter is, essentially, a ...

Result 3
Score (similarity): 0.8401
Text:
passage: ing procedure, according to the type of batte
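
A flat inner-product index such as `IndexFlatIP` compares the query against every stored vector. As a rough illustration of that exhaustive search (not FAISS's actual implementation, which is optimized native code), the following sketch ranks a few hypothetical 2-D vectors; `exact_top_k` is an assumed name.

```python
import heapq


def exact_top_k(query, vectors, k):
    """Score every vector by inner product and keep the k best (exhaustive search)."""
    scored = (
        (sum(q * x for q, x in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    best = heapq.nlargest(k, scored)
    return [(idx, score) for score, idx in best]


# Hypothetical unit vectors standing in for passage embeddings
vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [0.8, 0.6],
]

hits = exact_top_k([1.0, 0.0], vectors, k=2)
print(hits)  # [(0, 1.0), (3, 0.8)]
```

The cost grows linearly with the number of stored vectors, which is the motivation for the approximate (ANN) indexes explored later in the notebook.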

## Part 3: Vector DB #1 - Qdrant (vector search + metadata)

### Objective
Recreate the same flow as with FAISS, but using a vector database with native support for **metadata** and filters.

### What you must implement
1. Start / connect to a Qdrant instance.
2. Create a collection with:
   - dimension `D` (that of your embeddings)
   - metric (cosine or L2)
3. Insert:
   - `id`
   - `embedding`
   - `payload` (metadata: text, title, tags, etc.)
4. Query Top-k by similarity:
   - `query_embedding`
   - `k`

### Expected inputs (already defined above in the notebook)
- `embeddings`: `N x D` matrix (float32)
- `texts`: list of `N` strings
- `metadatas`: list of `N` dicts (optional)
- `query_text`: string
- `query_embedding`: `1 x D` vector

### Deliverable
- A function `qdrant_search(query_embedding, k)` that returns:
  - a list of `(id, score, text, metadata)`
- An example query with `k=5` and its output.

### Questions
**- Was the metric cosine or L2? Why?**
Cosine was used because it measures how well aligned two vectors are regardless of their magnitude, which suits semantic search. Since the embeddings are already L2-normalized, it also produces the same scores as the inner-product index used with FAISS.

**- How easy was filtering by metadata compared with FAISS?**
It was easier. FAISS has no native metadata filters, so filtering requires extra code and external structures.
In Qdrant, metadata filtering is native: each vector carries a JSON payload and the engine accepts filtered queries directly.

**- What happens to response time as you increase `k`?**
As `k` grows, response time also increases slightly, because the engine has to rank and return more results.
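
To illustrate the extra code that metadata filtering needs when an index only returns ids and scores, here is a minimal post-filtering sketch. The ranking, the metadata dictionary, and the `filtered_top_k` helper are all hypothetical; with a real flat index one would typically request more than `k` hits first, since the filter discards some of them.

```python
def filtered_top_k(ranked_ids, metadata, predicate, k):
    """Keep the first k ids (already ordered by score) whose metadata passes the predicate."""
    kept = []
    for idx in ranked_ids:
        if predicate(metadata[idx]):
            kept.append(idx)
            if len(kept) == k:
                break
    return kept


# Hypothetical ranking as an index might return it, best match first
ranked_ids = [4, 0, 3, 1, 2]

# External structure mapping each vector id to its metadata
metadata = {
    0: {"doc_id": 0},
    1: {"doc_id": 1},
    2: {"doc_id": 1},
    3: {"doc_id": 2},
    4: {"doc_id": 1},
}

result = filtered_top_k(ranked_ids, metadata, lambda m: m["doc_id"] == 1, k=2)
print(result)  # [4, 1]
```

A database with native payload filters performs this step inside the engine, so the caller does not have to guess how many extra candidates to fetch.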


In [15]:
!pip install -q qdrant-client

In [16]:
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct


In [17]:
qdrant_client = QdrantClient(":memory:")  # in-memory instance, suitable for Colab
collection_name = "documents"


In [18]:
dim = embeddings.shape[1]

# Delete the collection if it already exists (avoids errors on re-runs)
try:
    qdrant_client.delete_collection(collection_name)
except Exception:
    pass

qdrant_client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(
        size=dim,
        distance=Distance.COSINE
    )
)


True

In [19]:
points = []

for i, emb in enumerate(embeddings):
    payload = {
        "text": passages[i]
    }

    if 'metadatas' in globals() and metadatas is not None:
        payload.update(metadatas[i])

    points.append(
        PointStruct(
            id=i,
            vector=emb.tolist(),   #  vector 1D
            payload=payload
        )
    )

qdrant_client.upsert(
    collection_name=collection_name,
    points=points
)



UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

In [20]:
def qdrant_search(query_embedding: np.ndarray, k: int = 5):
    """
    Returns: a list of (id, score, text, metadata)
    Compatible with Qdrant on Google Colab
    """

    # Ensure a 1-D vector
    if query_embedding.ndim == 2:
        query_embedding = query_embedding[0]

    search_result = qdrant_client.query_points(
        collection_name=collection_name,
        query=query_embedding.tolist(),
        limit=k,
        with_payload=True
    )

    results = []
    for hit in search_result.points:
        results.append((
            hit.id,
            hit.score,
            hit.payload.get("text"),
            hit.payload
        ))

    return results


In [21]:
k = 5
results = qdrant_search(query_vec, k)

for i, (doc_id, score, text, metadata) in enumerate(results):
    print(f"\nResult {i + 1}")
    print(f"ID: {doc_id}")
    print(f"Score (similarity): {score:.4f}")
    print(f"Text:\n{text[:400]}...")



Result 1
ID: 10176
Score (similarity): 0.8703
Text:
passage: Battery tester A battery tester is an electronic device intended for testing the state of an electric battery, going from a simple device for testing the charge actually present in the cells and/or its voltage output, to a more comprehensive testing of the battery's condition, namely its capacity for accumulating charge and any possible flaws affecting the battery's performance and securi...

Result 2
ID: 1
Score (similarity): 0.8618
Text:
passage: Battery indicator A battery indicator (also known as a battery gauge) is a device which gives information about a battery. This will usually be a visual indication of the battery's state of charge. It is particularly important in the case of a battery electric vehicle. Some automobiles are fitted with a battery condition meter to monitor the starter battery. This meter is, essentially, a ...

Result 3
ID: 10177
Score (similarity): 0.8401
Text:
passage: ing procedure, acco

## Part 4: Vector DB #2 - Milvus (ANN indexing and scalability)

### Objective
Implement the indexing + search flow with a vector database geared toward scalability.

### What you must implement
1. Connect to Milvus.
2. Create a schema (collection) with:
   - an `id` field (integer or string)
   - an `embedding` field (vector `D`)
   - metadata fields (e.g., `category`, `source`, `title`)
3. Insert `N` embeddings.
4. Create/select an ANN index (e.g. HNSW or IVF).
5. Run Top-k queries and retrieve the associated texts.

### Teaching recommendation
Set up two configurations:
- **Exact search** (if applicable) or a "more precise" configuration
- **ANN search** (a "faster" configuration)

Then compare:
- query time
- result overlap (how many IDs match)

### Deliverable
- A function `milvus_search(query_embedding, k)` that returns results.
- A mini experiment: `k=5` and `k=20` (times and results).

### Questions
**- Which index/search parameters did you tune for precision vs. speed?**
Two parameters were tuned. The first is the index type, `IVF_FLAT` with `nlist=128`, which enables approximate (ANN) search.
The second is the search parameter `nprobe`: with `nprobe=128` every cluster is scanned (the more precise setting), while `nprobe=1` restricts the query to a single cluster, which makes it faster.

**- What evidence do you have that ANN changes the results (even slightly)?**
The code computes the percentage of matching IDs ("overlap") between the two searches. An overlap below 100% would show that ANN returns different results. In this run, however, the overlap was 100% for both `k=5` and `k=20`, and the timings were nearly identical, so this particular query gives no evidence of a difference; its nearest neighbours may simply all fall within the single probed cluster.
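
The roles of `nlist` and `nprobe` can be illustrated with a toy inverted-file index in pure Python. This is a simplified sketch under stated assumptions: centroids are just the first `nlist` vectors (a real IVF index trains them, typically with k-means), the data is random, and none of the helper names come from Milvus.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def build_ivf(vectors, nlist):
    """Assign every vector to the bucket of its most similar centroid."""
    centroids = vectors[:nlist]  # simplification: no k-means training
    buckets = [[] for _ in range(nlist)]
    for idx, vec in enumerate(vectors):
        best = max(range(nlist), key=lambda c: dot(vec, centroids[c]))
        buckets[best].append(idx)
    return centroids, buckets


def ivf_search(query, vectors, centroids, buckets, k, nprobe):
    """Rank only the vectors stored in the nprobe most promising buckets."""
    order = sorted(range(len(centroids)),
                   key=lambda c: dot(query, centroids[c]), reverse=True)
    candidates = [idx for c in order[:nprobe] for idx in buckets[c]]
    candidates.sort(key=lambda idx: dot(query, vectors[idx]), reverse=True)
    return candidates[:k]


def overlap(reference, other):
    """Percentage of reference ids that also appear in other."""
    return len(set(reference) & set(other)) / len(reference) * 100


rng = random.Random(0)
vectors = [unit([rng.gauss(0, 1) for _ in range(8)]) for _ in range(200)]
query = unit([rng.gauss(0, 1) for _ in range(8)])
nlist, k = 8, 5

centroids, buckets = build_ivf(vectors, nlist)
exact = ivf_search(query, vectors, centroids, buckets, k, nprobe=nlist)  # scans every bucket
fast = ivf_search(query, vectors, centroids, buckets, k, nprobe=1)       # scans one bucket
print(f"overlap with nprobe=1: {overlap(exact, fast):.0f}%")
```

With `nprobe` equal to `nlist` the search degenerates into an exhaustive scan, so it serves as the reference; lowering `nprobe` reduces the work per query at the risk of missing neighbours stored in unprobed buckets.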


In [22]:
!pip install -q "pymilvus[milvus_lite]"

In [23]:
from pymilvus import MilvusClient, DataType
import numpy as np
import time

# Plain local file URI: this starts Milvus Lite
client = MilvusClient(uri="milvus_demo.db")

print("Milvus Lite started successfully")


Milvus Lite started successfully


In [24]:
collection_name = "documents_milvus"
dim = embeddings.shape[1]

if client.has_collection(collection_name):
    client.drop_collection(collection_name)

schema = MilvusClient.create_schema(
    auto_id=False,
    enable_dynamic_field=True
)

schema.add_field("id", DataType.INT64, is_primary=True)
schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=dim)
schema.add_field("text", DataType.VARCHAR, max_length=2048)
schema.add_field("category", DataType.VARCHAR, max_length=50)

index_params = client.prepare_index_params()
index_params.add_index(
    field_name="embedding",
    index_type="IVF_FLAT",
    metric_type="COSINE",
    params={"nlist": 128}
)

client.create_collection(
    collection_name=collection_name,
    schema=schema,
    index_params=index_params
)


In [25]:
batch_size = 500  # safe for Colab

for start in range(0, len(embeddings), batch_size):
    end = min(start + batch_size, len(embeddings))  # clamp the last batch

    batch_data = [
        {
            "id": i,
            "embedding": embeddings[i].tolist(),
            "text": passages[i],
            "category": "documento"
        }
        for i in range(start, end)
    ]

    client.insert(
        collection_name=collection_name,
        data=batch_data
    )

    print(f"Inserted documents {start} -> {end}")


Inserted documents 0 -> 500
Inserted documents 500 -> 1000
Inserted documents 1000 -> 1500
...
Inserted documents 13000 -> 13500
...

In [26]:
import time

def milvus_search(query_embedding, k, nprobe):
    """
    Search in Milvus Lite.
    Returns (hits, elapsed_ms), where each hit is (id, score, text, entity).
    """
    search_params = {
        "metric_type": "COSINE",
        "params": {"nprobe": nprobe}
    }

    start = time.time()
    results = client.search(
        collection_name=collection_name,
        data=[query_embedding.tolist()],
        limit=k,
        search_params=search_params,
        output_fields=["text", "category"]
    )
    elapsed = (time.time() - start) * 1000

    hits = []
    for hit in results[0]:
        hits.append((
            hit["id"],
            hit["distance"],
            hit["entity"]["text"],
            hit["entity"]
        ))

    return hits, elapsed


In [27]:
query_vec = query_vec.squeeze()

for k in [5, 20]:
    print(f"\n🔹 k = {k}")

    # More precise: probe every cluster (nprobe == nlist)
    res_precise, t_precise = milvus_search(query_vec, k, nprobe=128)

    # Faster (ANN): probe a single cluster
    res_fast, t_fast = milvus_search(query_vec, k, nprobe=1)

    ids_precise = {r[0] for r in res_precise}
    ids_fast = {r[0] for r in res_fast}

    overlap = len(ids_precise & ids_fast) / k * 100

    print(f"Precise: {t_precise:.2f} ms")
    print(f"Fast:    {t_fast:.2f} ms")
    print(f"Overlap: {overlap:.1f}%")
    print(f"Sample text: {res_fast[0][2][:100]}...")



🔹 k = 5
Precise: 46.06 ms
Fast:    39.87 ms
Overlap: 100.0%
Sample text: passage: Battery tester A battery tester is an electronic device intended for testing the state of a...

🔹 k = 20
Precise: 39.71 ms
Fast:    40.35 ms
Overlap: 100.0%
Sample text: passage: Battery tester A battery tester is an electronic device intended for testing the state of a...
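
Single measurements like the ones above vary from run to run (at `k = 20` the precise setting was even slightly faster than the fast one), so a more stable comparison would repeat each query and report the median. A minimal sketch, assuming a hypothetical helper `median_latency_ms` and a stand-in workload instead of a live Milvus call:

```python
import statistics
import time


def median_latency_ms(fn, repeats=20, warmup=3):
    """Call fn several times and return the median wall-clock latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not measured
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


# Stand-in workload; in the notebook this could wrap a call such as
# lambda: milvus_search(query_vec, 5, nprobe=1)
latency = median_latency_ms(lambda: sum(i * i for i in range(10_000)))
print(f"median latency: {latency:.3f} ms")
```

The median is less sensitive than a single sample to one-off delays such as cache warm-up or background activity in the Colab runtime.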


## Part 5: Vector DB #3 - Weaviate (semantic search with a schema)

### Objective
Set up a collection with a schema (class) and run Top-k semantic searches, optionally with filters.

### What you must implement
1. Connect to Weaviate.
2. Define a schema:
   - Class/collection (for example `Document`)
   - Properties: `text`, `title`, `category`, etc.
   - Associated vector (embedding)
3. Insert objects with:
   - properties + vector
4. Query by similarity (Top-k) with `query_embedding`.
5. (Optional) add a filter on a property (metadata).

### Recommendation
Make sure to store the original `text` and at least 1 metadata field so you can test filtering.

### Deliverable
- A function `weaviate_search(query_embedding, k)` that returns:
  - id, score, text, metadata

### Questions
**- What conceptual difference do you see between "schema + objects" and "table + rows"?**
In the "schema + objects" model, each object stores its data together with its embedding and metadata, so semantic search is built in.
In the "table + rows" model of a relational database, rows hold structured data with no native semantic representation, so search is limited to exact or pattern matching unless a vector extension such as pgvector is added.

**- How would you describe the trade-off between complexity and expressiveness?**

Weaviate is more expressive because it supports AI features and vector search natively, but that also makes it more complex to design and operate. Relational databases are simpler and more familiar, though less suited to semantic tasks.

In [28]:
!pip install weaviate-client


In [29]:
import weaviate
from weaviate.embedded import EmbeddedOptions
import time


In [30]:
# The embedded_options argument is not directly used in connect_to_embedded() in this client version.
# Calling it without arguments will start an embedded instance with default settings.
client = weaviate.connect_to_embedded()

print("Weaviate started successfully")

INFO:weaviate-client:Binary /root/.cache/weaviate-embedded did not exist. Downloading binary from https://github.com/weaviate/weaviate/releases/download/v1.30.5/weaviate-v1.30.5-Linux-amd64.tar.gz
INFO:weaviate-client:Started /root/.cache/weaviate-embedded: process ID 7862


Weaviate started successfully


In [31]:
collection_name = "Document"

# Delete it if it already exists
if client.collections.exists(collection_name):
    client.collections.delete(collection_name)

# Create the collection
collection = client.collections.create(
    name=collection_name,
    vectorizer_config=None,  # embeddings are supplied externally
    properties=[
        weaviate.classes.config.Property(
            name="text",
            data_type=weaviate.classes.config.DataType.TEXT
        ),
        weaviate.classes.config.Property(
            name="category",
            data_type=weaviate.classes.config.DataType.TEXT
        )
    ]
)

print("Collection created successfully")


Collection created successfully


In [32]:
batch_size = 200

with collection.batch.dynamic() as batch:
    for i in range(len(embeddings)):
        batch.add_object(
            properties={
                "text": passages[i],
                "category": "documento"
            },
            vector=embeddings[i].tolist()
        )

print("Documents inserted into Weaviate")


Documents inserted into Weaviate


In [33]:
def weaviate_search(query_embedding, k):
    start = time.time()

    results = collection.query.near_vector(
        near_vector=query_embedding.tolist(),
        limit=k,
        return_metadata=["distance"]
    )

    elapsed = (time.time() - start) * 1000

    hits = []
    for obj in results.objects:
        hits.append({
            "id": obj.uuid,
            "score": 1 - obj.metadata.distance,
            "text": obj.properties["text"],
            "metadata": {
                "category": obj.properties["category"]
            }
        })

    return hits, elapsed
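
The search functions in this notebook return different shapes: `qdrant_search` yields tuples with a similarity score, while `weaviate_search` converts a cosine distance with `1 - distance`. One way to compare backends side by side is to normalize everything into a single record type. The sketch below is only an illustration: `SearchHit` and both constructors are hypothetical names, the shortened texts are placeholders, and the `0.25` distance is a made-up value.

```python
from dataclasses import dataclass, field


@dataclass
class SearchHit:
    id: str
    similarity: float  # cosine similarity, higher is better
    text: str
    metadata: dict = field(default_factory=dict)


def from_similarity(id_, similarity, text, metadata=None):
    """Build a SearchHit from a backend that already reports similarity."""
    return SearchHit(str(id_), float(similarity), text, metadata or {})


def from_cosine_distance(id_, distance, text, metadata=None):
    """Build a SearchHit from a cosine distance (distance = 1 - similarity)."""
    return SearchHit(str(id_), 1.0 - float(distance), text, metadata or {})


# The first hit mirrors the id and score printed by the Qdrant query above
a = from_similarity(10176, 0.8703, "passage: Battery tester")
# The second hit uses an invented distance purely for illustration
b = from_cosine_distance("874", 0.25, "passage: ASBIS", {"category": "documento"})

print(a.similarity, b.similarity)
```

Keeping one convention (here, higher means more similar) avoids misreading a distance as a score when results from several engines are printed together.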


In [34]:
# Uses the embedding of the first passage as the query, so the top hit is that same passage
query_embedding = embeddings[0]

for k in [5, 20]:
    results, time_ms = weaviate_search(query_embedding, k)

    print(f"\n🔹 k={k}")
    print(f"Time: {time_ms:.2f} ms")
    print(f"Sample text:\n{results[0]['text'][:150]}...")



🔹 k=5
Time: 7.70 ms
Sample text:
passage: Anovo Anovo (formerly A Novo) is a computer services company based in Beauvais, France. It was founded in 1987, went public in 1999, and is c...

🔹 k=20
Time: 5.74 ms
Sample text:
passage: Anovo Anovo (formerly A Novo) is a computer services company based in Beauvais, France. It was founded in 1987, went public in 1999, and is c...


## Part 6: Vector Store #4 - Chroma (rapid prototyping)

### Objective
Implement the same indexing and semantic-search idea with a lightweight prototyping tool.

### What you must implement
1. Create a collection.
2. Insert:
   - ids
   - embeddings
   - documents (text)
   - metadatas (optional)
3. Query Top-k with `query_embedding`.

### Teaching note
Chroma is useful for prototypes: focus on reproducing the pipeline without "heavy infrastructure".

### Deliverable
- A function `chroma_search(query_embedding, k)` that returns results.
- One query with `k=5`.

### Questions
**- How easy was it to implement everything compared with Qdrant/Milvus?**
ChromaDB was easier to implement than Qdrant and Milvus because it does not require defining schemas, indexes, or advanced parameters. You only create the collection and insert the embeddings directly.

**- What limitations do you see for a production system?**

ChromaDB is geared more toward local prototypes. It offers less support for scalability, high availability, fine-grained index control, and performance on large data volumes, so it is not as robust as Qdrant or Milvus for large-scale production systems.

In [35]:
!pip install -q chromadb

In [36]:
import chromadb
from chromadb.config import Settings
import time

chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection(
    name="documents"
)

In [37]:
# Define ids, documents, and metadatas from previously generated data
ids = [str(i) for i in range(len(embeddings))]
documents = passages  # Using 'passages' which contains the chunked texts
# Create metadatas from chunks_df, ensuring alignment with embeddings/passages
metadatas = chunks_df[['doc_id', 'chunk_id']].to_dict(orient='records')

# Split into smaller batches as ChromaDB has a batch size limit
batch_size = 5000  # A batch size smaller than 5461

for i in range(0, len(ids), batch_size):
    batch_ids = ids[i:i + batch_size]
    batch_documents = documents[i:i + batch_size]
    batch_embeddings = embeddings[i:i + batch_size]
    batch_metadatas = metadatas[i:i + batch_size]

    collection.add(
        ids=batch_ids,
        documents=batch_documents,
        embeddings=batch_embeddings,
        metadatas=batch_metadatas
    )
    print(f"Inserted batch {i//batch_size + 1} of {-(-len(ids) // batch_size)}")  # ceiling division



Inserted batch 1 of 16
Inserted batch 2 of 16
Inserted batch 3 of 16
Inserted batch 4 of 16
Inserted batch 5 of 16
Inserted batch 6 of 16
Inserted batch 7 of 16
Inserted batch 8 of 16
Inserted batch 9 of 16
Inserted batch 10 of 16
Inserted batch 11 of 16
Inserted batch 12 of 16
Inserted batch 13 of 16
Inserted batch 14 of 16
Inserted batch 15 of 16
Inserted batch 16 of 16


In [38]:
def chroma_search(query_embedding, k=5):
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k,
        include=["documents", "metadatas", "distances"]
    )

    output = []
    for i in range(len(results["ids"][0])):
        output.append({
            "id": results["ids"][0][i],
            "score": results["distances"][0][i],  # a distance (squared L2 by default): lower is more similar
            "text": results["documents"][0][i],
            "metadata": results["metadatas"][0][i] if results["metadatas"] else None
        })

    return output


In [39]:
results = chroma_search(query_embedding, k=5)

for r in results:
    print(r)


{'id': '0', 'score': 0.0, 'text': "passage: Anovo Anovo (formerly A Novo) is a computer services company based in Beauvais, France. It was founded in 1987, went public in 1999, and is currently a member of the CAC Small. It won in the category 'Service and Repair' of the Mobile News Awards four years in a row, from 2007 to 2010. As of November 2017, they have a score of 1.6 out of 10 on the TrustPilot ratings site, with 86% of reviewers giving the company the lowest possible rating.", 'metadata': {'chunk_id': 0, 'doc_id': 0}}
{'id': '874', 'score': 0.37844347953796387, 'text': 'passage: ASBIS ASBISC Enterprises PLC is a multinational corporate group that is engaged in distribution of IT-products (mobile devices, computer software and hardware) in Europe, Middle East and Africa (EMEA) emerging markets and is headquartered in Limassol (Cyprus). ASBIS distributes a wide range of A-branded finished products and IT components to assemblers, system integrators, local brands, retail and whole
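
The `score` values here are distances, not similarities, which is why the exact match scores `0.0`. Chroma collections use squared L2 distance unless configured otherwise, and for unit-length vectors squared L2 equals twice the cosine distance. The sketch below checks that identity on toy vectors and against the two values this notebook prints for id 874 (`0.3784...` from Chroma and `0.1892...` from the SQL cosine-distance query in the final part); the helper names are illustrative.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    # Inputs are assumed to be unit vectors, so the dot product is the cosine
    return 1.0 - sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([1.0, 2.0, 2.0])

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b) = 2 * cosine_distance
print(round(squared_l2(a, b), 6), round(2 * cosine_distance(a, b), 6))  # 0.533333 0.533333

# Values printed by the notebook for id 874
chroma_score = 0.37844347953796387
sql_cosine_distance = 0.18922150135040283
print(abs(chroma_score - 2 * sql_cosine_distance) < 1e-5)  # True
```

Because the two distances differ only by a constant factor on normalized embeddings, both engines produce the same ranking even though the reported numbers differ.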

## Part 7: SQL + vectors - PostgreSQL/pgvector (transparent vector search)

### Objective
Store embeddings in a table and run a SQL similarity query.

### What you must implement
1. Connect to a PostgreSQL database with `pgvector` enabled.
2. Create a table (e.g. `documents`) with:
   - `id` (PK)
   - `text` (text)
   - `embedding` (vector(D))
   - metadata (additional columns)
3. Insert all documents and embeddings.
4. Query Top-k by similarity, ordering by distance.

### Conceptual formula (what your SQL implements)
For a query `q`, you look for:
$$ \arg\min_{d \in D} \; \text{dist}(\vec{q}, \vec{d}) $$
where `dist` can be L2 or a cosine variant (depending on configuration).

### Deliverable
- A function `pgvector_search(query_embedding, k)` that runs SQL and returns:
  - id, score/distance, text, metadata

### Questions
**- How "explainable" does this approach seem compared with the others?**
SQL is more explainable and transparent because the query shows exactly how similarity is computed and how the results are ordered.

**- What advantages does the SQL world offer (JOIN, filters, aggregations)?**
SQL provides JOINs, complex filters, and native aggregations, which makes it easy to combine embeddings with structured data. Analyses and reports can therefore be built flexibly without leaving the SQL engine.

**- What scalability limitations do you expect compared with dedicated vector databases?**
SQL engines are not expected to scale as well as Qdrant or Milvus to millions of vectors. Without an ANN index the query is an exact scan over every row (pgvector does offer IVFFlat and HNSW indexes, but none is created here), so latency and resource usage grow with the data volume and the query load.
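
For reference, this is roughly what the table and Top-k query could look like on an actual PostgreSQL instance with pgvector. It is a sketch only: the statements are built as strings and not executed here, the column layout mirrors the exercise description, `768` assumes the E5 embedding size reported earlier, and `to_vector_literal` is a hypothetical helper. The `%(q)s` placeholders follow the psycopg2 named-parameter style.

```python
def to_vector_literal(values):
    """Format a sequence of numbers as the text form pgvector accepts, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


CREATE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY,
    text TEXT,
    embedding vector(768),
    metadata JSONB
);
"""

# Optional ANN index; without it the ORDER BY below is an exact sequential scan
INDEX_SQL = "CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);"

# <=> is pgvector's cosine-distance operator (smaller means more similar)
SEARCH_SQL = """
SELECT id, embedding <=> %(q)s::vector AS distance, text, metadata
FROM documents
ORDER BY distance
LIMIT %(k)s;
"""

# Parameters as they might be passed to cursor.execute(SEARCH_SQL, params)
params = {"q": to_vector_literal([0.1, 0.2, 0.3]), "k": 5}
print(params["q"])  # [0.1,0.2,0.3]
```

Passing the vector as a bound parameter rather than interpolating it into the SQL text keeps the query string constant and leaves quoting to the database driver.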

In [40]:
pip install psycopg2-binary numpy



In [41]:
import duckdb
import numpy as np
import json

# --------------------------------------------------
# 1. Configuration
# Note: this cell uses an in-memory DuckDB database as a stand-in for
# PostgreSQL/pgvector; the query has the same shape (order by distance, limit k).
# --------------------------------------------------
DIMENSION = len(embeddings[0])
con = duckdb.connect(":memory:")

# --------------------------------------------------
# 2. Create the table
# --------------------------------------------------
con.sql(f"""
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    text VARCHAR,
    embedding FLOAT[{DIMENSION}],
    metadata JSON
)
""")

# --------------------------------------------------
# 3. Insert documents + embeddings
# --------------------------------------------------
data_to_insert = []

for i in range(len(documents)):
    vec = embeddings[i].tolist()
    meta = json.dumps(metadatas[i]) if isinstance(metadatas[i], dict) else json.dumps({})
    txt = documents[i]
    data_to_insert.append((i, txt, vec, meta))

con.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?)",
    data_to_insert
)

print(f"Documents inserted: {len(data_to_insert)}")

# --------------------------------------------------
# 4. Deliverable function: pgvector_search
# --------------------------------------------------
def pgvector_search(query_embedding, k=5):
    """
    Runs a Top-k semantic search using SQL + vectors.
    Returns: id, score/distance, text, metadata
    """
    q_list = query_embedding.tolist() if hasattr(query_embedding, "tolist") else query_embedding

    results = con.sql(f"""
        SELECT
            id,
            (1.0 - list_cosine_similarity(embedding, {q_list}::FLOAT[{DIMENSION}])) AS distance,
            text,
            metadata
        FROM documents
        ORDER BY distance ASC
        LIMIT {k}
    """).fetchall()

    response = []
    for r in results:
        response.append({
            "id": r[0],
            "score": r[1],          # cosine distance (lower = more similar)
            "text": r[2],
            "metadata": json.loads(r[3])
        })

    return response

# --------------------------------------------------
# 5. Test query
# --------------------------------------------------
query_embedding = embeddings[0]  # or a newly generated one
results = pgvector_search(query_embedding, k=5)

for r in results:
    print(r)


Documents inserted: 79104
{'id': 0, 'score': 0.0, 'text': "passage: Anovo Anovo (formerly A Novo) is a computer services company based in Beauvais, France. It was founded in 1987, went public in 1999, and is currently a member of the CAC Small. It won in the category 'Service and Repair' of the Mobile News Awards four years in a row, from 2007 to 2010. As of November 2017, they have a score of 1.6 out of 10 on the TrustPilot ratings site, with 86% of reviewers giving the company the lowest possible rating.", 'metadata': {'doc_id': 0, 'chunk_id': 0}}
{'id': 874, 'score': 0.18922150135040283, 'text': 'passage: ASBIS ASBISC Enterprises PLC is a multinational corporate group that is engaged in distribution of IT-products (mobile devices, computer software and hardware) in Europe, Middle East and Africa (EMEA) emerging markets and is headquartered in Limassol (Cyprus). ASBIS distributes a wide range of A-branded finished products and IT components to assemblers, system integrators, local