# Exercise 7: Vector Databases

## Lab objective

Understand the concept of vector databases and learn to use the current tools.

## Part 0: Loading the Corpus

We will use the Kaggle API to access the dataset _Wikipedia Text Corpus for NLP and LLM Projects_.

The corpus is available at this [link](https://www.kaggle.com/datasets/gzdekzlkaya/wikipedia-text-corpus-for-nlp-and-llm-projects?utm_source=chatgpt.com).

### Activity

1. Load the corpus.


In [None]:
import kagglehub
from kagglehub import KaggleDatasetAdapter

In [None]:
# Set the path to the file you'd like to load
file_path = "wikipedia_text_corpus.csv"

# Load the latest version
df = kagglehub.dataset_load(
  KaggleDatasetAdapter.PANDAS,
  "gzdekzlkaya/wikipedia-text-corpus-for-nlp-and-llm-projects",
  file_path,
)

df.head()



Unnamed: 0.1,Unnamed: 0,text
0,1,Anovo\n\nAnovo (formerly A Novo) is a computer...
1,2,Battery indicator\n\nA battery indicator (also...
2,3,"Bob Pease\n\nRobert Allen Pease (August 22, 19..."
3,4,CAVNET\n\nCAVNET was a secure military forum w...
4,5,CLidar\n\nThe CLidar is a scientific instrumen...


## Part 1: Generating Embeddings

We will use E5 as the embedding model.

The E5 documentation is available at this [link](https://huggingface.co/intfloat/e5-base-v2).

### Activity

1. Normalize the corpus.
2. Define a `chunk_text` function and split the texts into _chunks_.
3. Generate one embedding per _chunk_.

In [None]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import re

df = df.dropna(subset=["text"]).reset_index(drop=True)

# Basic cleanup: collapse runs of whitespace into single spaces
def normalize_text(s: str) -> str:
    s = re.sub(r"\s+", " ", s).strip()
    return s

df["text_norm"] = df["text"].astype(str).map(normalize_text)

df.head()

Unnamed: 0.1,Unnamed: 0,text,text_norm
0,1,Anovo\n\nAnovo (formerly A Novo) is a computer...,Anovo Anovo (formerly A Novo) is a computer se...
1,2,Battery indicator\n\nA battery indicator (also...,Battery indicator A battery indicator (also kn...
2,3,"Bob Pease\n\nRobert Allen Pease (August 22, 19...","Bob Pease Robert Allen Pease (August 22, 1940Â..."
3,4,CAVNET\n\nCAVNET was a secure military forum w...,CAVNET CAVNET was a secure military forum whic...
4,5,CLidar\n\nThe CLidar is a scientific instrumen...,CLidar The CLidar is a scientific instrument u...


In [None]:
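def chunk_text(text: str, max_chars: int = 800, overlap: int = 100):
    """
    Character-based chunking.
    max_chars of roughly 600-1000 usually works well.
    overlap helps avoid cutting ideas in half.
    """
    if not 0 <= overlap < max_chars:
        # Otherwise `start` would never advance and the loop would not terminate
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    chunks = []
    start = 0
    n = len(text)
    while start < n:
        end = min(start + max_chars, n)
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end == n:
            break
        # Step back by `overlap` characters so consecutive chunks share context
        start = end - overlap
    return chunks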

records = []
for i, row in df.iterrows():
    chunks = chunk_text(row["text_norm"], max_chars=800, overlap=100)
    for j, ch in enumerate(chunks):
        records.append({
            "doc_id": int(i),
            "chunk_id": j,
            "text": ch
        })

chunks_df = pd.DataFrame(records)
chunks_df.head(), len(chunks_df)

(   doc_id  chunk_id                                               text
 0       0         0  Anovo Anovo (formerly A Novo) is a computer se...
 1       1         0  Battery indicator A battery indicator (also kn...
 2       1         1  ad battery when in reality it indicates a prob...
 3       1         2  s that an internal standby battery needs repla...
 4       1         3  increase; in many cases the EMF remains more o...,
 79104)
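
To see how `max_chars` and `overlap` interact, the sketch below reproduces only the offset arithmetic of the sliding window, without touching any text. `chunk_spans` and `expected_chunks` are hypothetical helper names introduced here for illustration; they are not part of the notebook. The sketch assumes `overlap < max_chars` and ignores the `strip()` step, which can only drop empty chunks.

```python
def chunk_spans(n: int, max_chars: int = 800, overlap: int = 100):
    """Return the (start, end) offsets a sliding window cuts from a text of length n."""
    spans = []
    start = 0
    while start < n:
        end = min(start + max_chars, n)
        spans.append((start, end))
        if end == n:
            break
        start = end - overlap  # the window advances by max_chars - overlap
    return spans


def expected_chunks(n: int, max_chars: int = 800, overlap: int = 100) -> int:
    """Closed-form chunk count for the same window (assumes overlap < max_chars)."""
    if n <= 0:
        return 0
    if n <= max_chars:
        return 1
    stride = max_chars - overlap
    # Ceiling division of the remaining length by the stride, plus the first chunk
    return -(-(n - max_chars) // stride) + 1


spans = chunk_spans(2000)
print(spans)  # [(0, 800), (700, 1500), (1400, 2000)]
print(len(spans) == expected_chunks(2000))
```

With the defaults, each chunk after the first contributes 700 new characters, so a 2,000-character article yields three chunks.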

In [None]:
from sentence_transformers import SentenceTransformer

MODEL_NAME = "intfloat/e5-base-v2"   # recommended for retrieval
model = SentenceTransformer(MODEL_NAME)



In [None]:
# Texts to index (passages); E5 expects the "passage: " prefix on indexed text
passages = ["passage: " + t for t in chunks_df["text"].tolist()]


In [None]:
# Embeddings (N x D)
# normalize_embeddings=True is needed so that the dot product equals cosine similarity
embeddings = model.encode(
    passages,
    batch_size=16,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True
).astype("float32")


### Saving the Embeddings

In [None]:
import numpy as np

embeddings_file_path = "wikipedia_embeddings_e5.npy"
np.save(embeddings_file_path, embeddings)
print(f"Embeddings saved to: {embeddings_file_path}")

Embeddings saved to: wikipedia_embeddings_e5.npy


### Loading the Embeddings

In [None]:
import numpy as np

embeddings_file_path = "wikipedia_embeddings_e5.npy"

# Load the embeddings array
embeddings = np.load(embeddings_file_path)

In [None]:
print(embeddings.shape, embeddings.dtype)

(79104, 768) float32
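
As a quick sanity check on resource usage, the shape and dtype printed above are enough to estimate how much memory the raw embedding matrix occupies and how many batches `model.encode` processes with `batch_size=16`. This is plain arithmetic on the numbers shown in the notebook; it ignores the small `.npy` header and any extra copies (for example the Python lists built later for the vector databases).

```python
n_vectors, dim = 79104, 768   # shape printed above
bytes_per_float32 = 4
batch_size = 16               # value passed to model.encode

total_bytes = n_vectors * dim * bytes_per_float32
n_batches = -(-n_vectors // batch_size)  # ceiling division

print(f"{total_bytes:,} bytes = {total_bytes / 2**20:.2f} MiB")
print(f"{n_batches} encoding batches")
```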


In [None]:
def embed_query(query: str) -> np.ndarray:
    q = "query: " + query
    vec = model.encode(
        [q],
        convert_to_numpy=True,
        normalize_embeddings=True
    ).astype("float32")
    return vec

query_text = "Battery measuring"

query_embedding = embed_query(query_text)
query_embedding.shape

(1, 768)

## Part 2: FAISS

FAISS is a library for efficient similarity search and clustering of dense vectors.

The FAISS documentation is available at this [link](https://faiss.ai/index.html).

### Activity

1. Create a FAISS index.
2. Load the embeddings.
3. Run a search from a _query_.

In [None]:
!pip install faiss-cpu



In [None]:
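# Baseline FAISS code
import faiss
import numpy as np

# Assumes `embeddings` is an N x D float32 array
index = faiss.IndexFlatL2(embeddings.shape[1])  # exact (brute-force) L2 index
index.add(embeddings)

# D holds squared L2 distances, I the row indices of the nearest chunks
D, I = index.search(query_embedding, k=10)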

In [None]:
print(chunks_df["text"].iloc[I[0].tolist()])

10176    Battery tester A battery tester is an electron...
1        Battery indicator A battery indicator (also kn...
10177    ing procedure, according to the type of batter...
37406    ils. One was connected via a series resistor t...
71872    is achieved. Accepted average float voltages f...
37409    shorting the measurement points together and p...
10481    Current sense monitor A Current Sense Monitor ...
5        otective diodes cannot be used, a battery will...
75249    s markings that match the height of a typical ...
47064    Battery management system A battery management...
Name: text, dtype: object
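
The index above ranks by L2 distance even though the embeddings were normalized for cosine similarity. For unit vectors the two agree, because ‖a − b‖² = 2 − 2·(a · b), so a smaller squared distance always means a larger cosine. The NumPy sketch below checks this on random unit vectors; the toy arrays `vectors` and `query` are stand-ins for `embeddings` and `query_embedding`, not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for `embeddings` and `query_embedding`: random unit vectors
vectors = rng.normal(size=(1000, 64))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query = rng.normal(size=64)
query /= np.linalg.norm(query)

cosine = vectors @ query                        # dot product of unit vectors
sq_l2 = np.sum((vectors - query) ** 2, axis=1)  # squared Euclidean distance

top_by_cosine = np.argsort(-cosine)[:10]  # largest similarity first
top_by_l2 = np.argsort(sq_l2)[:10]        # smallest distance first

print(np.allclose(sq_l2, 2.0 - 2.0 * cosine))
print(np.array_equal(top_by_cosine, top_by_l2))
```

If the similarity scores themselves are wanted rather than distances, an inner-product index such as `faiss.IndexFlatIP` over the same normalized vectors is the usual alternative.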


## Part 3: Vector DB #1, Qdrant (vector search + metadata)

### Objective
Recreate the same flow as with FAISS, but using a vector database with native support for **metadata** and filters.

### What you must implement
1. Start or connect to a Qdrant instance.
2. Create a collection with:
   - dimension `D` (that of your embeddings)
   - metric (cosine or L2)
3. Insert:
   - `id`
   - `embedding`
   - `payload` (metadata: text, title, tags, etc.)
4. Query the Top-k by similarity:
   - `query_embedding`
   - `k`

### Expected inputs (already defined earlier in the notebook)
- `embeddings`: `N x D` matrix (float32)
- `texts`: list of `N` strings (here, `chunks_df["text"]`)
- `metadatas`: list of `N` dicts (optional; here, `doc_id` and `chunk_id`)
- `query_text`: string
- `query_embedding`: `1 x D` vector

### Deliverable
- A function `qdrant_search(query_embedding, k)` that returns:
  - a list of `(id, score, text, metadata)`
- An example query with `k=5` and its output.

### Questions
- Was the metric cosine or L2? Why?
- How easy was it to filter by metadata compared with FAISS?
- What happens to response time as you increase `k`?
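
For the comparison with FAISS, it helps to remember that a flat FAISS index returns only row indices and distances, so any metadata condition has to be applied afterwards in Python, typically by fetching more than `k` candidates and discarding the ones that fail. The sketch below illustrates that post-filtering step on toy data; `post_filter` is a hypothetical helper, and the IDs, scores, and metadata are invented for the example.

```python
def post_filter(ids, scores, metadata, predicate, k):
    """Keep the first k hits (already sorted by distance) whose metadata passes `predicate`."""
    kept = []
    for idx, score in zip(ids, scores):
        if predicate(metadata[idx]):
            kept.append((idx, score))
            if len(kept) == k:
                break
    return kept


# Toy candidates, sorted by ascending distance as a search would return them
metadata = {
    10: {"doc_id": 1},
    11: {"doc_id": 2},
    12: {"doc_id": 1},
    13: {"doc_id": 3},
    14: {"doc_id": 1},
}
ids = [11, 10, 13, 12, 14]
scores = [0.10, 0.12, 0.20, 0.25, 0.30]

hits = post_filter(ids, scores, metadata, lambda m: m["doc_id"] == 1, k=2)
print(hits)  # [(10, 0.12), (12, 0.25)]
```

The weakness of this approach is that the over-fetch size is a guess: if too few candidates pass the filter, fewer than `k` results come back. A database with payload filtering can apply the condition during the search instead.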


In [None]:
!pip install -U qdrant-client



In [None]:
from qdrant_client import models, QdrantClient

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="wikipedia",
    vectors_config=models.VectorParams(
        size=model.get_sentence_embedding_dimension(),  # Vector size is defined by used model
        distance=models.Distance.COSINE,
    ),
)

True

In [None]:
client.upload_points(
    collection_name="wikipedia",
    points=[
        models.PointStruct(
            id=idx,
            vector=embeddings[idx].tolist(),  # embedding that belongs to this chunk
            payload=row.to_dict()             # DataFrame row converted to the payload dict
        )
        for idx, row in chunks_df.iterrows()  # iterate over the DataFrame rows
    ],
)



In [None]:
hits = client.query_points(
    collection_name="wikipedia",
    query=query_embedding[0].tolist(),
    limit=10,
).points

for hit in hits:
    print(hit.payload, "score:", hit.score)

{'doc_id': 1391, 'chunk_id': 0, 'text': "Battery tester A battery tester is an electronic device intended for testing the state of an electric battery, going from a simple device for testing the charge actually present in the cells and/or its voltage output, to a more comprehensive testing of the battery's condition, namely its capacity for accumulating charge and any possible flaws affecting the battery's performance and security. The most simple battery tester is a DC ammeter, that indicates the battery's charge rate. DC voltmeters can be used to estimate the charge rate of a battery, provided that its nominal voltage is known. There are many types of integrated battery testers, each one corresponding to a specific condition testing procedure, according to the type of battery being tested, such as the â€œ421â€\x9d test for lead-ac"} score: 0.8703485972057368
{'doc_id': 1, 'chunk_id': 0, 'text': "Battery indicator A battery indicator (also known as a battery gauge) is a device which g

Definition of the `qdrant_search(query_embedding, k)` function

In [None]:
def qdrant_search(query_embedding, k):
  hits = client.query_points(
    collection_name="wikipedia",
    query=query_embedding[0].tolist(),
    limit=k,
  ).points
  return hits
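
The deliverable asks for a list of `(id, score, text, metadata)` tuples, whereas the function above returns the client's hit objects unchanged. A minimal sketch of the conversion follows, assuming each hit exposes `id`, `score`, and `payload` attributes as in the printed output. `hits_to_tuples` is a hypothetical helper, and the `SimpleNamespace` object merely stands in for a real hit so the example runs without a Qdrant instance.

```python
from types import SimpleNamespace


def hits_to_tuples(hits):
    """Convert hit objects into (id, score, text, metadata) tuples."""
    rows = []
    for hit in hits:
        payload = dict(hit.payload or {})  # copy so the original hit is not modified
        text = payload.pop("text", None)   # whatever remains is treated as metadata
        rows.append((hit.id, hit.score, text, payload))
    return rows


# Stand-in for a search hit, with the same attribute names used above
fake_hits = [
    SimpleNamespace(
        id=1,
        score=0.86,
        payload={"doc_id": 1, "chunk_id": 0, "text": "Battery indicator ..."},
    )
]
print(hits_to_tuples(fake_hits))
```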

In [None]:
query_text = "multidimensional space"
query_embedding = embed_query(query_text)
hits = qdrant_search(query_embedding, k=5)
for hit in hits:
    print(hit.payload, "score:", hit.score)

{'doc_id': 1478, 'chunk_id': 0, 'text': 'Space (architecture) Space is one of the elements of design of architecture, as space is continuously studied for its usage. Architectural designs are created by carving space out of space, creating space out of space, and designing spaces by dividing this space using various tools, such as geometry, colours, and shapes.It is an undefined expanse of land given to an architect to define.'} score: 0.8201016726756092
{'doc_id': 9442, 'chunk_id': 12, 'text': 'y "Âµ", which is the ratio of the permeability of a specific medium to the permeability of free space "Âµ": where "Âµ"Â = 4"Ï€"Â Ã—Â 10Â NÂ A.'} score: 0.8118793176495207
{'doc_id': 4967, 'chunk_id': 0, 'text': 'List of topics in space List of Topics in Space; topics as related to outer space.'} score: 0.805716111562019
{'doc_id': 2633, 'chunk_id': 3, 'text': 'ulation or trilateration) caused by reasons such as multipath. That is, although each input is biased in some way, the observation from 

### Freeing RAM

In [None]:
import gc

# Delete the original DataFrame
if 'df' in locals():
    del df

# Delete the list of passages
if 'passages' in locals():
    del passages

# Force Python's garbage collector to run
gc.collect()

71

Delete the Qdrant collection

In [None]:
collection_name = "wikipedia"

# Check that the collection exists before trying to delete it
if client.collection_exists(collection_name):
    client.delete_collection(collection_name=collection_name)

## Part 4: Vector DB #2, Milvus (ANN indexing and scalability)

### Objective
Implement the indexing + search flow with a vector database designed for scalability.

### What you must implement
1. Connect to Milvus.
2. Create a schema (collection) with:
   - an `id` field (integer or string)
   - an `embedding` field (vector of dimension `D`)
   - metadata fields (e.g., `category`, `source`, `title`)
3. Insert the `N` embeddings.
4. Create or select an ANN index (e.g., HNSW or IVF).
5. Run Top-k queries and retrieve the associated texts.

### Teaching recommendation
Set up two configurations:
- **Exact search** (if applicable) or a “more accurate” configuration
- **ANN search** (a “faster” configuration)

Then compare:
- query time
- overlap of results (how many IDs match)
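
One simple way to quantify the overlap is the fraction of exact-search IDs that also appear in the ANN result. The sketch below uses a hypothetical helper named `overlap_at_k`. The first ID list is the top five returned by the flat FAISS index earlier in the notebook; the second list is invented for illustration and is not a measured ANN result.

```python
def overlap_at_k(reference_ids, candidate_ids):
    """Fraction of reference IDs that also appear among the candidate IDs."""
    reference = set(reference_ids)
    if not reference:
        return 0.0
    return len(reference & set(candidate_ids)) / len(reference)


exact_ids = [10176, 1, 10177, 37406, 71872]  # top-5 from the flat FAISS index above
ann_ids = [10176, 1, 10177, 71872, 47064]    # hypothetical ANN output, for illustration only

print(overlap_at_k(exact_ids, ann_ids))  # 0.8
```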

### Deliverable
- A function `milvus_search(query_embedding, k)` that returns results.
- A mini experiment: `k=5` and `k=20` (timings and results).

### Questions
- Which index or search-control parameters did you tune for accuracy vs. speed?
- What evidence do you have that ANN changes the results (even slightly)?
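
For the timing part of the mini experiment, a small harness that repeats each query and reports the median is usually more reliable than a single measurement. The sketch below is one possible shape for it: `time_search` is a hypothetical helper that accepts any function with the signature `search_fn(query, k)`, and `dummy_search` is a toy stand-in so the example runs without a database.

```python
import statistics
import time


def time_search(search_fn, query, ks=(5, 20), repeats=20):
    """Return the median latency in milliseconds of search_fn(query, k) for each k."""
    results = {}
    for k in ks:
        samples = []
        for _ in range(repeats):
            t0 = time.perf_counter()
            search_fn(query, k)
            samples.append(time.perf_counter() - t0)
        results[k] = statistics.median(samples) * 1000.0
    return results


def dummy_search(query, k):
    """Toy stand-in: the k integers in 0..9999 closest to `query`."""
    return sorted(range(10_000), key=lambda i: abs(i - query))[:k]


timings = time_search(dummy_search, query=1234)
for k, ms in timings.items():
    print(f"k={k}: {ms:.3f} ms (median of 20 runs)")
```

In the notebook, a call such as `time_search(milvus_search, query_embedding.tolist())` should work with the function defined in this part, since it takes the same two arguments.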


In [None]:
!pip install -U pymilvus



In [None]:
!pip install pymilvus[milvus_lite]



In [None]:
from pymilvus import MilvusClient

client = MilvusClient("milvus_demo.db")

In [None]:
COLLECTION_NAME = "wikipedia"

client.create_collection(
    collection_name=COLLECTION_NAME,
    dimension=embeddings.shape[1],
    primary_field_name="id",
    vector_field_name="embedding",
    auto_id=False,
)

In [None]:
milvus_data = []

for i, row in chunks_df.iterrows():
    milvus_data.append({
        "id": int(i),
        "embedding": embeddings[i].tolist(),
        "text": row["text"],
        "doc_id": int(row["doc_id"]),
        "chunk_id": int(row["chunk_id"]),
    })

In [None]:
batch_size = 5000

for i in range(0, len(milvus_data), batch_size):
    batch = milvus_data[i:i + batch_size]
    client.insert(collection_name=COLLECTION_NAME, data=batch)
    print(f"Inserted {len(batch)} points. Total inserted: {i + len(batch)}")

Inserted 5000 points. Total inserted: 5000
Inserted 5000 points. Total inserted: 10000
Inserted 5000 points. Total inserted: 15000
Inserted 5000 points. Total inserted: 20000
Inserted 5000 points. Total inserted: 25000
Inserted 5000 points. Total inserted: 30000
Inserted 5000 points. Total inserted: 35000
Inserted 5000 points. Total inserted: 40000
Inserted 5000 points. Total inserted: 45000
Inserted 5000 points. Total inserted: 50000
Inserted 5000 points. Total inserted: 55000
Inserted 5000 points. Total inserted: 60000
Inserted 5000 points. Total inserted: 65000
Inserted 5000 points. Total inserted: 70000
Inserted 5000 points. Total inserted: 75000
Inserted 4104 points. Total inserted: 79104


In [None]:
def milvus_search(query_embedding, k):
  res = client.search(
      collection_name=COLLECTION_NAME,  # target collection
      data=query_embedding,  # query vectors: must be a list of vectors
      limit=k,  # number of returned entities
      output_fields=["doc_id", "chunk_id", "text"],  # specifies fields to be returned
  )
  return res

In [None]:
query_text = "Battery measuring"
query_embedding = embed_query(query_text)

res = milvus_search(query_embedding.tolist(), k=10)
for hit in res:
  for hitt in hit:
    print(hitt)

{'id': 10176, 'distance': 0.8703486919403076, 'entity': {'text': "Battery tester A battery tester is an electronic device intended for testing the state of an electric battery, going from a simple device for testing the charge actually present in the cells and/or its voltage output, to a more comprehensive testing of the battery's condition, namely its capacity for accumulating charge and any possible flaws affecting the battery's performance and security. The most simple battery tester is a DC ammeter, that indicates the battery's charge rate. DC voltmeters can be used to estimate the charge rate of a battery, provided that its nominal voltage is known. There are many types of integrated battery testers, each one corresponding to a specific condition testing procedure, according to the type of battery being tested, such as the â€œ421â€\x9d test for lead-ac", 'doc_id': 1391, 'chunk_id': 0}}
{'id': 1, 'distance': 0.8618004322052002, 'entity': {'text': "Battery indicator A battery indica

## Part 5: Vector DB #3, Weaviate (semantic search with a schema)

### Objective
Set up a collection with a schema (class) and run Top-k semantic searches, optionally with filters.

### What you must implement
1. Connect to Weaviate.
2. Define a schema:
   - Class/collection (for example `Document`)
   - Properties: `text`, `title`, `category`, etc.
   - Associated vector (embedding)
3. Insert objects with:
   - properties + vector
4. Query by similarity (Top-k) with `query_embedding`.
5. (Optional) Add a filter on a property (metadata).

### Recommendation
Make sure you store the original `text` and at least one metadata field so you can test filtering.

### Deliverable
- A function `weaviate_search(query_embedding, k)` that returns:
  - id, score, text, metadata

### Questions
- What conceptual difference do you see between “schema + objects” and “table + rows”?
- How would you describe the trade-off between complexity and expressiveness?


In [None]:
!pip install -U weaviate-client[agents]



In [None]:
import weaviate
from weaviate.exceptions import WeaviateStartUpError

try:
    # Start an embedded Weaviate instance and connect to it
    client = weaviate.connect_to_embedded()
except WeaviateStartUpError:
    # Re-running this cell fails because the embedded instance from the previous
    # run is still listening on ports 8079 (HTTP) and 50060 (gRPC); reuse it.
    client = weaviate.connect_to_local(port=8079, grpc_port=50060)

In [None]:
import weaviate.classes as wvc

# Assumes a recent weaviate-client (4.16 or later), where `vector_config` expects a
# `Configure.Vectors` entry; passing `Configure.Vectorizer.none()` there is rejected
# with a WeaviateInvalidInputError.
client.collections.create(
    name="WikipediaChunk",
    properties=[
        wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT),
        wvc.config.Property(name="doc_id", data_type=wvc.config.DataType.INT),
        wvc.config.Property(name="chunk_id", data_type=wvc.config.DataType.INT)
    ],
    # Vectors are supplied by us (E5 embeddings), so no server-side vectorizer is configured
    vector_config=wvc.config.Configure.Vectors.self_provided(
        vector_index_config=wvc.config.Configure.VectorIndex.hnsw(
            distance_metric=wvc.config.VectorDistances.COSINE
        )
    ),
)

## Part 6: Vector Store #4, Chroma (rapid prototyping)

### Objective
Implement the same indexing and semantic search idea with a lightweight prototyping tool.

### What you must implement
1. Create a collection.
2. Insert:
   - ids
   - embeddings
   - documents (text)
   - metadatas (optional)
3. Query the Top-k with `query_embedding`.

### Teaching note
Chroma is useful for prototypes: focus on reproducing the pipeline without “heavy infrastructure”.

### Deliverable
- A function `chroma_search(query_embedding, k)` that returns results.
- One query with `k=5`.

### Questions
- How easy was it to implement everything compared with Qdrant/Milvus?
- What limitations do you see for a production system?


In [None]:
!pip install -U chromadb


In [None]:
import chromadb

client = chromadb.Client()
print("ChromaDB client initialized in-memory.")

ChromaDB client initialized in-memory.


In [None]:
collection_name = "wikipedia_chunks_chroma"

# Get or create the collection to avoid 'already exists' error
collection = client.get_or_create_collection(name=collection_name)
print(f"Collection '{collection_name}' is ready.")

# Prepare data for insertion
ids = [str(i) for i in chunks_df.index.tolist()]
documents = chunks_df["text"].tolist()

metadatas = []
for i, row in chunks_df.iterrows():
    metadatas.append({"doc_id": int(row["doc_id"]), "chunk_id": int(row["chunk_id"])}) # Ensure int type for metadata

batch_size = 5000 # Using a batch size smaller than the reported max batch size

for i in range(0, len(ids), batch_size):
    batch_ids = ids[i:i + batch_size]
    batch_embeddings = embeddings[i:i + batch_size].tolist()
    batch_documents = documents[i:i + batch_size]
    batch_metadatas = metadatas[i:i + batch_size]

    collection.add(
        embeddings=batch_embeddings,
        documents=batch_documents,
        metadatas=batch_metadatas,
        ids=batch_ids
    )
    print(f"Inserted {len(batch_ids)} documents. Total inserted: {i + len(batch_ids)}")

print(f"Finished inserting {len(ids)} documents into '{collection_name}' collection.")

Collection 'wikipedia_chunks_chroma' is ready.
Inserted 5000 documents. Total inserted: 5000
Inserted 5000 documents. Total inserted: 10000
Inserted 5000 documents. Total inserted: 15000
Inserted 5000 documents. Total inserted: 20000
Inserted 5000 documents. Total inserted: 25000
Inserted 5000 documents. Total inserted: 30000
Inserted 5000 documents. Total inserted: 35000
Inserted 5000 documents. Total inserted: 40000
Inserted 5000 documents. Total inserted: 45000
Inserted 5000 documents. Total inserted: 50000
Inserted 5000 documents. Total inserted: 55000
Inserted 5000 documents. Total inserted: 60000
Inserted 5000 documents. Total inserted: 65000
Inserted 5000 documents. Total inserted: 70000
Inserted 5000 documents. Total inserted: 75000
Inserted 4104 documents. Total inserted: 79104
Finished inserting 79104 documents into 'wikipedia_chunks_chroma' collection.


In [None]:
def chroma_search(query_embedding, k):
    results = collection.query(
        query_embeddings=query_embedding.tolist(),
        n_results=k,
        include=['documents', 'distances', 'metadatas']
    )
    return results

In [None]:
query_text = "Battery measuring"
query_embedding = embed_query(query_text)

hits = chroma_search(query_embedding, k=5)

# Print the results in a readable format
print("ChromaDB Search Results (k=5) for query: 'Battery measuring'")
for i in range(len(hits['ids'][0])):
    print(f"---\nID: {hits['ids'][0][i]}\nScore (distance): {hits['distances'][0][i]}\nText: {hits['documents'][0][i]}\nMetadata: {hits['metadatas'][0][i]}")

ChromaDB Search Results (k=5) for query: 'Battery measuring'
---
ID: 10176
Score (distance): 0.2593029737472534
Text: Battery tester A battery tester is an electronic device intended for testing the state of an electric battery, going from a simple device for testing the charge actually present in the cells and/or its voltage output, to a more comprehensive testing of the battery's condition, namely its capacity for accumulating charge and any possible flaws affecting the battery's performance and security. The most simple battery tester is a DC ammeter, that indicates the battery's charge rate. DC voltmeters can be used to estimate the charge rate of a battery, provided that its nominal voltage is known. There are many types of integrated battery testers, each one corresponding to a specific condition testing procedure, according to the type of battery being tested, such as the â€œ421â€ test for lead-ac
Metadata: {'chunk_id': 0, 'doc_id': 1391}
---
ID: 1
Score (distance): 0.276399254
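
The values printed here are distances (smaller is better), while Qdrant and Milvus reported cosine similarities (larger is better) for the same query. The numbers are consistent with a squared L2 distance over unit vectors, for which cosine = 1 − d / 2. The sketch below checks this against the scores printed earlier; treating the collection's metric as squared L2 is an assumption inferred from the outputs, since the notebook never sets it explicitly, and `squared_l2_to_cosine` is a hypothetical helper.

```python
def squared_l2_to_cosine(distance: float) -> float:
    """Convert a squared L2 distance between unit vectors into a cosine similarity."""
    return 1.0 - distance / 2.0


chroma_distance = 0.2593029737472534  # top hit (ID 10176) reported by Chroma above
qdrant_score = 0.8703485972057368     # cosine score Qdrant reported for the same chunk

converted = squared_l2_to_cosine(chroma_distance)
print(converted)
print(abs(converted - qdrant_score) < 1e-6)
```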

## Part 7: SQL + vectors, PostgreSQL/pgvector (transparent vector search)

### Objective
Store embeddings in a table and run a SQL similarity query.

### What you must implement
1. Connect to a PostgreSQL database with `pgvector` enabled.
2. Create a table (e.g., `documents`) with:
   - `id` (PK)
   - `text` (text)
   - `embedding` (vector(D))
   - metadata (additional columns)
3. Insert all documents and embeddings.
4. Query the Top-k by similarity, ordering by distance.

### Conceptual formula (what your SQL implements)
For a query `q`, you are looking for:
$$ \underset{d \in D}{\arg\min} \; \text{dist}(\vec{q}, \vec{d}) $$
where `dist` can be L2 or a cosine variant (depending on configuration).

### Deliverable
- A function `pgvector_search(query_embedding, k)` that runs SQL and returns:
  - id, score/distance, text, metadata

### Questions
- How “explainable” does this approach seem to you compared with the others?
- What advantages does the SQL world offer (JOINs, filters, aggregations)?
- What scalability limitations do you expect compared with dedicated vector databases?
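
As a starting point for the deliverable, the sketch below shows one way the Top-k query could be written. It relies on pgvector's `<=>` operator (cosine distance; `<->` is L2) and on the text form `[x1,x2,...]` that can be cast to `vector`. The table name `documents` comes from the task description; the columns `doc_id` and `chunk_id`, the extra `cursor` parameter, and the helper names are assumptions made for this example. `RecordingCursor` is a stand-in so the code runs without a PostgreSQL server.

```python
def to_pgvector_literal(vec):
    """Format a sequence of numbers as pgvector text input, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


# Assumed schema: documents(id, text, embedding vector(D), doc_id, chunk_id)
SEARCH_SQL = """
    SELECT id,
           embedding <=> %s::vector AS distance,
           text, doc_id, chunk_id
    FROM documents
    ORDER BY embedding <=> %s::vector
    LIMIT %s;
"""


def pgvector_search(cursor, query_vector, k):
    """Run the Top-k query on a DB-API cursor; query_vector is a flat sequence of floats."""
    literal = to_pgvector_literal(query_vector)
    cursor.execute(SEARCH_SQL, (literal, literal, k))
    return cursor.fetchall()


class RecordingCursor:
    """Stand-in cursor that records the call instead of contacting a database."""

    def __init__(self):
        self.calls = []

    def execute(self, sql, params):
        self.calls.append((sql, params))

    def fetchall(self):
        return []


cursor = RecordingCursor()
rows = pgvector_search(cursor, [0.1, 0.2, 0.3], k=5)
print(cursor.calls[0][1])  # ('[0.1,0.2,0.3]', '[0.1,0.2,0.3]', 5)
```

With the notebook's variables, the query vector would be `query_embedding[0]`, because `embed_query` returns an array of shape `(1, D)`.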


In [None]:
!pip install psycopg2-binary



In [None]:
import psycopg2

DB_NAME = "mydatabase"
DB_USER = "myuser"
DB_PASSWORD = "mypassword"
DB_HOST = "localhost"
DB_PORT = "5432"

conn = None
cursor = None

try:
    conn = psycopg2.connect(
        dbname=DB_NAME,
        user=DB_USER,
        password=DB_PASSWORD,
        host=DB_HOST,
        port=DB_PORT
    )
    cursor = conn.cursor()

    cursor.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    conn.commit()

finally:
    if cursor:
        cursor.close()
    if conn:
        conn.close()


OperationalError: connection to server at "localhost" (::1), port 5432 failed: Connection refused
	Is the server running on that host and accepting TCP/IP connections?
connection to server at "localhost" (127.0.0.1), port 5432 failed: Connection refused
	Is the server running on that host and accepting TCP/IP connections?
