# Exercise 6: Dense Retrieval and Introduction to FAISS

## Objective

Generate embeddings with sentence-transformers (SBERT, E5) and index documents with FAISS.

In [1]:
# Import libraries
import numpy as np
import pandas as pd

In [2]:
# --- CLEANUP ---
#!pip uninstall -y protobuf

# --- INSTALL A COMPATIBLE VERSION (avoids the GetPrototype error) ---
#!pip install protobuf==4.21.12

# --- INSTALL sentence-transformers if it is missing ---
#!pip install sentence-transformers

# --- UPGRADE transformers (if it is outdated on Kaggle) ---
#!pip install --upgrade transformers


## Part 0: Loading the Corpus
### Activity

1. Load the 20 Newsgroups corpus with `sklearn.datasets.fetch_20newsgroups`.
2. Limit the corpus to the first 2000 documents to keep processing manageable.

In [4]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
docs = newsgroups.data[:2000]

print(len(docs))

2000


## Part 2: Generating Embeddings
### Activity

1. Use two sentence-transformers models. You can use `'all-MiniLM-L6-v2'` (SBERT) or `'intfloat/e5-base'` (E5). When using E5, prepend `"passage: "` to each document before encoding.
2. Generate the embedding vectors for all documents with the selected model.
3. Save the embeddings in a NumPy array for later indexing.
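
The E5 prefix convention from step 1 can be wrapped in a small helper so that documents and queries always get the right prefix. This is only an illustrative sketch: `add_e5_prefix` is a hypothetical name (it is not part of sentence-transformers), and the sample texts are made up.

```python
def add_e5_prefix(texts, kind="passage"):
    """Prepend the E5 role prefix ("passage: " or "query: ") to each text.

    Hypothetical helper for illustration; not part of sentence-transformers.
    """
    if kind not in ("passage", "query"):
        raise ValueError(f"kind must be 'passage' or 'query', got {kind!r}")
    return [f"{kind}: {text}" for text in texts]


# Made-up sample texts, standing in for the corpus and a user query
sample_docs = ["The shuttle launch was delayed.", "Change the oil every 5000 miles."]
prefixed_docs = add_e5_prefix(sample_docs, kind="passage")
prefixed_query = add_e5_prefix(["space exploration"], kind="query")

print(prefixed_docs[0])
print(prefixed_query[0])
```

The prefixed lists would then be passed to the E5 model's `encode` call in place of the raw texts. SBERT models such as `all-MiniLM-L6-v2` take the raw texts unchanged.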

In [5]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Choose the models
model_sbert = SentenceTransformer('all-MiniLM-L6-v2')
model_e5 = SentenceTransformer('intfloat/e5-base')




AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'



In [6]:
# Load the models on CPU; embeddings are generated in the next cells
from sentence_transformers import SentenceTransformer

# SBERT model
model_sbert = SentenceTransformer('all-MiniLM-L6-v2', device='cpu')

# E5 model
model_e5 = SentenceTransformer('intfloat/e5-base', device='cpu')

print("Models loaded.")


Models loaded.


In [7]:
embeddings_sbert = model_sbert.encode(
    docs,
    convert_to_numpy=True,
    show_progress_bar=True,
    batch_size=64
)

embeddings_sbert.shape


(2000, 384)

In [8]:
# E5 expects the prefix "passage: " before each document
docs_e5 = [f"passage: {d}" for d in docs]

embeddings_e5 = model_e5.encode(
    docs_e5,
    convert_to_numpy=True,
    show_progress_bar=True,
    batch_size=32
)

embeddings_e5.shape


(2000, 768)

In [9]:
import numpy as np
import os

os.makedirs("embeddings", exist_ok=True)

np.save("embeddings/sbert_embeddings.npy", embeddings_sbert)
np.save("embeddings/e5_embeddings.npy", embeddings_e5)

print("Files saved in the 'embeddings/' folder")


Files saved in the 'embeddings/' folder
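
When saved embeddings are reloaded later, it may help to confirm that the matrix still has one row per document, because row `i` is assumed to correspond to document `i`. The sketch below shows one possible check. `load_embeddings` is a hypothetical helper, and the data is random stand-in data written to a temporary folder, not real model output.

```python
import os
import tempfile

import numpy as np


def load_embeddings(path, expected_rows):
    """Load a .npy embedding matrix and check it has one row per document."""
    embeddings = np.load(path)
    if embeddings.ndim != 2 or embeddings.shape[0] != expected_rows:
        raise ValueError(
            f"expected a 2-D array with {expected_rows} rows, got shape {embeddings.shape}"
        )
    return embeddings


# Round trip with random stand-in data (not real model output)
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(10, 384)).astype(np.float32)

tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, "sbert_embeddings.npy")
np.save(tmp_path, fake_embeddings)

restored = load_embeddings(tmp_path, expected_rows=10)
print(restored.shape, restored.dtype)
```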


In [10]:
model_sbert = SentenceTransformer('all-MiniLM-L6-v2', device='cpu')
model_e5 = SentenceTransformer('intfloat/e5-base', device='cpu')

## Part 3: Query
### Activity

1. Write a natural-language query. Examples:

    * "God, religion, and spirituality"
    * "space exploration"
    * "car maintenance"

2. Encode the query with the same embedding model. When using E5, prepend `"query: "` to the query.
3. Retrieve the 5 most relevant documents by cosine similarity.
4. Show the text of the retrieved documents (showing only the first 500 characters of each is enough).
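
Since the exercise targets FAISS, steps 2 and 3 can also be sketched with an exact inner-product index. The sketch assumes the vectors are L2-normalized first, so that inner product equals cosine similarity. `faiss.IndexFlatIP`, `index.add` and `index.search` are FAISS calls; the function name `top_k_cosine`, the NumPy fallback used when `faiss` is not installed, and the random stand-in vectors are assumptions made for illustration only.

```python
import numpy as np

try:
    import faiss  # optional; typically installed with `pip install faiss-cpu`
except ImportError:
    faiss = None


def top_k_cosine(doc_vecs, query_vecs, k=5):
    """Return (scores, ids) of the k most cosine-similar documents per query."""
    docs = np.ascontiguousarray(doc_vecs, dtype=np.float32).copy()
    queries = np.ascontiguousarray(query_vecs, dtype=np.float32).copy()

    # L2-normalize so that inner product equals cosine similarity
    # (assumes no all-zero vectors)
    docs /= np.linalg.norm(docs, axis=1, keepdims=True)
    queries /= np.linalg.norm(queries, axis=1, keepdims=True)

    if faiss is not None:
        index = faiss.IndexFlatIP(docs.shape[1])  # exact inner-product search
        index.add(docs)
        return index.search(queries, k)

    # Fallback: brute-force NumPy search, equivalent for an exact index
    sims = queries @ docs.T
    ids = np.argsort(-sims, axis=1)[:, :k]
    scores = np.take_along_axis(sims, ids, axis=1)
    return scores, ids


# Random stand-in vectors (not real embeddings); the query is a noisy copy of row 7
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(100, 384)).astype(np.float32)
query_vec = doc_vecs[[7]] + 0.01 * rng.normal(size=(1, 384)).astype(np.float32)

scores, ids = top_k_cosine(doc_vecs, query_vec, k=5)
print(ids[0], scores[0])
```

With real data, `doc_vecs` would be the saved embedding matrix and `query_vec` the encoded query (with the `"query: "` prefix for E5). The returned `ids` index into the document list.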

In [12]:
# Robust cell: loads or generates embeddings, then runs the query and Top-5 retrieval
import os
import time

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.datasets import fetch_20newsgroups

# -------------------------
# Configuration (adjust as needed)
# -------------------------
modelo = "sbert"   # "sbert" or "e5"
N_DOCS = 2000
TOP_K = 5
QUERY = "space exploration"   # <-- change the query here
EMB_DIR = "embeddings"
os.makedirs(EMB_DIR, exist_ok=True)

# Candidate embedding paths, checked in order
possible_paths = {
    "sbert": [
        os.path.join(EMB_DIR, "sbert_embeddings.npy"),
        os.path.join(EMB_DIR, "embeddings_sbert.npy"),
        "embeddings_sbert.npy",
        "embeddings.npy"
    ],
    "e5": [
        os.path.join(EMB_DIR, "e5_embeddings.npy"),
        os.path.join(EMB_DIR, "embeddings_e5.npy"),
        "embeddings_e5.npy",
        "embeddings.npy"
    ]
}

# -------------------------
# 1) Load the corpus if it is not already in memory
# -------------------------
try:
    # reuse the variable if it already exists in memory (newsgroupsdocs)
    newsgroupsdocs
except NameError:
    newsgroups = fetch_20newsgroups(subset='all', remove=('headers','footers','quotes'))
    newsgroupsdocs = newsgroups.data

docs = newsgroupsdocs[:N_DOCS]
print(f"[INFO] Documents in memory: {len(docs)} (limited to {N_DOCS})")

# -------------------------
# 2) Look for an embeddings file; generate it if none exists
# -------------------------
def find_existing_path(key):
    """Return the first existing candidate path for the given model key, or None."""
    for p in possible_paths[key]:
        if os.path.exists(p):
            return p
    return None

emb_path = find_existing_path(modelo)

if emb_path:
    print(f"[INFO] Found embeddings file for '{modelo}': {emb_path}")
    embeddings = np.load(emb_path)
else:
    print(f"[INFO] No saved embeddings found for '{modelo}'. Generating them now (CPU).")
    # pick the model settings, then load the model and encode the documents
    if modelo == "sbert":
        model_name = "all-MiniLM-L6-v2"
        docs_for_model = docs
        save_name = os.path.join(EMB_DIR, "sbert_embeddings.npy")
        batch_size = 64
    else:  # e5
        model_name = "intfloat/e5-base"
        docs_for_model = [f"passage: {d}" for d in docs]
        save_name = os.path.join(EMB_DIR, "e5_embeddings.npy")
        batch_size = 32

    print(f"[INFO] Loading model {model_name}...")
    model = SentenceTransformer(model_name, device="cpu")
    t0 = time.time()
    embeddings = model.encode(docs_for_model, convert_to_numpy=True, show_progress_bar=True, batch_size=batch_size)
    print(f"[INFO] Embeddings generated in {time.time()-t0:.1f}s. Shape: {embeddings.shape}")
    np.save(save_name, embeddings)
    print(f"[INFO] Saved embeddings to: {save_name}")

# Guard: row i of `embeddings` must correspond to docs[i]
if embeddings.shape[0] != len(docs):
    raise ValueError(f"Embeddings have {embeddings.shape[0]} rows but there are {len(docs)} documents")

# -------------------------
# 3) Encode the query (E5 needs the "query: " prefix)
# -------------------------
if modelo == "e5":
    q_proc = "query: " + QUERY
    model_for_query = SentenceTransformer("intfloat/e5-base", device="cpu")
else:
    q_proc = QUERY
    model_for_query = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

print(f"[INFO] Encoding query with model '{modelo}': \"{QUERY}\"")
q_vec = model_for_query.encode([q_proc], convert_to_numpy=True)

# -------------------------
# 4) Cosine similarity and top-K
# -------------------------
sims = cosine_similarity(q_vec, embeddings)[0]
idxs = sims.argsort()[::-1][:TOP_K]  # indices sorted by descending similarity

print("\n--- Top-{} results ---".format(TOP_K))
for rank, idx in enumerate(idxs, start=1):
    score = sims[idx]
    texto = docs[idx]
    print(f"\nRank {rank} | index {idx} | score {score:.4f}")
    print(texto[:500].replace("\n", " ") + "...")


[INFO] Documents in memory: 2000 (limited to 2000)
[INFO] Found embeddings file for 'sbert': embeddings/sbert_embeddings.npy
[INFO] Encoding query with model 'sbert': "space exploration"

--- Top-5 results ---

Rank 1 | index 495 | score 0.4991
I am posting this for a friend without internet access. Please inquire to the phone number and address listed. ---------------------------------------------------------------------  "Space: Teaching's Newest Frontier" Sponsored by the Planetary Studies Foundation  The Planetary Studies Foundation is sponsoring a one week class for teachers called "Space: Teaching's Newest Frontier." The class will be held at the Sheraton Suites in Elk Grove, Illinois from June 14 through June 18. Participants wh...

Rank 2 | index 1643 | score 0.4398
 Well, here goes.  The first item of business is to establish the importance space life sciences in the whole of scheme of humankind.  I mean compared to football and baseball, the average j