# Exercise 6: Dense Retrieval and Introduction to FAISS

## Objective of the Exercise

Generate embeddings with sentence-transformers (SBERT, E5) and index documents with FAISS.

## Part 0: Loading the Corpus
### Activity

1. Load the 20 Newsgroups corpus with `sklearn.datasets.fetch_20newsgroups`.
2. Limit the corpus to the first 2000 documents to keep processing manageable.

In [8]:
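from sklearn.datasets import fetch_20newsgroups
import pandas as pd

# Download the full corpus, stripping headers, footers, and quoted replies
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
# Keep only the first 2000 documents in a single-column DataFrame
newsgroupsdocs = pd.DataFrame(newsgroups.data[:2000], columns=["documento"])
newsgroupsdocs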

| | documento |
|---|---|
| 0 | \n\nI am sure some bashers of Pens fans are pr... |
| 1 | My brother is in the market for a high-perform... |
| 2 | \n\n\n\n\tFinally you said what you dream abou... |
| 3 | \nThink!\n\nIt's the SCSI card doing the DMA t... |
| 4 | 1) I have an old Jasmine drive which I cann... |
| ... | ... |
| 1995 | Oakland, California, Sunday, April 25th, 1:05 ... |
| 1996 | \n\nNo matter how "absurd" it is to suggest th... |
| 1997 | Anyone here know if NCD is doing educational p... |
| 1998 | \ntoo bad he doesn't bring the ability to hit,... |


## Part 2: Generating Embeddings
### Activity

1. Use two sentence-transformers models. You can use `'all-MiniLM-L6-v2'` (SBERT) or `'intfloat/e5-base'` (E5). When using E5, prepend `"passage: "` to each document before encoding.
2. Generate the embedding vectors for all documents with the selected model.
3. Store the embeddings in a NumPy array for later indexing.
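The E5 prefix rule in step 1 is easy to forget, so it can help to wrap it in a small helper. The following is a minimal sketch, not part of the original notebook: `add_e5_prefix` and `encode_e5` are hypothetical names, and `model` is assumed to be a `SentenceTransformer` loaded with `'intfloat/e5-base'` (any object with a compatible `.encode` method would work).

```python
from typing import List, Sequence


def add_e5_prefix(texts: Sequence[str], kind: str = "passage") -> List[str]:
    """Prepend the E5 role prefix ("passage: " or "query: ") to each text."""
    if kind not in ("passage", "query"):
        raise ValueError("kind must be 'passage' or 'query'")
    return [f"{kind}: {text}" for text in texts]


def encode_e5(model, texts: Sequence[str], kind: str = "passage"):
    """Encode texts with an E5 model after adding the role prefix.

    `model` is assumed to expose .encode(list_of_str), as
    sentence_transformers.SentenceTransformer does.
    """
    return model.encode(add_e5_prefix(texts, kind))


# Possible usage in the notebook (downloads the model on first use):
# from sentence_transformers import SentenceTransformer
# model_e5 = SentenceTransformer('intfloat/e5-base')
# embeddings_e5 = encode_e5(model_e5, docs, kind="passage")   # docs: list of str
# query_e5 = encode_e5(model_e5, ["space exploration"], kind="query")
```

Keeping the prefix logic in one place makes it harder to encode documents with `"passage: "` and then forget `"query: "` at search time.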

In [7]:
from sentence_transformers import SentenceTransformer
import numpy as np
# 1. Load the SBERT model
model_sbert = SentenceTransformer('all-MiniLM-L6-v2')
# 2. Convert the documents to a list
docs = newsgroupsdocs["documento"].tolist()
# 3. Generate the embeddings
embeddings_sbert = model_sbert.encode(docs, show_progress_bar=True)
# 4. Save to a NumPy file
np.save("embeddings_sbert.npy", embeddings_sbert)


In [9]:
print(embeddings_sbert.shape)

(2000, 384)


In [10]:
# Full embedding of the first document
print(embeddings_sbert[0])

[ 2.07797508e-03  2.34504342e-02  2.48088446e-02 -1.01102032e-02
  4.62613665e-02 -1.90387983e-02  6.19882718e-02  4.91665825e-02
  2.65861563e-02 -9.34641715e-03 -9.95098427e-02  3.97232622e-02
 -5.52095957e-02  2.53241733e-02  2.99360473e-02 -1.95666235e-02
 -6.08660653e-02  1.58701483e-02  2.53339261e-02  4.59387191e-02
 -1.41414236e-02 -7.94888381e-03  2.13751514e-02 -1.01096248e-02
  1.00899912e-01  1.32258311e-02  9.94407851e-03  6.49844036e-02
  3.59497592e-02  9.01051424e-03 -4.93551828e-02  2.84282714e-02
  1.66624710e-02 -7.03645423e-02  2.88974401e-02 -1.27835041e-02
 -1.62345972e-02 -2.95970589e-02  1.19796582e-03  1.47519559e-02
  3.27470824e-02  3.27007510e-02 -5.38815521e-02 -3.43445577e-02
  3.88207249e-02 -1.28942290e-02 -5.78634180e-02 -5.05731702e-02
  3.64844613e-02 -1.85512751e-02 -1.09562436e-02 -2.36339495e-02
  8.50375742e-02 -9.82703269e-02  3.15816104e-02  4.19593230e-02
 -2.14829370e-02 -4.29301336e-02  5.39161190e-02 -5.95207065e-02
  1.01381391e-02 -3.24808

In [14]:
import os
os.listdir()

['.virtual_documents', 'embeddings_sbert.npy']

## Part 3: Querying
### Activity

1. Write a natural-language query. Examples:

    * "God, religion, and spirituality"
    * "space exploration"
    * "car maintenance"

2. Encode the query with the same embedding model. When using E5, prepend `"query: "` to the query.
3. Retrieve the 5 most relevant documents by cosine similarity.
4. Show the text of the retrieved documents (showing only the first 500 characters of each is fine).
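Cosine similarity between a query vector and a document vector is their dot product divided by the product of their norms. To make step 3 concrete, the sketch below computes it with plain NumPy on toy data. The function name `top_k_cosine` and the toy vectors are illustrative only, and all vectors are assumed to be non-zero.

```python
import numpy as np


def top_k_cosine(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 5):
    """Return (indices, scores) of the k rows of doc_matrix closest to query_vec."""
    # L2-normalize both sides so the dot product equals the cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    # Sort in descending order of similarity and keep the first k positions.
    top_idx = np.argsort(scores)[::-1][:k]
    return top_idx, scores[top_idx]


# Toy data standing in for real embeddings: 3 documents, 2 dimensions.
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query = np.array([2.0, 0.0])

idx, scores = top_k_cosine(toy_query, toy_docs, k=2)
print(idx, scores)
```

The first toy document points in the same direction as the query, so it ranks first with similarity 1 even though the vectors differ in length; cosine similarity ignores magnitude.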

In [18]:
# Query
query = "space exploration"
# 1. Encode the query
query_embedding = model_sbert.encode([query])

In [20]:
from sklearn.metrics.pairwise import cosine_similarity

# 2. Cosine similarity between the query and every document
similarities = cosine_similarity(query_embedding, embeddings_sbert)[0]

# 3. Get the 5 most similar documents
top5_idx = np.argsort(similarities)[::-1][:5]

# 4. Show the results
for i, idx in enumerate(top5_idx):
    print(f"\nDocument #{i+1} (index {idx})")
    print(newsgroupsdocs["documento"][idx][:500])  # first 500 characters
    print("\nSimilarity:", similarities[idx])



Document #1 (index 495)
I am posting this for a friend without internet access. Please inquire
to the phone number and address listed.
---------------------------------------------------------------------

"Space: Teaching's Newest Frontier"
Sponsored by the Planetary Studies Foundation

The Planetary Studies Foundation is sponsoring a one week class for
teachers called "Space: Teaching's Newest Frontier." The class will be
held at the Sheraton Suites in Elk Grove, Illinois from June 14 through
June 18. Participants wh

Similarity: 0.49910414

Document #2 (index 1643)

Well, here goes.

The first item of business is to establish the importance space life
sciences in the whole of scheme of humankind.  I mean compared
to football and baseball, the average joe schmoe doesn't seem interested
or even curious about spaceflight.  I think that this forum can
make a major change in that lack of insight and education.

All of us, in our own way, can contribute to a comprehensive document
whi