# Exercise 6: Dense Retrieval and Introduction to FAISS

## Objective

Generate embeddings with sentence-transformers (SBERT, E5) and index documents with FAISS.

## Part 0: Loading the Corpus
### Activity

1. Load the 20 Newsgroups corpus with `sklearn.datasets.fetch_20newsgroups`.
2. Limit the corpus to the first 2000 documents to keep processing manageable.

In [3]:
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

# Keep only the first 2000 documents
newsgroupsdocs = newsgroups.data[:2000]

## Part 2: Generating Embeddings
### Activity

1. Use two sentence-transformers models, for example `'all-MiniLM-L6-v2'` (SBERT) and `'intfloat/e5-base'` (E5). When using E5, prepend `"passage: "` to each document before encoding.
2. Generate embedding vectors for all documents with each selected model.
3. Store the embeddings in a NumPy array for later indexing.
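
The E5 prefix convention described above is easy to get wrong when documents and queries are prepared in different cells. A minimal sketch of a helper that centralizes it follows; `add_e5_prefix` and `E5_PREFIXES` are hypothetical names introduced here for illustration and are not part of sentence-transformers.

```python
from typing import Iterable, List

# Prefixes prescribed by the exercise: "passage: " for documents, "query: " for queries.
E5_PREFIXES = {"passage": "passage: ", "query": "query: "}


def add_e5_prefix(texts: Iterable[str], kind: str) -> List[str]:
    """Return copies of `texts` with the E5 prefix for `kind` prepended.

    Hypothetical helper for this exercise; not a sentence-transformers API.
    """
    if kind not in E5_PREFIXES:
        raise ValueError(f"kind must be one of {sorted(E5_PREFIXES)}, got {kind!r}")
    prefix = E5_PREFIXES[kind]
    return [prefix + text for text in texts]


docs = ["The shuttle launch was delayed.", "My car needs new brake pads."]
passages = add_e5_prefix(docs, "passage")
queries = add_e5_prefix(["space exploration"], "query")
print(passages[0])
print(queries[0])
```

The prefixed lists can then be passed to `model.encode(...)` in place of the raw texts. Models without a prefix convention, such as `'all-MiniLM-L6-v2'`, take the raw texts directly.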

### SBERT MODEL

In [4]:
import pandas as pd
df = pd.DataFrame({'doc': newsgroupsdocs[:2000]})
df.head()

| | doc |
|---|---|
| 0 | \n\nI am sure some bashers of Pens fans are pr... |
| 1 | My brother is in the market for a high-perform... |
| 2 | \n\n\n\n\tFinally you said what you dream abou... |
| 3 | \nThink!\n\nIt's the SCSI card doing the DMA t... |
| 4 | 1) I have an old Jasmine drive which I cann... |


In [5]:
from sentence_transformers import SentenceTransformer

# Load the pretrained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Use the same 2000 documents selected above
documentos = newsgroupsdocs[:2000]

# Generate the embeddings
document_embeddings = model.encode(documentos, show_progress_bar=True)

print(f"Embeddings shape: {document_embeddings.shape}")


Batches: 100%|██████████| 63/63 [00:54<00:00,  1.15it/s]

Embeddings shape: (2000, 384)

In [6]:
import pandas as pd

# Create the DataFrame (this replaces the one built earlier)
df = pd.DataFrame({'documento': documentos})

# Add the embeddings, one vector per row
df['embedding'] = list(document_embeddings)

# Show the first rows
df.head()


| | documento | embedding |
|---|---|---|
| 0 | \n\nI am sure some bashers of Pens fans are pr... | [0.0020780046, 0.023450432, 0.024808863, -0.01... |
| 1 | My brother is in the market for a high-perform... | [0.050060306, 0.026980933, -0.008864836, -0.03... |
| 2 | \n\n\n\n\tFinally you said what you dream abou... | [0.016404754, 0.08100051, -0.049535964, -0.008... |
| 3 | \nThink!\n\nIt's the SCSI card doing the DMA t... | [-0.019391475, 0.0114943655, -0.014787269, -0.... |
| 4 | 1) I have an old Jasmine drive which I cann... | [-0.039287075, -0.055402867, -0.07453619, -0.0... |


### E5 MODEL

In [7]:
from sentence_transformers import SentenceTransformer

# Load the E5-base model (this rebinds `model`, replacing the SBERT model)
model = SentenceTransformer("intfloat/e5-base")

# E5 expects "passage: " prepended to each document
documentos_e5 = ["passage: " + doc for doc in newsgroupsdocs[:2000]]

# Generate the embeddings
embeddings_e5 = model.encode(documentos_e5, show_progress_bar=True)

print(f"E5 embeddings shape: {embeddings_e5.shape}")  # (2000, 768)


Batches: 100%|██████████| 63/63 [11:09<00:00, 10.62s/it]

E5 embeddings shape: (2000, 768)

In [8]:
import pandas as pd

df = pd.DataFrame({
    'documento': newsgroupsdocs[:2000],
    'embedding_SBERT': list(document_embeddings),
    'embedding_e5': list(embeddings_e5)
})

df.head()


| | documento | embedding_SBERT | embedding_e5 |
|---|---|---|---|
| 0 | \n\nI am sure some bashers of Pens fans are pr... | [0.0020780046, 0.023450432, 0.024808863, -0.01... | [-0.05799896, -0.0020638704, -0.020161983, -0.... |
| 1 | My brother is in the market for a high-perform... | [0.050060306, 0.026980933, -0.008864836, -0.03... | [-0.047147322, 0.00045925583, 0.024559252, -0.... |
| 2 | \n\n\n\n\tFinally you said what you dream abou... | [0.016404754, 0.08100051, -0.049535964, -0.008... | [-0.032370448, 0.024496663, -0.019904086, -0.0... |
| 3 | \nThink!\n\nIt's the SCSI card doing the DMA t... | [-0.019391475, 0.0114943655, -0.014787269, -0.... | [-0.077318035, 0.017821243, -0.0042953906, -0.... |
| 4 | 1) I have an old Jasmine drive which I cann... | [-0.039287075, -0.055402867, -0.07453619, -0.0... | [-0.03879634, 0.0034529453, -0.018807113, 0.00... |
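
Keeping one vector per DataFrame row is handy for inspection, but the indexing step needs a single 2-D `float32` array, and the activity asks for the embeddings to be stored as a NumPy array. The sketch below shows one way to go from a column of vectors back to a matrix and to persist it. The toy data and the file name `embeddings_demo.npy` are assumptions made for illustration; they do not come from the notebook.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for a DataFrame column holding one 1-D vector per row.
rng = np.random.default_rng(0)
embedding_column = [rng.normal(size=4) for _ in range(3)]  # float64 by default

# Stack into shape (n_documents, embedding_dim) and cast to C-contiguous float32.
matrix = np.ascontiguousarray(np.stack(embedding_column), dtype=np.float32)

# Round-trip through a .npy file so the slow encoding step need not be repeated.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings_demo.npy")  # hypothetical file name
    np.save(path, matrix)
    restored = np.load(path)

print(matrix.shape, matrix.dtype)
```

With the real data, `np.stack(df['embedding_e5'].to_list())` would play the role of `np.stack(embedding_column)`.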


## Part 3: Indexing with FAISS
### Activity

1. Create a flat index with `faiss.IndexFlatL2` for Euclidean-distance search.
2. Make sure to use the correct dimension (`embedding_dim = doc_embeddings.shape[1]`).
3. Add the document vectors to the index.
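
To make concrete what a flat L2 index computes, here is a NumPy-only sketch of exhaustive nearest-neighbour search. It is a stand-in for illustration, not FAISS itself: `brute_force_l2_search` is a hypothetical name, and the random vectors are toy data. Like `faiss.IndexFlatL2`, it compares the query against every stored vector and reports squared Euclidean distances.

```python
import numpy as np


def brute_force_l2_search(index_vectors, queries, k):
    """Exhaustive k-NN by squared Euclidean distance (NumPy stand-in for a flat L2 index)."""
    index_vectors = np.asarray(index_vectors, dtype=np.float32)
    queries = np.asarray(queries, dtype=np.float32)
    if index_vectors.shape[1] != queries.shape[1]:
        raise ValueError("query dimension must match the indexed dimension")
    # Pairwise differences, shape (n_queries, n_vectors, dim).
    diffs = queries[:, None, :] - index_vectors[None, :, :]
    # Squared distances, shape (n_queries, n_vectors).
    sq_dists = np.einsum("qnd,qnd->qn", diffs, diffs)
    # Indices of the k smallest distances per query, in ascending order.
    order = np.argsort(sq_dists, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order


rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(50, 8)).astype(np.float32)

# Query with the first stored vector, mirroring the notebook's document-0 query.
distances, indices = brute_force_l2_search(toy_embeddings, toy_embeddings[0:1], k=5)
print(indices[0])
print(distances[0])
```

Because the query is itself a stored vector, it comes back as the first hit with distance 0. This approach materializes every pairwise difference, so it only suits small collections.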

In [11]:
import pandas as pd
import faiss
import numpy as np

# 1. Convert the embeddings to float32, the dtype FAISS expects
embeddings = np.array(document_embeddings).astype('float32')

# 2. Get the vector dimension
embedding_dim = embeddings.shape[1]

# 3. Create the FAISS index (L2 = Euclidean distance)
index = faiss.IndexFlatL2(embedding_dim)

# 4. Add the vectors to the index
index.add(embeddings)

# 5. Query: find the 5 documents most similar to document 0
#    (document 0 is itself in the index, so it is returned first with distance 0)
query_vector = embeddings[0:1]
top_k = 5
distances, indices = index.search(query_vector, k=top_k)

# 6. Build a DataFrame with the results
resultados = pd.DataFrame({
    'indice_documento': indices[0],
    'distancia': distances[0],
    'contenido': [newsgroupsdocs[i] for i in indices[0]]
})

# 7. Display the DataFrame (FAISS returns results sorted by ascending distance)
resultados


| | indice_documento | distancia | contenido |
|---|---|---|---|
| 0 | 0 | 0.0 | \n\nI am sure some bashers of Pens fans are pr... |
| 1 | 629 | 0.61691 | \n\nBowman tended to overplay Francis at times... |
| 2 | 1803 | 0.791276 | Just some thoughts:\n\nI don't usually like to... |
| 3 | 458 | 0.853428 | \n\nAttention Penguins fans once again, appare... |
| 4 | 1921 | 0.912488 | \nYeh,but :\n\n1.Biran Sutter's playoff record... |


## Part 4: Semantic Query
### Activity

1. Write a natural-language query. Examples:

    * "God, religion, and spirituality"
    * "space exploration"
    * "car maintenance"

2. Encode the query with the same embedding model. When using E5, prepend `"query: "` to the query.
3. Retrieve the 5 most relevant documents with `index.search(...)`.
4. Display the text of the retrieved documents (showing only the first 500 characters of each is fine).

In [13]:
import faiss
import numpy as np

# Step 1: Write your query
consulta = "space exploration"  # You can change this query

# Step 2: Prepend "query: " as the E5 model requires
consulta_formateada = "query: " + consulta

# Step 3: Encode the query with the same model used for the documents
query_embedding = model.encode([consulta_formateada]).astype('float32')  # `model` is the E5 model loaded above

# Step 4: Build a FAISS index over the E5 document embeddings
embedding_dim_e5 = embeddings_e5.shape[1]
index_e5 = faiss.IndexFlatL2(embedding_dim_e5)
index_e5.add(embeddings_e5.astype('float32'))

# Step 5: Retrieve the 5 most relevant documents with FAISS
top_k = 5
distancias, indices = index_e5.search(query_embedding, k=top_k)

# Step 6: Collect the results in a DataFrame
df_resultados = pd.DataFrame({
    'indice_documento': indices[0],
    'distancia': distancias[0],
    'contenido_resumido': [newsgroupsdocs[i][:500] for i in indices[0]]
})

# Display the DataFrame (results are already sorted by ascending distance)
df_resultados


| | indice_documento | distancia | contenido_resumido |
|---|---|---|---|
| 0 | 25 | 0.361933 | AW&ST had a brief blurb on a Manned Lunar Exp... |
| 1 | 1643 | 0.364152 | \nWell, here goes.\n\nThe first item of busine... |
| 2 | 495 | 0.370052 | I am posting this for a friend without interne... |
| 3 | 390 | 0.381102 | As for SF and advertising in space. There is a... |
| 4 | 784 | 0.383229 | \nWhatabout, Schools, Universities, Rich Indiv... |
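
The `distancia` values are squared L2 distances, which are hard to read on their own. If the embeddings are unit-normalized, the squared distance and the cosine similarity are tied together by d² = 2 − 2·cos. Whether the notebook's embeddings are normalized is an assumption that has not been verified here; it can be checked with `np.linalg.norm(embeddings_e5, axis=1)`. The sketch below checks the identity on toy vectors; `squared_l2_to_cosine` is a hypothetical helper name.

```python
import numpy as np


def squared_l2_to_cosine(sq_dist):
    """Cosine similarity implied by a squared L2 distance between two unit vectors."""
    return 1.0 - np.asarray(sq_dist, dtype=np.float64) / 2.0


# Two random toy vectors, scaled to unit length.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 16))
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

sq_dist = float(np.sum((a - b) ** 2))  # squared Euclidean distance
cosine = float(a @ b)                  # cosine similarity of unit vectors

# The two printed values should agree up to floating-point error.
print(float(squared_l2_to_cosine(sq_dist)), cosine)
```

Under that assumption, a squared distance of about 0.36 would correspond to a cosine similarity of about 0.82. For vectors that are not unit length the conversion does not hold.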
