# Exercise 6: Dense Retrieval and Introduction to FAISS

## Objective

Generate embeddings with sentence-transformers (SBERT, E5) and index the documents with FAISS.

In [15]:
!pip install sentence-transformers faiss-cpu





## Part 1: Loading the Corpus
### Activity

1. Load the 20 Newsgroups corpus with sklearn.datasets.fetch_20newsgroups.
2. Limit the corpus to the first 2000 documents to keep processing fast.
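
The loading step can be paired with a small guard against blank posts. Stripping headers, footers, and quotes can leave some posts empty or whitespace-only, and such posts carry no signal for retrieval. The sketch below is an optional addition, not part of the original exercise: `first_non_empty` is a hypothetical helper that keeps the first `limit` non-blank texts together with their original indices.

```python
def first_non_empty(texts, limit):
    """Return up to `limit` (original_index, text) pairs, skipping blank texts.

    Hypothetical helper for illustration; it is not part of scikit-learn.
    """
    kept = []
    if limit <= 0:
        return kept
    for idx, text in enumerate(texts):
        # str.strip() returns "" for empty or whitespace-only strings
        if text.strip():
            kept.append((idx, text))
            if len(kept) == limit:
                break
    return kept


# Toy input standing in for newsgroups.data
sample = ["first post", "", "   \n", "second post", "third post"]
pairs = first_non_empty(sample, limit=2)
```

Using this instead of a plain slice would change which 2000 documents are selected, so the outputs shown in this notebook would no longer match exactly.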

In [16]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
import faiss

# Load all 20 Newsgroups posts, stripping headers, footers, and quoted replies
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

# Keep only the first 2000 documents
documents = newsgroups.data[:2000]

print(f"Number of documents loaded: {len(documents)}")
print("Sample document:\n", documents[0][:200])

Number of documents loaded: 2000
Sample document:
 

I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However,


## Part 2: Generating Embeddings
### Activity

1. Choose one of two sentence-transformers models: `'all-MiniLM-L6-v2'` (SBERT) or `'intfloat/e5-base'` (E5). When using E5, prepend `"passage: "` to each document before encoding.
2. Generate the embedding vectors for all documents with the selected model.
3. Store the embeddings in a NumPy array for later indexing.
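
The E5 prefix rule from step 1 can be isolated in one place so that documents and queries are handled consistently. The sketch below is a minimal illustration: `add_e5_prefix` is a hypothetical helper (not part of sentence-transformers), and detecting E5 by looking for `"e5"` in the model name is an assumption borrowed from this notebook's own check.

```python
def add_e5_prefix(texts, kind, model_name):
    """Prefix texts with 'passage: ' or 'query: ' when the model looks like E5.

    Hypothetical helper for illustration. The "e5" substring test is a
    heuristic assumption, not an official way to detect the model family.
    """
    if kind not in ("passage", "query"):
        raise ValueError("kind must be 'passage' or 'query'")
    if "e5" not in model_name.lower():
        # Other models (e.g. all-MiniLM-L6-v2) take the raw text
        return list(texts)
    return [f"{kind}: {text}" for text in texts]


docs_e5 = add_e5_prefix(["some document"], "passage", "intfloat/e5-base")
docs_sbert = add_e5_prefix(["some document"], "passage", "all-MiniLM-L6-v2")
```

The returned list can be passed directly to `model.encode`.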

In [17]:
model_name = 'all-MiniLM-L6-v2'
# model_name = 'intfloat/e5-base'  # switch to E5

print(f"Loading model: {model_name}...")
model = SentenceTransformer(model_name)

# E5 models expect a "passage: " prefix on every document
if 'e5' in model_name:
    docs_to_encode = ["passage: " + doc for doc in documents]
else:
    docs_to_encode = documents

print("Generating embeddings")
embeddings = model.encode(docs_to_encode, show_progress_bar=True)

# FAISS expects float32 vectors
embeddings = np.array(embeddings).astype('float32')

print(f"Shape: {embeddings.shape}")

Loading model: all-MiniLM-L6-v2...
Generating embeddings


Batches:   0%|          | 0/63 [00:00<?, ?it/s]

Shape: (2000, 384)


## Part 3: Query
### Activity

1. Write a natural-language query. Examples:

    * "God, religion, and spirituality"
    * "space exploration"
    * "car maintenance"

2. Encode the query with the same embedding model. When using E5, prepend `"query: "` to the query.
3. Retrieve the 5 most relevant documents by cosine similarity.
4. Display the text of the retrieved documents (showing only the first 500 characters of each is enough).
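
Step 3 relies on the fact that the inner product of two L2-normalized vectors equals their cosine similarity, which is why an inner-product index over normalized embeddings returns cosine scores. The sketch below spells out that arithmetic in plain Python on toy 2-dimensional vectors. It is a dependency-free illustration, not the FAISS implementation, and `l2_normalize` and `top_k_cosine` are hypothetical names.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k_cosine(query, docs, k):
    """Return the k best (score, doc_id) pairs, highest cosine similarity first."""
    q = l2_normalize(query)
    scored = []
    for doc_id, doc in enumerate(docs):
        d = l2_normalize(doc)
        # Inner product of unit vectors == cosine similarity
        score = sum(a * b for a, b in zip(q, d))
        scored.append((score, doc_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "embeddings": vector lengths differ, only direction matters
docs = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
results = top_k_cosine([2.0, 0.0], docs, k=2)
```

Here document 0 points in the same direction as the query and ranks first, followed by document 2 at a 45-degree angle. Document 1 is orthogonal to the query and is dropped from the top 2.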

In [18]:

# L2-normalize in place so that inner product equals cosine similarity
faiss.normalize_L2(embeddings)

dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # exact (brute-force) inner-product index
index.add(embeddings)

query_text = "Stanislav petrov"

print(f"\nQuery: '{query_text}'")

# E5 models expect a "query: " prefix on the query
if 'e5' in model_name:
    query_input = "query: " + query_text
else:
    query_input = query_text

# Encode as a (1, dimension) float32 array and normalize it like the documents
query_vector = np.asarray(model.encode([query_input]), dtype='float32')

faiss.normalize_L2(query_vector)

k = 5
D, I = index.search(query_vector, k)  # D: similarity scores, I: document indices

for i in range(k):
    doc_id = I[0][i]
    score = D[0][i]
    print(f"\nResult {i+1} (Score: {score:.4f}):")
    print(documents[doc_id][:500])



Query: 'Stanislav petrov'

Result 1 (Score: 0.3612):

I saw Messier and Leetch shooting at a camera on Letterman(?).  I
could have been any show though, since I watch NONE of those late
night shows very regularly.
					-John Santore

Philadelphia Flyers in '93-'94! 

 ____________________                                
/                    \                   "We break the surface tension 
\_________     ____   \                   with our wild kinetic dreams"
/        

Result 2 (Score: 0.3397):

That was Clint Malarchuk.  That was a very dangerous accident.  He could he
died right there on the ice.  However, he has played since  
but I don't know where he is now.  I think he is still playing but I'm
not positive.  He was a Sabre at the time.
I don't know who skated into him though.


I remember a couple of seasons before the Malarchuk incident Borje
Salming of Toronto fell down in the crease and someone skated into
his face.  That took a lot of stiches to fix.


Result 