# Exercise 6: Dense Retrieval and Introduction to FAISS

## Objective

Generate embeddings with sentence-transformers (SBERT, E5) and index documents with FAISS.

## Part 0: Loading the Corpus
### Activity

1. Load the 20 Newsgroups corpus with `sklearn.datasets.fetch_20newsgroups`.
2. Limit the corpus to the first 2000 documents to keep processing manageable.

In [86]:
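from sklearn.datasets import fetch_20newsgroups

# Load every split and strip headers, footers and quoted replies so only the message body remains
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
newsgroupsdocs = newsgroups.data

# Keep only the first 2000 documents
first_2000_docs = newsgroupsdocs[:2000]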


## Part 2: Generating Embeddings
### Activity

1. Choose a sentence-transformers model. Two options are `'all-MiniLM-L6-v2'` (SBERT) and `'intfloat/e5-base'` (E5). When using E5, prepend `"passage: "` to each document before encoding.
2. Generate the embedding vectors for all documents with the selected model.
3. Store the embeddings in a NumPy array for later indexing.
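The E5 prefix requirement in step 1 can be sketched as follows. This is a minimal illustration, not part of the original notebook: `add_prefix` and `encode_passages` are hypothetical helper names, and `model` is assumed to be any object exposing a `SentenceTransformer`-style `encode` method.

```python
E5_PASSAGE_PREFIX = "passage: "  # prefix the exercise asks for when using E5


def add_prefix(texts, prefix):
    """Return a new list with `prefix` prepended to every text."""
    return [prefix + text for text in texts]


def encode_passages(model, docs, prefix=E5_PASSAGE_PREFIX):
    """Encode documents with a SentenceTransformer-like `model` (assumed interface).

    The prefix is applied to every document before encoding.
    """
    return model.encode(add_prefix(docs, prefix), convert_to_numpy=True)


# Possible usage in the notebook (downloads the model on first use):
# from sentence_transformers import SentenceTransformer
# e5_model = SentenceTransformer('intfloat/e5-base')
# e5_embeddings = encode_passages(e5_model, first_2000_docs)
```

With `'all-MiniLM-L6-v2'` no prefix is needed, so the documents can be passed to `model.encode` unchanged.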

In [87]:
!pip install -U sentence-transformers


In [88]:
print(newsgroupsdocs[:5])

["\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n", 'My brother is in the market for a high-performance video card that supports\nVESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:\n\n  - Diamond Stealth Pro Local Bus\n\n  - Orchid Farenheit 1280\n\n  - ATI Graphics Ultra Pro\n\n  - Any other high-perf

In [89]:
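from sentence_transformers import SentenceTransformer

# SBERT model; it produces 384-dimensional sentence embeddings
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Encode the 2000 documents into a NumPy array with one row per document
embeddings = model.encode(first_2000_docs, convert_to_numpy=True)

print(embeddings)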



[[ 0.00207802  0.02345034  0.0248088  ...  0.00143588  0.0151075
   0.05287576]
 [ 0.05006028  0.02698098 -0.00886481 ... -0.00887172 -0.06737079
   0.05656356]
 [ 0.01640473  0.08100048 -0.04953602 ... -0.04184625 -0.07800215
  -0.03130955]
 ...
 [-0.0748926  -0.00042234  0.01527547 ... -0.12211472 -0.02859254
   0.05603697]
 [ 0.09780728  0.04209511 -0.06449971 ... -0.03027233  0.08681752
   0.01879652]
 [ 0.04761754 -0.01735131 -0.02222623 ... -0.02176011  0.0030258
   0.0142307 ]]


In [90]:
embeddings.shape
embeddings.shape[1]



384

## Part 3: Indexing with FAISS
### Activity

1. Create a flat index with `faiss.IndexFlatL2` for Euclidean-distance search.
2. Make sure to use the correct dimension `(embedding_dim = doc_embeddings.shape[1])`. In this notebook the embeddings array is named `embeddings`.
3. Add the document vectors to the index.
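As a rough mental model of what a flat L2 index does at query time, the sketch below performs the same kind of exhaustive search in pure Python. It is not how FAISS is implemented, and `squared_l2` and `flat_l2_search` are hypothetical names used only for illustration. Like `IndexFlatL2`, it reports squared Euclidean distances, smallest first.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k):
    """Compare `query` against every stored vector and keep the k nearest.

    Returns (distances, indices), both ordered from nearest to farthest.
    """
    scored = ((squared_l2(v, query), i) for i, v in enumerate(vectors))
    nearest = heapq.nsmallest(k, scored)
    distances = [d for d, _ in nearest]
    indices = [i for _, i in nearest]
    return distances, indices


# Toy 2-dimensional "embeddings" instead of the 384-dimensional ones
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
toy_distances, toy_indices = flat_l2_search(toy_vectors, [1.0, 1.0], k=2)
print(toy_distances, toy_indices)  # [1.0, 2.0] [1, 0]
```

Because every vector is compared against the query, the result is exact; the cost grows linearly with the number of indexed documents, which is acceptable for 2000 vectors.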

In [91]:
!pip install faiss-cpu


In [92]:
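import faiss

# Flat (exhaustive) index over Euclidean distance; the dimension must match the embeddings (384 here)
index = faiss.IndexFlatL2(embeddings.shape[1])

# Flat indexes require no training, so this is already True
index.is_trained

# Add the document vectors; row i of `embeddings` receives id i
index.add(embeddings)
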

In [93]:
import pandas as pd

df_embeddings = pd.DataFrame({
    'document': first_2000_docs,
    'embedding': list(embeddings)
})

df_embeddings.head()

   document                                            embedding
0  \n\nI am sure some bashers of Pens fans are pr...  [0.0020780163, 0.023450343, 0.0248088, -0.0101...
1  My brother is in the market for a high-perform...  [0.05006028, 0.026980976, -0.008864814, -0.035...
2  \n\n\n\n\tFinally you said what you dream abou...  [0.016404726, 0.081000485, -0.04953602, -0.008...
3  \nThink!\n\nIt's the SCSI card doing the DMA t...  [-0.019391453, 0.011494373, -0.014787207, -0.0...
4  1) I have an old Jasmine drive which I cann...     [-0.039287087, -0.055402797, -0.074536145, -0....


## Part 4: Semantic Query
### Activity

1. Write a natural-language query. Examples:

    * "God, religion, and spirituality"
    * "space exploration"
    * "car maintenance"

2. Encode the query with the same embedding model. When using E5, prepend `"query: "` to the query.
3. Retrieve the 5 most relevant documents with `index.search(...)`.
4. Display the text of the retrieved documents (showing only the first 500 characters of each is enough).
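Steps 2 and 3 can be wrapped in one small helper, sketched below under stated assumptions: `semantic_search` is a hypothetical name, `model` is assumed to expose a `SentenceTransformer`-style `encode` method, and `index` a FAISS-style `search(queries, k)` method that returns one row of results per query.

```python
def semantic_search(model, index, query, top_k=5, prefix=""):
    """Encode one query and return its top_k (distances, indices).

    Pass prefix="query: " when the index was built with an E5 model;
    leave it empty for all-MiniLM-L6-v2.
    """
    # encode() receives a list so the result has one row per query
    query_embedding = model.encode([prefix + query], convert_to_numpy=True)
    distances, indices = index.search(query_embedding, top_k)
    # Only one query was sent, so return the first (and only) row of each
    return distances[0], indices[0]


# Possible usage in the notebook:
# distances, indices = semantic_search(model, index, "space exploration")
```

Returning the first row avoids the later `flatten()` call when only a single query is issued.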

In [94]:
query = "God, religion and spirituality"
top_k = 5

In [95]:
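# Encode the query as a 1 x 384 array (the index expects a 2D input)
query_embedding = model.encode([query], convert_to_numpy=True)

# Both results have shape (1, top_k): one row per query
distances, indices = index.search(query_embedding, top_k)

print(df_embeddings.iloc[indices.flatten()]['document'])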

996                \n\n\n\nHumanist, or sub-humanist? :-)
282     \nI didn't know God was a secular humanist...\...
791     Above all, love each other deeply, because lov...
677      \n(Deletion)\n \nFor me, it is a "I believe n...
1129    \n\nIt is all written in _The_Wholly_Babble:_t...
Name: document, dtype: object
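The pandas output above truncates each document to a single short line. To show up to the first 500 characters of each hit, as the activity suggests, a helper along these lines could be used. `preview_results` is a hypothetical name, and the distance label assumes the values come from an `IndexFlatL2` search.

```python
def preview_results(docs, indices, distances, max_chars=500):
    """Build a readable preview of retrieved documents.

    `indices` and `distances` are the result row for a single query,
    for example indices[0] and distances[0] from index.search.
    """
    blocks = []
    for rank, (doc_id, dist) in enumerate(zip(indices, distances), start=1):
        text = docs[int(doc_id)][:max_chars]  # keep only the first max_chars characters
        header = f"[{rank}] doc {int(doc_id)} (squared L2 distance {float(dist):.4f})"
        blocks.append(header + "\n" + text)
    return "\n\n".join(blocks)


# Possible usage in the notebook:
# print(preview_results(first_2000_docs, indices[0], distances[0]))
```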
