# Exercise 6: Dense Retrieval and Introduction to FAISS

## Objective

Generate embeddings with sentence-transformers (SBERT, E5) and index documents with FAISS.

## Part 0: Loading the Corpus
### Activity

1. Load the 20 Newsgroups corpus with `sklearn.datasets.fetch_20newsgroups`.
2. Limit the corpus to the first 2000 documents to keep processing manageable.

In [3]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
newsgroupsdocs = newsgroups.data

In [4]:
newsgroupsdocs = newsgroupsdocs[:2000]
labels = newsgroups.target[:2000]

In [9]:
!pip install -U sentence-transformers


## Part 2: Generating Embeddings
### Activity

1. Use two sentence-transformers models. You can use `'all-MiniLM-L6-v2'` (SBERT) or `'intfloat/e5-base'` (E5). When using E5, prepend `"passage: "` to each document before encoding.
2. Generate the embedding vectors for all documents with the selected model.
3. Store the embeddings in a NumPy array for later indexing.
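
The E5 prefix rule from step 1 can be sketched as follows. The helper names `add_prefix` and `encode_e5_passages` are hypothetical (they are not part of sentence-transformers), and the library import is deferred into the function so the prefixing logic can be checked without downloading a model. Treat it as a minimal sketch rather than the exercise's reference solution.

```python
from typing import List


def add_prefix(texts: List[str], prefix: str) -> List[str]:
    """Prepend an E5-style role prefix ("passage: " or "query: ") to each text."""
    return [prefix + text for text in texts]


def encode_e5_passages(docs: List[str], model_name: str = "intfloat/e5-base"):
    """Encode documents with an E5 model and return a NumPy array of embeddings.

    Hypothetical helper: it adds the "passage: " prefix that E5 expects for
    documents before calling SentenceTransformer.encode.
    """
    # Deferred import so add_prefix stays usable without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(
        add_prefix(docs, "passage: "),
        batch_size=32,
        show_progress_bar=True,
    )
```

With the corpus from Part 0 this would be called as `embeddings_e5 = encode_e5_passages(newsgroupsdocs)`. Queries later need the matching `"query: "` prefix, which `add_prefix([query], "query: ")` covers.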

In [12]:
import numpy as np
from sentence_transformers import SentenceTransformer

sbert_model = SentenceTransformer('all-MiniLM-L6-v2')

embeddings_sbert = sbert_model.encode(
    newsgroupsdocs,
    batch_size=32,
    show_progress_bar=True
)

embeddings_sbert = np.array(embeddings_sbert)
embeddings_sbert.shape

(2000, 384)
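
Once the embeddings are in a NumPy array, indexing them with FAISS could look like the sketch below. It assumes the `faiss-cpu` package is installed and uses `IndexFlatIP` (exact inner-product search) over L2-normalized vectors, so the returned scores equal cosine similarity. The helper names `l2_normalize`, `build_cosine_index`, and `search_index` are illustrative, not part of FAISS.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Return a float32 array whose rows have unit L2 norm."""
    vectors = np.ascontiguousarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Clip to avoid dividing by zero for an all-zero row.
    return vectors / np.clip(norms, 1e-12, None)


def build_cosine_index(embeddings: np.ndarray):
    """Build an exact FAISS index whose inner-product scores are cosines."""
    import faiss  # deferred import; assumes `pip install faiss-cpu`

    unit = l2_normalize(embeddings)
    index = faiss.IndexFlatIP(unit.shape[1])  # dimension of each vector
    index.add(unit)
    return index


def search_index(index, query_vectors: np.ndarray, top_k: int = 5):
    """Return (scores, ids), each with shape (n_queries, top_k)."""
    return index.search(l2_normalize(query_vectors), top_k)


# Possible usage with the objects defined in this notebook:
#   index = build_cosine_index(embeddings_sbert)
#   scores, ids = search_index(index, sbert_model.encode(["computer hardware"]))
```

A flat index compares the query against every stored vector, which is reasonable for 2000 documents; approximate FAISS index types trade some accuracy for speed on larger collections.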

## Part 3: Querying
### Activity

1. Write a natural-language query. Examples:

    * "God, religion, and spirituality"
    * "space exploration"
    * "car maintenance"

2. Encode the query with the same embedding model. When using E5, prepend `"query: "` to the query.
3. Retrieve the 5 most relevant documents by cosine similarity.
4. Display the text of the retrieved documents (showing only the first 500 characters of each is fine).
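
Step 3 amounts to ranking documents by the cosine of the angle between the query vector and each document vector. The NumPy-only sketch below shows this on toy 2-D vectors that are made up for illustration (they do not come from the corpus); `cosine_top_k` is a hypothetical helper name, and it assumes no vector is all zeros.

```python
import numpy as np


def cosine_top_k(query_vec, doc_matrix, top_k=5):
    """Return indices and cosine scores of the top_k most similar rows."""
    q = np.asarray(query_vec, dtype=np.float64)
    docs = np.asarray(doc_matrix, dtype=np.float64)
    # cos(q, d) = (q . d) / (||q|| * ||d||), computed for every row d at once.
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:top_k]  # highest similarity first
    return order, sims[order]


# Four toy "documents" and one toy "query" in 2-D.
toy_docs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]])
indices, scores = cosine_top_k([1.0, 0.2], toy_docs, top_k=2)
print(indices, scores)
```

The query points almost along the first axis, so row 0 ranks first and row 1 second; the vector pointing the opposite way (row 3) gets a negative score and ranks last.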

In [13]:
from sklearn.metrics.pairwise import cosine_similarity

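def search(query_embedding, embeddings_matrix, top_k=5):
    """Return the indices and cosine scores of the top_k most similar documents."""
    # cosine_similarity expects 2-D inputs, so wrap the single query in a list.
    sims = cosine_similarity([query_embedding], embeddings_matrix)[0]
    # argsort is ascending; reverse it and keep the first top_k entries.
    top_indices = np.argsort(sims)[::-1][:top_k]
    return top_indices, sims[top_indices]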

In [38]:
query = "computer hardware"

In [39]:
query_embedding = sbert_model.encode(query)
top_idx, scores = search(query_embedding, embeddings_sbert, top_k=5)

In [42]:
for rank, (idx, score) in enumerate(zip(top_idx, scores), start=1):
    print(f"\n Document {rank} - Similarity: {score:.4f}")
    print(newsgroupsdocs[idx][:500])


 Document 1 - Similarity: 0.4841
In the next few months I am intending to build a 386 or 486 PC system
for remote monitoring. I would welcome any comments or advice you may
have on the choice of motherboard, HDDs and I/O boards. Recommendations
for good companies selling these would be a big help.

Many thanks,

Peter Green.



 Document 2 - Similarity: 0.4786
If anyone has any information about the upcoming new computers
(Cyclone and Tempest), I am in need of some info. Anything would be
greatly appreciated.

Thanks,

 Document 3 - Similarity: 0.4662
I guess the real question is:

Who asked the original questions, and why was it so _broad_.
Are we talking pure processing power (what kind of processing BTW)
isolated from every other factor and influence in the system?  
Or are we shopping for a home computer based on the CPU specs (yuck)!

I just finished a project that involves real-time processing of serial
data and discovered that the programming interface (assembly) has
_a l