# Exercise 6: Introduction to Dense Retrieval

## Objective

Generate embeddings with sentence-transformers (SBERT, E5) and use them to retrieve the documents most relevant to a query.

## Part 0: Loading the Corpus
### Activity

1. Load the 20 Newsgroups corpus with `sklearn.datasets.fetch_20newsgroups`.
2. Limit the corpus to the first 2000 documents to keep processing manageable.

In [1]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
newsgroupsdocs = newsgroups.data

In [2]:
num_documentos = len(newsgroupsdocs)
print(f"The newsgroupsdocs dataset has {num_documentos} documents")

The newsgroupsdocs dataset has 18846 documents


In [5]:
# Limit the corpus to the first 2000 documents
newsgroupsdocs = newsgroupsdocs[:2000]

In [6]:
# Contents of document 0
print(newsgroupsdocs[0])



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!




## Part 2: Generating Embeddings
### Activity

1. Use one of two sentence-transformers models: `'all-MiniLM-L6-v2'` (SBERT) or `'intfloat/e5-base'` (E5). When using E5, prepend `"passage: "` to each document before encoding.
2. Generate the embedding vectors for all documents with the selected model.
3. Store the embeddings in a NumPy array for later indexing.
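
The E5 prefix from item 1 can be applied with a small helper before calling `encode`. This is a minimal sketch, not part of the exercise: `add_prefix` and `encode_passages` are hypothetical names, and `model` is assumed to be any object exposing `encode(list_of_str)`, such as a `SentenceTransformer` loaded with `'intfloat/e5-base'`.

```python
import numpy as np


def add_prefix(texts, prefix):
    """Return a new list with `prefix` prepended to every text."""
    return [prefix + text for text in texts]


def encode_passages(model, docs, use_e5_prefix=False):
    """Encode documents, adding the E5 "passage: " prefix when requested.

    `model` is assumed to expose `encode(list_of_str)`, as
    SentenceTransformer does; it is passed in rather than created here.
    """
    if use_e5_prefix:
        docs = add_prefix(docs, "passage: ")
    # np.asarray leaves the result untouched if it is already a NumPy array.
    return np.asarray(model.encode(docs))
```

With sentence-transformers installed, a call might look like `encode_passages(SentenceTransformer("intfloat/e5-base"), newsgroupsdocs, use_e5_prefix=True)`. For `all-MiniLM-L6-v2` the flag stays `False`, since that model takes the raw text.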

In [8]:
from sentence_transformers import SentenceTransformer
import numpy as np

# SBERT model
model_name = "all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)

# Generate embeddings for all documents
embeddings = model.encode(newsgroupsdocs, show_progress_bar=True)

# Store as a NumPy array (encode returns one by default; this makes it explicit)
embeddings = np.array(embeddings)

print("Embeddings shape:", embeddings.shape)

Batches: 100%|██████████| 63/63 [00:48<00:00,  1.29it/s]

Embeddings shape: (2000, 384)
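
To reuse the embeddings for later indexing without re-encoding the corpus, the array can be written to disk. The following is a minimal sketch using `np.save` and `np.load`. The helper names and the file name `embeddings.npy` are assumptions chosen for illustration, and the demo uses random data with the same width (384) as the output above rather than the real embeddings.

```python
import os
import tempfile

import numpy as np


def save_embeddings(embeddings, path):
    """Store the embeddings as float32 in NumPy's .npy format."""
    np.save(path, np.asarray(embeddings, dtype=np.float32))


def load_embeddings(path):
    """Load embeddings previously written by save_embeddings."""
    return np.load(path)


# Demo with random stand-in data shaped (n_docs, 384).
demo = np.random.default_rng(0).normal(size=(4, 384)).astype(np.float32)
demo_path = os.path.join(tempfile.mkdtemp(), "embeddings.npy")  # hypothetical file name
save_embeddings(demo, demo_path)
restored = load_embeddings(demo_path)
print(restored.shape, np.array_equal(demo, restored))
```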





## Part 3: Query
### Activity

1. Write a natural-language query. Examples:

    * "God, religion, and spirituality"
    * "space exploration"
    * "car maintenance"

2. Encode the query with the same embedding model. When using E5, prepend `"query: "` to the query.
3. Retrieve the 5 most relevant documents by cosine similarity.
4. Display the text of the retrieved documents (showing only the first 500 characters of each is fine).
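
Step 3 amounts to normalizing the vectors and taking dot products. Below is a minimal NumPy sketch of that computation, assuming the query is a 1-D vector, the documents are the rows of a 2-D array, and no vector is all zeros. `top_k_cosine` is a hypothetical helper, not a library function.

```python
import numpy as np


def top_k_cosine(query_vec, doc_matrix, k=5):
    """Return (indices, scores) of the k rows most similar to query_vec."""
    query_vec = np.asarray(query_vec, dtype=np.float64)
    doc_matrix = np.asarray(doc_matrix, dtype=np.float64)
    # Cosine similarity is the dot product of L2-normalized vectors.
    # Assumes no zero-norm vectors (they would cause division by zero).
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    # Sort by descending score and keep the first k indices.
    order = np.argsort(-scores)[:k]
    return order, scores[order]


# Toy example with 2-D vectors instead of real embeddings.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
idx, sims = top_k_cosine(np.array([2.0, 0.0]), docs, k=2)
print(idx, sims)
```

With an E5 model, the query string would be encoded as `"query: " + query` before its vector is passed to a function like this one.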

In [22]:
from sklearn.metrics.pairwise import cosine_similarity

# Query
query = "space exploration"
print(f"Query: {query}")

query_embedding = model.encode(query)

# Retrieve the 5 most relevant documents using cosine similarity
similarities = cosine_similarity([query_embedding], embeddings)[0]

# Indices of the 5 most similar documents, in descending order of similarity
top_5 = similarities.argsort()[-5:][::-1]

# Results
print(f"\nTop 5 most relevant documents for: '{query}'\n")
for i, idx in enumerate(top_5, 1):
    print("="*80)
    print(f"\n{i}. Document {idx} (Similarity: {similarities[idx]:.4f})")
    print(newsgroupsdocs[idx][:500])


Query: space exploration

Top 5 most relevant documents for: 'space exploration'


1. Document 495 (Similarity: 0.4991)
I am posting this for a friend without internet access. Please inquire
to the phone number and address listed.
---------------------------------------------------------------------

"Space: Teaching's Newest Frontier"
Sponsored by the Planetary Studies Foundation

The Planetary Studies Foundation is sponsoring a one week class for
teachers called "Space: Teaching's Newest Frontier." The class will be
held at the Sheraton Suites in Elk Grove, Illinois from June 14 through
June 18. Participants wh

2. Document 1643 (Similarity: 0.4398)

Well, here goes.

The first item of business is to establish the importance space life
sciences in the whole of scheme of humankind.  I mean compared
to football and baseball, the average joe schmoe doesn't seem interested
or even curious about spaceflight.  I think that this forum can
make a major change in that lack of insight an