# Stiven Saldaña

# Exercise 6: Introduction to Dense Retrieval
## Objective
Generate embeddings with sentence-transformers (SBERT, E5) and use them to retrieve the documents most similar to a query.

## Part 0: Loading the Corpus
## Activity
1. Load the 20 Newsgroups corpus with sklearn.datasets.fetch_20newsgroups.
2. Limit the corpus to the first 2000 documents to keep processing fast.

In [1]:
# Install the libraries needed to use the transformer models
!pip install sentence-transformers scikit-learn numpy



In [9]:
# Import the required libraries
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

In [10]:
# Choose between the SBERT and E5 models:
# uncomment the option you want to use
# and comment out the other one
model_name = 'all-MiniLM-L6-v2'
# model_name = 'intfloat/e5-base'
model = SentenceTransformer(model_name)

In [11]:
# Load the dataset with headers, footers, and quoted replies removed
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
docs_full = newsgroups.data
# Keep only the first 2000 documents
docs = docs_full[:2000]

In [38]:
docs[:5]

["\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n",
 'My brother is in the market for a high-performance video card that supports\nVESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:\n\n  - Diamond Stealth Pro Local Bus\n\n  - Orchid Farenheit 1280\n\n  - ATI Graphics Ultra Pro\n\n  - Any other high-per

## Part 2: Generating Embeddings
## Activity
1. Use two sentence-transformers models. You can use 'all-MiniLM-L6-v2' (SBERT) or 'intfloat/e5-base' (E5). When using E5, prepend "passage: " to each document before encoding.
2. Generate the embedding vectors for all documents with the selected model.
3. Store the embeddings in a NumPy array for later indexing.
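
Step 3 asks for the embeddings to be kept for later indexing. One way to avoid re-encoding the corpus in every session is to write the array to disk. The following is a minimal sketch, not part of the original exercise: the random `toy_embeddings` array stands in for the real output of `model.encode(...)`, and the file name is a hypothetical choice.

```python
import os
import tempfile

import numpy as np

# Stand-in for the real embeddings: 4 "documents" with 3 dimensions each.
# In the notebook this would be the array returned by model.encode(...).
toy_embeddings = np.random.default_rng(0).normal(size=(4, 3)).astype(np.float32)

# Write the array to disk in NumPy's binary .npy format.
# The directory and file name are hypothetical choices for this sketch.
out_path = os.path.join(tempfile.mkdtemp(), "doc_embeddings.npy")
np.save(out_path, toy_embeddings)

# Load it back later instead of re-encoding the corpus.
restored = np.load(out_path)
print(restored.shape, restored.dtype)
```

Embeddings saved this way are only meaningful together with the model that produced them, so it may help to include the model name in the file name.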

In [None]:

# Add the "passage: " prefix only when an E5 model is selected
texts_to_encode = docs
if "e5" in model_name:
    texts_to_encode = ["passage: " + doc for doc in docs]

# Generate the embedding vectors
doc_embeddings = model.encode(texts_to_encode, show_progress_bar=True)

# Store the embeddings as a NumPy array (encode already returns one by default)
doc_embeddings_np = np.array(doc_embeddings)
print(f"--> Embeddings generated. Array shape: {doc_embeddings_np.shape}")
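
The prefix logic appears twice in this notebook, once for documents ("passage: ") and once for the query ("query: "). As a possible tidy-up, the sketch below gathers both cases in one helper. The function name `add_e5_prefix` and its `role` parameter are hypothetical, introduced only for illustration. It uses the same substring check on the model name as the cells in this notebook, lowercased.

```python
def add_e5_prefix(texts, model_name, role):
    """Prepend the E5 role prefix when the model name looks like an E5 model.

    role is "passage" for documents and "query" for search queries.
    Other models (e.g. all-MiniLM-L6-v2) get the texts back unchanged.
    """
    if role not in ("passage", "query"):
        raise ValueError("role must be 'passage' or 'query'")
    if "e5" not in model_name.lower():
        return list(texts)
    return [f"{role}: {text}" for text in texts]


print(add_e5_prefix(["space exploration"], "intfloat/e5-base", "query"))
print(add_e5_prefix(["some document"], "all-MiniLM-L6-v2", "passage"))
```

Rejecting unknown roles early guards against silently encoding documents with a misspelled prefix.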

## Part 3: Query
## Activity
1. Write a natural-language query. Examples:
* "God, religion, and spirituality"
* "space exploration"
* "car maintenance"
2. Encode the query with the same embedding model. When using E5, prepend "query: " to the query.

3. Retrieve the 5 most relevant documents by cosine similarity.

4. Display the text of the retrieved documents (showing only the first 500 characters of each is fine).
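
For reference, the cosine similarity in step 3 is the standard definition. The symbols are notation assumed here, not taken from the exercise: $\mathbf{q}$ is the query embedding, $\mathbf{d}_i$ is the embedding of document $i$, and $n$ is the embedding dimension.

$$
\operatorname{sim}(\mathbf{q}, \mathbf{d}_i)
= \frac{\mathbf{q} \cdot \mathbf{d}_i}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d}_i \rVert}
= \frac{\sum_{j=1}^{n} q_j \, d_{ij}}{\sqrt{\sum_{j=1}^{n} q_j^2} \, \sqrt{\sum_{j=1}^{n} d_{ij}^2}}
$$

The value lies in $[-1, 1]$ and depends only on the angle between the two vectors, not on their lengths.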

In [14]:
# Define the query
query_text = "space exploration"

# Add the "query: " prefix only when an E5 model is selected
query_input = query_text
if "e5" in model_name:
    query_input = "query: " + query_text

# Encode the query
query_embedding = model.encode([query_input])

# Compute the cosine similarity between the query and every document.
# cosine_similarity returns a (1, n_docs) matrix, so take its first row
similarities = cosine_similarity(query_embedding, doc_embeddings_np)[0]

# Indices of the 5 highest-scoring documents, best first
top_k = 5
top_results_indices = np.argsort(similarities)[::-1][:top_k]
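
The same retrieval step can be written with NumPy alone, which makes the computation explicit. This is a minimal sketch on toy vectors rather than a replacement for the cell above: the function name `top_k_cosine` and the toy data are hypothetical, and it assumes no embedding is an all-zero vector.

```python
import numpy as np


def top_k_cosine(query_vec, doc_matrix, k=5):
    """Return (indices, scores) of the k rows of doc_matrix closest to query_vec."""
    # L2-normalize so that a plain dot product equals cosine similarity.
    # Assumes no all-zero rows, which would divide by zero.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    k = min(k, len(scores))
    # argpartition selects the k best without sorting everything;
    # only those k candidates are then sorted, best first.
    candidates = np.argpartition(-scores, k - 1)[:k]
    order = candidates[np.argsort(-scores[candidates])]
    return order, scores[order]


# Toy 2-dimensional "embeddings" for four documents and one query.
docs_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query_toy = np.array([2.0, 0.1])

indices, scores = top_k_cosine(query_toy, docs_toy, k=2)
print(indices, scores)
```

For 2000 documents a full `np.argsort` is already fast. The partial selection is mainly of interest for much larger collections.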

In [20]:
print(f"\nResults for SBERT with the query: '{query_text}'\n" + "="*40)

for rank, index in enumerate(top_results_indices):
    score = similarities[index]
    content = docs[index]

    print(f"\nTop {rank+1} (Score: {score:.4f})")
    print("-" * 20)
    # Show only the first 500 characters
    print(content[:500].strip())


Results for SBERT with the query: 'space exploration'

Top 1 (Score: 0.4991)
--------------------
I am posting this for a friend without internet access. Please inquire
to the phone number and address listed.
---------------------------------------------------------------------

"Space: Teaching's Newest Frontier"
Sponsored by the Planetary Studies Foundation

The Planetary Studies Foundation is sponsoring a one week class for
teachers called "Space: Teaching's Newest Frontier." The class will be
held at the Sheraton Suites in Elk Grove, Illinois from June 14 through
June 18. Participants wh

Top 2 (Score: 0.4398)
--------------------
Well, here goes.

The first item of business is to establish the importance space life
sciences in the whole of scheme of humankind.  I mean compared
to football and baseball, the average joe schmoe doesn't seem interested
or even curious about spaceflight.  I think that this forum can
make a major change in that lack of insight and education.

Al

# Using E5

## Model selection

In [None]:
# Choose between the SBERT and E5 models:
# uncomment the option you want to use and comment out the other one
# model_name = 'all-MiniLM-L6-v2'
model_name = 'intfloat/e5-base'
model = SentenceTransformer(model_name)

# Step 0

In [22]:
# Load the dataset with headers, footers, and quoted replies removed
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
docs_full = newsgroups.data
# Keep only the first 2000 documents
docs = docs_full[:2000]

In [39]:
docs[:5]

["\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n",
 'My brother is in the market for a high-performance video card that supports\nVESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:\n\n  - Diamond Stealth Pro Local Bus\n\n  - Orchid Farenheit 1280\n\n  - ATI Graphics Ultra Pro\n\n  - Any other high-per

# Step 2

In [None]:
# Add the "passage: " prefix only when an E5 model is selected
texts_to_encode = docs
if "e5" in model_name:
    texts_to_encode = ["passage: " + doc for doc in docs]

# Generate the embedding vectors
doc_embeddings = model.encode(texts_to_encode, show_progress_bar=True)

# Store the embeddings as a NumPy array (encode already returns one by default)
doc_embeddings_np = np.array(doc_embeddings)
print(f"--> Embeddings generated. Array shape: {doc_embeddings_np.shape}")

# Step 3

In [30]:
# Define the query
query_text2 = "space exploration"

# Add the "query: " prefix only when an E5 model is selected
query_input2 = query_text2
if "e5" in model_name:
    query_input2 = "query: " + query_text2

# Encode the query
query_embedding = model.encode([query_input2])

# Compute the cosine similarity between the query and every document
similarities = cosine_similarity(query_embedding, doc_embeddings_np)[0]

# Indices of the 5 highest-scoring documents, best first
top_k = 5
top_results_indices = np.argsort(similarities)[::-1][:top_k]
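
The E5 run reuses the variable names of the SBERT run, so the earlier ranking is overwritten. The raw scores in this notebook's outputs also sit on different scales (about 0.50 for SBERT versus about 0.82 for E5 at rank 1), so comparing absolute values across models says little. A rough alternative is to compare which documents each model retrieves. The sketch below assumes each ranking has been kept under its own name. `rank_overlap`, `top_sbert`, and `top_e5` are hypothetical, and the index lists are made-up illustration values, not results from this notebook.

```python
def rank_overlap(ranking_a, ranking_b):
    """Jaccard overlap between two top-k index lists (order is ignored)."""
    set_a, set_b = set(ranking_a), set(ranking_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Made-up top-5 document indices for two models (illustration only).
top_sbert = [12, 7, 40, 3, 99]
top_e5 = [7, 12, 55, 3, 8]

overlap = rank_overlap(top_sbert, top_e5)
print(f"Jaccard overlap of the two top-5 lists: {overlap:.3f}")
```

A value of 1.0 means both models returned the same set of documents, and 0.0 means they share none.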

# Results

In [32]:
print(f"\nResults for E5: '{query_text2}'\n" + "="*40)

for rank, index in enumerate(top_results_indices):
    score2 = similarities[index]
    content2 = docs[index]

    print(f"\nTop {rank+1} (Score: {score2:.4f})")
    print("-" * 20)
    # Show only the first 500 characters
    print(content2[:500].strip())


Results for E5: 'space exploration'

Top 1 (Score: 0.8190)
--------------------
AW&ST  had a brief blurb on a Manned Lunar Exploration confernce
May 7th  at Crystal City Virginia, under the auspices of AIAA.

Does anyone know more about this?  How much, to attend????

Anyone want to go?

Top 2 (Score: 0.8179)
--------------------
Well, here goes.

The first item of business is to establish the importance space life
sciences in the whole of scheme of humankind.  I mean compared
to football and baseball, the average joe schmoe doesn't seem interested
or even curious about spaceflight.  I think that this forum can
make a major change in that lack of insight and education.

All of us, in our own way, can contribute to a comprehensive document
which can be released to the general public around the world.  The
document would

Top 3 (Score: 0.8150)
--------------------
I am posting this for a friend without internet access. Please inquire
to the phone number and address listed.
---------