**Exercise 6: Introduction to Dense Retrieval**

**Name:** Aarón Yumancela

**Objective**

*   Generate embeddings with sentence-transformers (SBERT, E5) and use them to retrieve documents.





**Part 0: Loading the Corpus**

In [1]:

# 0. Install libraries

!pip install -q sentence-transformers faiss-cpu scikit-learn

In [2]:

# 1. Basic imports

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
import faiss


In [3]:

# 2. Part 0: Load the 20 Newsgroups corpus


newsgroups = fetch_20newsgroups(
    subset='train',
    remove=('headers', 'footers', 'quotes')
)

docs = newsgroups.data[:2000]  # list of strings (first 2,000 posts)

print(f"Number of documents in the corpus (used): {len(docs)}")
print("Sample document:\n")
print(docs[0][:500])


Number of documents in the corpus (used): 2000
Sample document:

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.
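
Stripping headers, footers, and quotes can leave some posts empty or containing only whitespace, and the notebook does not check for this. The following is a minimal sketch of one way to filter such posts while remembering each kept document's original position. The helper `drop_empty` and the toy list are hypothetical and not part of the original exercise.

```python
def drop_empty(texts):
    """Return the non-blank texts plus their original positions."""
    kept_texts, kept_ids = [], []
    for i, text in enumerate(texts):
        if text.strip():          # skip empty or whitespace-only posts
            kept_texts.append(text)
            kept_ids.append(i)
    return kept_texts, kept_ids

# Toy data standing in for newsgroups.data (illustration only)
toy_docs = ["first post", "", "   \n", "second post"]
clean_docs, original_ids = drop_empty(toy_docs)
print(clean_docs)
print(original_ids)
```

Keeping `original_ids` makes it possible to map a retrieved position back to the unfiltered corpus if needed.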


**Part 2: Embedding Generation and Part 3: Querying**

In [4]:

# 3. Part 2 (a): Embeddings with SBERT (all-MiniLM-L6-v2)


model_sbert = SentenceTransformer('all-MiniLM-L6-v2')

emb_sbert = model_sbert.encode(
    docs,
    show_progress_bar=True,
    convert_to_numpy=True,
    batch_size=32
)

print("Shape embeddings SBERT:", emb_sbert.shape)

# L2-normalize each row so that the inner product equals cosine similarity
emb_sbert_norm = emb_sbert / np.linalg.norm(emb_sbert, axis=1, keepdims=True)

# Save to disk
np.save('embeddings_sbert.npy', emb_sbert_norm)



Shape embeddings SBERT: (2000, 384)
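
The normalization step in the cell above relies on a simple identity: for vectors scaled to unit L2 norm, the inner product equals the cosine similarity of the original vectors. The dependency-free sketch below checks this on two made-up 3-dimensional vectors (not real embeddings). The helper names `dot`, `l2_normalize`, and `cosine` are illustrative only.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale v to unit L2 norm (assumes v is not the zero vector)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity computed directly from its definition."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Toy vectors, made up for illustration only
a = [1.0, 2.0, 3.0]
b = [4.0, 0.0, -1.0]

ip_after_norm = dot(l2_normalize(a), l2_normalize(b))
print(ip_after_norm, cosine(a, b))
assert math.isclose(ip_after_norm, cosine(a, b))
```

As an alternative to dividing by the norm manually, `SentenceTransformer.encode` accepts `normalize_embeddings=True`, which should yield equivalent unit vectors.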


In [5]:

# 4. FAISS index for SBERT (cosine similarity as a dot product)


d_sbert = emb_sbert_norm.shape[1]        # embedding dimension
index_sbert = faiss.IndexFlatIP(d_sbert) # IP = inner product
index_sbert.add(emb_sbert_norm)

print("Number of vectors in the SBERT index:", index_sbert.ntotal)


Number of vectors in the SBERT index: 2000
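
`faiss.IndexFlatIP` is a flat index: it compares the query against every stored vector and returns those with the largest inner product. A rough pure-Python sketch of that idea follows, on toy 2-dimensional unit vectors. The function `top_k_inner_product` is a hypothetical helper for illustration, not FAISS's implementation.

```python
def top_k_inner_product(query, vectors, k):
    """Return (scores, indices) of the k vectors with the largest inner product."""
    scored = [
        (sum(q * x for q, x in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    ]
    # Highest score first; ties broken by the lower index
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    top = scored[:k]
    return [s for s, _ in top], [i for _, i in top]

# Toy unit vectors standing in for normalized embeddings
toy_vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [-1.0, 0.0],
]
toy_query = [0.8, 0.6]

scores, indices = top_k_inner_product(toy_query, toy_vectors, k=2)
print(scores, indices)
```

With only 2,000 documents, exhaustive search like this is cheap. Approximate indexes become worth considering only for much larger corpora.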


In [6]:

# 5. Part 3 (a): Query + top-5 with SBERT


query = "space exploration"

# Encode the query and normalize it the same way as the documents
q_vec = model_sbert.encode([query], convert_to_numpy=True)
q_vec = q_vec / np.linalg.norm(q_vec, axis=1, keepdims=True)

# FAISS search (top-5)
D_sbert, I_sbert = index_sbert.search(q_vec, 5)

print("SBERT scores:", D_sbert)
print("SBERT indices:", I_sbert)

# Show the retrieved documents (first 500 characters)
for rank, idx in enumerate(I_sbert[0]):
    print(f"==== [SBERT] Document {rank+1} (index {idx}, score {D_sbert[0][rank]:.4f}) ====")
    print(docs[idx][:500].replace("\n", " "))
    print("\n" + "="*80 + "\n")


SBERT scores: [[0.40292785 0.39050123 0.38140684 0.35752726 0.34078732]]
SBERT indices: [[1665  545  579  467 1450]]
==== [SBERT] Document 1 (index 1665, score 0.4029) ====
For an essay, I am writing about the space shuttle and a need for a better propulsion system.  Through research, I have found that it is rather clumsy  (i.e. all the checks/tests before launch), the safety hazards ("sitting on a hydrogen bomb"), etc..  If you have any beefs about the current space shuttle program Re: propulsion, please send me your ideas.  Thanks a lot. 


==== [SBERT] Document 2 (index 545, score 0.3905) ====
Archive-name: space/astronaut Last-modified: $Date: 93/04/01 14:39:02 $  HOW TO BECOME AN ASTRONAUT      First the short form, authored by Henry Spencer, then an official NASA     announcement.      Q. How do I become an astronaut?      A. We will assume you mean a NASA astronaut, since it's probably     impossible for a non-Russian to get into the cosmonaut corps (paying     passengers ar

In [7]:

# 6. Part 2 (b): Embeddings with E5 (intfloat/e5-base)
#    - prepend "passage: " to each document


model_e5 = SentenceTransformer('intfloat/e5-base')

docs_passage = ["passage: " + d for d in docs]

emb_e5 = model_e5.encode(
    docs_passage,
    show_progress_bar=True,
    convert_to_numpy=True,
    batch_size=16
)

print("Shape embeddings E5:", emb_e5.shape)

# L2-normalize each row for cosine similarity
emb_e5_norm = emb_e5 / np.linalg.norm(emb_e5, axis=1, keepdims=True)

# Save to disk
np.save('embeddings_e5.npy', emb_e5_norm)



Shape embeddings E5: (2000, 768)
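
E5 models expect a role prefix on every input: `"passage: "` for documents, as in the cell above, and `"query: "` for search queries. Because a missing or mismatched prefix is easy to overlook, one option is to keep both prefixes in one place. The sketch below assumes hypothetical helper names (`as_passages`, `as_query`) that do not appear in the original notebook.

```python
PASSAGE_PREFIX = "passage: "
QUERY_PREFIX = "query: "

def as_passages(texts):
    """Prefix each document for indexing with an E5 model."""
    return [PASSAGE_PREFIX + t for t in texts]

def as_query(text):
    """Prefix a search query for an E5 model."""
    return QUERY_PREFIX + text

# Toy inputs for illustration only
print(as_passages(["NASA launch schedule", "Car engine specs"]))
print(as_query("space exploration"))
```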


In [8]:

# 7. FAISS index for E5


d_e5 = emb_e5_norm.shape[1]
index_e5 = faiss.IndexFlatIP(d_e5)
index_e5.add(emb_e5_norm)

print("Number of vectors in the E5 index:", index_e5.ntotal)


Number of vectors in the E5 index: 2000


In [9]:

# 8. Part 3 (b): Query + top-5 with E5
#    - prepend "query: " to the query


query = "space exploration"  # same query as before, so the results can be compared

query_e5 = "query: " + query

q_e5 = model_e5.encode([query_e5], convert_to_numpy=True)
q_e5 = q_e5 / np.linalg.norm(q_e5, axis=1, keepdims=True)

D_e5, I_e5 = index_e5.search(q_e5, 5)

print("E5 scores:", D_e5)
print("E5 indices:", I_e5)

for rank, idx in enumerate(I_e5[0]):
    print(f"==== [E5] Document {rank+1} (index {idx}, score {D_e5[0][rank]:.4f}) ====")
    print(docs[idx][:500].replace("\n", " "))
    print("\n" + "="*80 + "\n")


E5 scores: [[0.8182965  0.81018317 0.8098317  0.8073841  0.8050183 ]]
E5 indices: [[ 343 1665  533  953  762]]
==== [E5] Document 1 (index 343, score 0.8183) ====
 In fact, you probably want to avoid US Government anything for such a project.  The pricetag is invariably too high, either in money or in hassles.  The important thing to realize here is that the big cost of getting to the Moon is getting into low Earth orbit.  Everything else is practically down in the noise.  The only part of getting to the Moon that poses any new problems, beyond what you face in low orbit, is the last 10km -- the actual landing -- and that is not immensely difficult.  Of c


==== [E5] Document 2 (index 1665, score 0.8102) ====
For an essay, I am writing about the space shuttle and a need for a better propulsion system.  Through research, I have found that it is rather clumsy  (i.e. all the checks/tests before launch), the safety hazards ("sitting on a hydrogen bomb"), etc..  If you have any beefs ab
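
The two models score on very different scales for the same query (about 0.34 to 0.40 for SBERT versus about 0.81 to 0.82 for E5), so the raw scores are not directly comparable. Comparing the ranked lists is more informative. The sketch below measures how much the two top-5 lists overlap, using the indices copied from the outputs above. The helper `overlap_at_k` is hypothetical.

```python
# Top-5 indices copied from the SBERT and E5 outputs above
top5_sbert = [1665, 545, 579, 467, 1450]
top5_e5 = [343, 1665, 533, 953, 762]

def overlap_at_k(a, b):
    """Return the shared items and the Jaccard overlap of two result lists."""
    shared = set(a) & set(b)
    union = set(a) | set(b)
    return sorted(shared), len(shared) / len(union)

shared, jaccard = overlap_at_k(top5_sbert, top5_e5)
print("Shared documents:", shared)
print(f"Jaccard overlap: {jaccard:.3f}")
```

For this query the lists share only document 1665 (the space shuttle propulsion post), which SBERT ranks first and E5 ranks second.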