# Exercise 6: Dense Retrieval and Introduction to FAISS

## Objective

Generate embeddings with sentence-transformers (SBERT, E5) and index documents with FAISS.

Name: Darlin Anacicha

## Part 0: Loading the Corpus
### Activity

1. Load the 20 Newsgroups corpus with `sklearn.datasets.fetch_20newsgroups`.
2. Limit the corpus to the first 2000 documents to keep processing manageable.

In [1]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
newsgroupsdocs = newsgroups.data[:2000]
len(newsgroupsdocs)

2000

## Part 2: Generating Embeddings
### Activity

1. Use two sentence-transformers models. You can use `'all-MiniLM-L6-v2'` (SBERT) or `'intfloat/e5-base'` (E5). When using E5, prepend `"passage: "` to each document before encoding.
2. Generate the embedding vectors for all documents with the selected model.
3. Store the embeddings in a NumPy array for later indexing.
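
The E5 prefixing rule from step 1 can be sketched as follows. This is a minimal illustration, not part of the original notebook: the helper names `add_prefix` and `encode_passages_e5` are hypothetical, and the model import is kept inside the function so the prefixing step can be tried without downloading anything.

```python
def add_prefix(texts, prefix):
    """Return a new list with `prefix` prepended to every text."""
    return [prefix + text for text in texts]


def encode_passages_e5(docs, model_name="intfloat/e5-base"):
    """Encode documents with E5, applying the "passage: " prefix first.

    Hypothetical helper: the import lives here so that add_prefix can be
    used even when sentence-transformers is not installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(add_prefix(docs, "passage: "), show_progress_bar=True)


# Prefixing step alone, on made-up sample texts (no model download needed)
sample = ["The shuttle launch was delayed.", "Change the oil every 5000 miles."]
prefixed = add_prefix(sample, "passage: ")
print(prefixed[0])
```

The same `add_prefix` helper would cover the query side by passing `"query: "` instead, which is the prefix the Part 3 activity asks for.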

In [2]:
# Install sentence-transformers
!pip install sentence-transformers

from sentence_transformers import SentenceTransformer
import numpy as np

# 1. Load the SBERT model
model = SentenceTransformer('all-MiniLM-L6-v2')

# 2. Generate the embeddings
embeddings = model.encode(newsgroupsdocs, show_progress_bar=True)

# 3. Make sure the result is a NumPy array
embeddings = np.array(embeddings)




In [3]:
# Save the embeddings array to disk
np.save("embeddings_2000docs.npy", embeddings)
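
To reuse the saved array later (for example, at indexing time), it can be read back with `np.load`. The sketch below is illustrative only: it uses a random stand-in array and a temporary file named `embeddings_demo.npy` (both assumptions) rather than the notebook's real embeddings, and simply checks that the round trip preserves shape, dtype, and values.

```python
import os
import tempfile

import numpy as np

# Stand-in array with the shape this notebook would produce:
# 2000 documents x 384 dimensions (the output size of all-MiniLM-L6-v2).
# The values are random, not real embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.standard_normal((2000, 384)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings_demo.npy")
    np.save(path, fake_embeddings)   # write the array in .npy format
    loaded = np.load(path)           # read it back into memory

print(loaded.shape, loaded.dtype)
```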

## Part 3: Query
### Activity

1. Write a query in natural language. Examples:

    * "God, religion, and spirituality"
    * "space exploration"
    * "car maintenance"

2. Encode the query with the same embedding model. When using E5, prepend `"query: "` to the query.
3. Retrieve the 5 most relevant documents by cosine similarity.
4. Display the text of the retrieved documents (showing only the first 500 characters of each is enough).
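
Step 3 relies on cosine similarity, which for a query vector $q$ and a document vector $d$ is $\cos(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}$. A minimal NumPy sketch of the retrieval step is shown below. The function name `top_k_cosine` and the toy 2-D vectors are made up for illustration, and the sketch assumes no vector has zero norm.

```python
import numpy as np


def top_k_cosine(query_vec, doc_matrix, k=5):
    """Return (indices, scores) of the k rows most similar to query_vec.

    Assumes neither the query nor any document row is the zero vector.
    """
    query_vec = np.asarray(query_vec, dtype=np.float64)
    doc_matrix = np.asarray(doc_matrix, dtype=np.float64)
    # Normalize to unit length so the dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    # Sort by descending similarity and keep the first k positions
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]


# Toy 2-D "embeddings" (illustrative values only)
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([2.0, 0.0])
idx, sims = top_k_cosine(query, docs, k=2)
print(idx, sims)
```

Normalizing once and then taking dot products is also the usual way to get cosine scores out of an inner-product search, which is why the vectors are often normalized before indexing.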

In [4]:
# Define the query
query = "God, religion, and spirituality"
# Encode the query with the same model used for the documents
query_embedding = model.encode(query)

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Compute cosine similarity between the query and every document
similarities = cosine_similarity(
    [query_embedding],
    embeddings
)[0]  # take the single row to get a 1-D vector

# Retrieve the indices of the 5 most similar documents
top_k = 5
indices = np.argsort(similarities)[::-1][:top_k]

# Show the retrieved documents
for rank, idx in enumerate(indices, start=1):
    print(f"\n=== Document #{rank} (ID: {idx}) - Similarity: {similarities[idx]:.4f} ===")
    print(newsgroupsdocs[idx][:500], "...")  # show only the first 500 characters



=== Document #1 (ID: 996) - Similarity: 0.4150 ===




Humanist, or sub-humanist? :-) ...

=== Document #2 (ID: 282) - Similarity: 0.3307 ===

I didn't know God was a secular humanist...

Kent ...

=== Document #3 (ID: 677) - Similarity: 0.3013 ===

(Deletion)

For me, it is a "I believe no gods exist" and a "I don't believe gods exist".

In other words, I think that statements like gods are or somehow interfere
with this world are false or meaningless. In Ontology, one can fairly
conclude that when "A exist" is meaningless A does not exist. Under the
Pragmatic definition of truth, "A exists" is meaningless makes A exist
even logically false.

A problem with such statements is that one can't disprove a subjective god
by definition, and ...

=== Document #4 (ID: 943) - Similarity: 0.2878 ===


Atoms are not objective.  They aren't even real.  What scientists call
an atom is nothing more than a mathematical model that describes 
certain physical, observable properties of our surroun