# Exercise 6: Dense Retrieval and Introduction to FAISS

## Objective

Generate embeddings with sentence-transformers (SBERT, E5) and index documents with FAISS.

## Part 0: Loading the Corpus
### Activity

1. Load the 20 Newsgroups corpus with `sklearn.datasets.fetch_20newsgroups`.
2. Limit the corpus to the first 2000 documents to keep processing manageable.

In [None]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
newsgroupsdocs = newsgroups.data

In [None]:
corpus = newsgroupsdocs[:2000]

In [None]:
len(corpus)

2000

## Part 2: Generating Embeddings
### Activity

1. Use two sentence-transformers models. You can use `'all-MiniLM-L6-v2'` (SBERT) and `'intfloat/e5-base'` (E5). When using E5, prepend `"passage: "` to each document before encoding.
2. Generate the embedding vectors for all documents with each selected model.
3. Store the embeddings in a NumPy array for later indexing.
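
Step 3 can be sketched as follows. This is a minimal example, assuming the embeddings are already a 2-D NumPy array. The file name `embeddings_demo.npy` and the random stand-in array are hypothetical, chosen so the snippet runs without downloading a model.

```python
import os
import tempfile

import numpy as np

# Stand-in for the output of model.encode(..., convert_to_numpy=True):
# 5 "documents" with 384 dimensions (the size produced by all-MiniLM-L6-v2).
rng = np.random.default_rng(0)
demo_embeddings = rng.standard_normal((5, 384)).astype(np.float32)

# Hypothetical file name; any writable path works.
path = os.path.join(tempfile.gettempdir(), "embeddings_demo.npy")

np.save(path, demo_embeddings)  # write the array to disk
restored = np.load(path)        # read it back for later indexing

print(restored.shape, restored.dtype)
```

Keeping the array as `float32` is convenient because that is the dtype FAISS indexes expect.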

The following code loads Alberto (SBERT) and encodes our corpus with it.

In [None]:
import numpy as np
from sentence_transformers import SentenceTransformer

alberto = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
embeddings_alberto = alberto.encode(
    corpus,
    show_progress_bar=True,
    convert_to_numpy=True
)

We do the same with Efraín (E5) to obtain its embeddings. First, we prepend the string `"passage: "` to every document.

In [None]:
efrain = SentenceTransformer('intfloat/e5-base')

In [None]:
corpus_modified = ["passage: " + doc for doc in corpus]

embedding_efrain = efrain.encode(
    corpus_modified,
    show_progress_bar=True,
    convert_to_numpy=True
)

## Part 3: Query
### Activity

1. Write a query in natural language. Examples:

    * "God, religion, and spirituality"
    * "space exploration"
    * "car maintenance"

2. Encode the query with the same embedding model used for the documents. When using E5, prepend `"query: "` to the query.
3. Retrieve the 5 most relevant documents by cosine similarity.
4. Show the text of the retrieved documents (you may show only the first 500 characters of each).
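
The E5 prefix convention from step 2, together with the `"passage: "` prefix used for documents, can be wrapped in a small helper. This is only an illustrative sketch: `add_e5_prefix` is a hypothetical name and is not part of sentence-transformers.

```python
def add_e5_prefix(texts, kind):
    """Prepend the E5 prefix for `kind` ("query" or "passage") to each text."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    # E5 expects "query: ..." for queries and "passage: ..." for documents.
    return [f"{kind}: {text}" for text in texts]


queries = add_e5_prefix(["space exploration"], kind="query")
passages = add_e5_prefix(["The shuttle launched today."], kind="passage")

print(queries)
print(passages)
```

Routing both prefixes through one function makes it harder to encode a query with the document prefix by mistake.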

We use scikit-learn's `cosine_similarity` to compute cosine similarity. The function below ranks the documents and prints the top 5 together with their scores.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

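def ranking(query, embeddings, corpus):
    # query has shape (1, dim); row 0 holds one score per document
    puntajes = cosine_similarity(query, embeddings)[0]

    # Indices of the 5 highest scores, best first
    # (named top_indices so it does not shadow the function name)
    top_indices = np.argsort(puntajes)[::-1][:5]

    for i in top_indices:
        print("------------------------------------------------------------------------------------------")
        print(f"Numero de documento: {i}")
        print(f"Similitud: {puntajes[i]}")
        # Show only the first 500 characters of each document
        print(f"Documento: {corpus[i][:500]}")
        print()
        print()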

The function above works with the embeddings from either model.
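
For reference, the cosine-similarity ranking performed inside `ranking` can also be written directly in NumPy. The sketch below is a hedged illustration on toy 2-D vectors, not real embeddings. `top_k_cosine` is a hypothetical helper, and it assumes that no vector has zero norm.

```python
import numpy as np


def top_k_cosine(query_vec, doc_matrix, k=5):
    """Return (indices, scores) of the k rows of doc_matrix closest to query_vec."""
    query_vec = np.asarray(query_vec, dtype=np.float64).reshape(-1)
    doc_matrix = np.asarray(doc_matrix, dtype=np.float64)
    # Cosine similarity = dot product divided by the product of the norms.
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    query_norm = np.linalg.norm(query_vec)
    scores = doc_matrix @ query_vec / (doc_norms * query_norm)
    # Sort in descending order and keep the first k indices.
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]


# Toy data: four "documents" in 2-D (hypothetical values).
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.0])

indices, scores = top_k_cosine(query, docs, k=2)
print(indices, scores)
```

Here document 0 points in the same direction as the query (score 1.0), and document 2 sits at 45 degrees (score about 0.707), so they are the top two.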

Let's now try Alberto (SBERT): we take a query and compute its embedding with Alberto's model.

In [None]:
query_embedding = alberto.encode("space exploration", convert_to_numpy=True).reshape(1, -1)

In [None]:
ranking(query_embedding, embeddings_alberto, corpus)

------------------------------------------------------------------------------------------
Numero de documento: 495
Similitud: 0.49910426139831543
Documento: I am posting this for a friend without internet access. Please inquire
to the phone number and address listed.
---------------------------------------------------------------------

"Space: Teaching's Newest Frontier"
Sponsored by the Planetary Studies Foundation

The Planetary Studies Foundation is sponsoring a one week class for
teachers called "Space: Teaching's Newest Frontier." The class will be
held at the Sheraton Suites in Elk Grove, Illinois from June 14 through
June 18. Participants wh


------------------------------------------------------------------------------------------
Numero de documento: 1643
Similitud: 0.43979012966156006
Documento: 
Well, here goes.

The first item of business is to establish the importance space life
sciences in the whole of scheme of humankind.  I mean compared
to football and baseball, the

Finally, let's use Efraín's model (E5) to compute the query embedding. Remember that we must prepend the string `"query: "` to the query, not to the documents.

In [None]:
query_embedding = efrain.encode("query: space exploration", convert_to_numpy=True).reshape(1, -1)

In [None]:
ranking(query_embedding, embedding_efrain, corpus)

------------------------------------------------------------------------------------------
Numero de documento: 25
Similitud: 0.8190337419509888
Documento: AW&ST  had a brief blurb on a Manned Lunar Exploration confernce
May 7th  at Crystal City Virginia, under the auspices of AIAA.

Does anyone know more about this?  How much, to attend????

Anyone want to go?


------------------------------------------------------------------------------------------
Numero de documento: 1643
Similitud: 0.8179237246513367
Documento: 
Well, here goes.

The first item of business is to establish the importance space life
sciences in the whole of scheme of humankind.  I mean compared
to football and baseball, the average joe schmoe doesn't seem interested
or even curious about spaceflight.  I think that this forum can
make a major change in that lack of insight and education.

All of us, in our own way, can contribute to a comprehensive document
which can be released to the general public around the wor