# Exercise 6: Dense Retrieval and Introduction to FAISS

## Lab Objective

Generate embeddings with sentence-transformers (SBERT, E5) and index documents with FAISS.

## Part 0: Loading the Corpus
### Activity

1. Load the 20 Newsgroups corpus with `sklearn.datasets.fetch_20newsgroups`.
2. Limit the corpus to the first 2000 documents to keep processing manageable.

In [7]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

# Load every split, stripping headers, footers and quoted replies
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
# Limit the corpus to the first 2000 documents
newsgroups.data = newsgroups.data[:2000]
newsgroupsdocs = newsgroups.data

# Show the id and text of each document in a DataFrame
corpus_df = pd.DataFrame({'id': range(len(newsgroupsdocs)), 'doc': newsgroupsdocs})
corpus_df


| | id | doc |
|---|---|---|
| 0 | 0 | \n\nI am sure some bashers of Pens fans are pr... |
| 1 | 1 | My brother is in the market for a high-perform... |
| 2 | 2 | \n\n\n\n\tFinally you said what you dream abou... |
| 3 | 3 | \nThink!\n\nIt's the SCSI card doing the DMA t... |
| 4 | 4 | 1) I have an old Jasmine drive which I cann... |
| ... | ... | ... |
| 1995 | 1995 | Oakland, California, Sunday, April 25th, 1:05 ... |
| 1996 | 1996 | \n\nNo matter how "absurd" it is to suggest th... |
| 1997 | 1997 | Anyone here know if NCD is doing educational p... |
| 1998 | 1998 | \ntoo bad he doesn't bring the ability to hit,... |


## Part 2: Generating Embeddings
### Activity

1. Use two sentence-transformers models, for example `'all-MiniLM-L6-v2'` (SBERT) and `'intfloat/e5-base'` (E5). When using E5, prepend `"passage: "` to each document before encoding.
2. Generate the embedding vectors for all documents using the selected model.
3. Store the embeddings in a NumPy array for later indexing.

In [8]:
%pip install -U sentence-transformers





In [14]:
from sentence_transformers import SentenceTransformer

sbert_model = SentenceTransformer('all-MiniLM-L6-v2')
# Encode every document into an embedding vector
corpus_embeddings = sbert_model.encode(corpus_df['doc'].tolist(), show_progress_bar=True, convert_to_tensor=True)
# Add the embeddings to the DataFrame
corpus_df['embeddings_sbert'] = corpus_embeddings.tolist()
# Print the shape of the embedding matrix (documents x dimensions)
print(corpus_embeddings.shape)
# Display the DataFrame with the embeddings
corpus_df

Batches: 100%|██████████| 63/63 [00:48<00:00,  1.31it/s]


torch.Size([2000, 384])


| | id | doc | embeddings_sbert |
|---|---|---|---|
| 0 | 0 | \n\nI am sure some bashers of Pens fans are pr... | [0.0020780046470463276, 0.02345043234527111, 0... |
| 1 | 1 | My brother is in the market for a high-perform... | [0.05006030574440956, 0.0269809328019619, -0.0... |
| 2 | 2 | \n\n\n\n\tFinally you said what you dream abou... | [0.016404753550887108, 0.08100050687789917, -0... |
| 3 | 3 | \nThink!\n\nIt's the SCSI card doing the DMA t... | [-0.01939147524535656, 0.011494365520775318, -... |
| 4 | 4 | 1) I have an old Jasmine drive which I cann... | [-0.03928707540035248, -0.05540286749601364, -... |
| ... | ... | ... | ... |
| 1995 | 1995 | Oakland, California, Sunday, April 25th, 1:05 ... | [0.044003989547491074, 0.03598788380622864, -0... |
| 1996 | 1996 | \n\nNo matter how "absurd" it is to suggest th... | [-0.08084699511528015, 0.017292389646172523, -... |
| 1997 | 1997 | Anyone here know if NCD is doing educational p... | [-0.07489252090454102, -0.0004223576979711652,... |
| 1998 | 1998 | \ntoo bad he doesn't bring the ability to hit,... | [0.0978073701262474, 0.042095087468624115, -0.... |


In [15]:
from sentence_transformers import SentenceTransformer

E5_Model = SentenceTransformer('intfloat/e5-base')
# Encode every document; E5 expects the "passage: " prefix on documents
corpus_embeddings_e5 = E5_Model.encode(
    ["passage: " + doc for doc in corpus_df['doc'].tolist()],
    show_progress_bar=True,
    convert_to_tensor=True
)
# Add the embeddings to the DataFrame
corpus_df['embeddings_e5'] = corpus_embeddings_e5.tolist()
corpus_df

Batches: 100%|██████████| 63/63 [08:05<00:00,  7.70s/it]


| | id | doc | embeddings_sbert | embeddings_e5 |
|---|---|---|---|---|
| 0 | 0 | \n\nI am sure some bashers of Pens fans are pr... | [0.0020780046470463276, 0.02345043234527111, 0... | [-0.057998958975076675, -0.0020638704299926758... |
| 1 | 1 | My brother is in the market for a high-perform... | [0.05006030574440956, 0.0269809328019619, -0.0... | [-0.047147322446107864, 0.00045925582526251674... |
| 2 | 2 | \n\n\n\n\tFinally you said what you dream abou... | [0.016404753550887108, 0.08100050687789917, -0... | [-0.03237044811248779, 0.024496663361787796, -... |
| 3 | 3 | \nThink!\n\nIt's the SCSI card doing the DMA t... | [-0.01939147524535656, 0.011494365520775318, -... | [-0.07731803506612778, 0.017821243032813072, -... |
| 4 | 4 | 1) I have an old Jasmine drive which I cann... | [-0.03928707540035248, -0.05540286749601364, -... | [-0.03879633918404579, 0.0034529452677816153, ... |
| ... | ... | ... | ... | ... |
| 1995 | 1995 | Oakland, California, Sunday, April 25th, 1:05 ... | [0.044003989547491074, 0.03598788380622864, -0... | [-0.05249633267521858, 0.03624464571475983, -0... |
| 1996 | 1996 | \n\nNo matter how "absurd" it is to suggest th... | [-0.08084699511528015, 0.017292389646172523, -... | [-0.006697574630379677, 0.031097760424017906, ... |
| 1997 | 1997 | Anyone here know if NCD is doing educational p... | [-0.07489252090454102, -0.0004223576979711652,... | [-0.04950818046927452, 0.032965317368507385, 0... |
| 1998 | 1998 | \ntoo bad he doesn't bring the ability to hit,... | [0.0978073701262474, 0.042095087468624115, -0.... | [-0.07545769214630127, 0.02335001528263092, -0... |
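
Step 3 of this part asks for the embeddings as a NumPy array, while the cells above request tensors (`convert_to_tensor=True`). The following is a minimal sketch of one way to do the conversion. `embeddings_to_numpy` is a hypothetical helper, not part of sentence-transformers, and the file name in the usage comments is only illustrative.

```python
import numpy as np


def embeddings_to_numpy(embeddings):
    """Convert encode() output (torch tensor, nested list or ndarray)
    into a 2-D float32 NumPy matrix of shape (n_docs, dim)."""
    # torch tensors expose .detach(); move them to CPU before converting
    if hasattr(embeddings, "detach"):
        embeddings = embeddings.detach().cpu().numpy()
    matrix = np.asarray(embeddings, dtype=np.float32)
    if matrix.ndim != 2:
        raise ValueError(f"expected shape (n_docs, dim), got {matrix.shape}")
    return matrix


# Usage with the notebook's variables:
# doc_embeddings = embeddings_to_numpy(corpus_embeddings)        # SBERT
# doc_embeddings_e5 = embeddings_to_numpy(corpus_embeddings_e5)  # E5
# np.save("embeddings_sbert.npy", doc_embeddings)  # illustrative file name
```

Passing `convert_to_numpy=True` to `encode` instead of `convert_to_tensor=True` would likely make the conversion unnecessary; the helper is only meant to bridge the cells as they are written.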


## Part 3: Indexing with FAISS
### Activity

1. Create a flat index with `faiss.IndexFlatL2` for Euclidean-distance search.
2. Make sure to use the correct dimension (`embedding_dim = doc_embeddings.shape[1]`).
3. Add the document vectors to the index.
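
The three steps above can be sketched as follows. This is a minimal example, assuming FAISS is installed (for instance through the `faiss-cpu` package) and that `doc_embeddings` is a 2-D matrix with one row per document. `build_flat_l2_index` is a hypothetical helper name, and `faiss` is imported inside the function only so the definition loads even where FAISS is missing.

```python
import numpy as np


def build_flat_l2_index(doc_embeddings):
    """Build an exact (brute-force) L2 index over the rows of doc_embeddings."""
    import faiss  # assumption: installed, e.g. with `pip install faiss-cpu`

    # FAISS expects a C-contiguous float32 matrix of shape (n_docs, dim)
    vectors = np.ascontiguousarray(doc_embeddings, dtype=np.float32)
    embedding_dim = vectors.shape[1]
    index = faiss.IndexFlatL2(embedding_dim)
    index.add(vectors)
    return index


# Usage with the notebook's variables:
# doc_embeddings = corpus_embeddings.cpu().numpy()  # tensor -> NumPy
# index = build_flat_l2_index(doc_embeddings)
# print(index.ntotal)  # number of vectors stored in the index
```

A flat index stores the vectors as they are and compares the query against every one of them, so results are exact; for 2000 documents this should be fast enough without an approximate index.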

## Part 4: Semantic Query
### Activity

1. Write a query in natural language. Examples:

    * "God, religion, and spirituality"
    * "space exploration"
    * "car maintenance"

2. Encode the query with the same embedding model used for the documents. When using E5, prepend `"query: "` to the query.
3. Retrieve the 5 most relevant documents with `index.search(...)`.
4. Display the text of the retrieved documents (showing only the first 500 characters of each is enough).
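
The query steps above could look roughly like this. It is a sketch under assumptions: `index` is a FAISS index built from the embeddings of the same model that encodes the query, and `format_query` and `semantic_search` are hypothetical helper names. The E5 check is a simple substring test on the model name, which is a convenience and not an official rule.

```python
import numpy as np


def format_query(query, model_name):
    """Prefix the query with 'query: ' for E5 models; leave it unchanged otherwise."""
    return "query: " + query if "e5" in model_name.lower() else query


def semantic_search(query, model, model_name, index, docs, k=5, max_chars=500):
    """Return (doc_id, distance, text_preview) for the k nearest documents."""
    query_vec = model.encode([format_query(query, model_name)], convert_to_numpy=True)
    # FAISS expects a float32 matrix of shape (n_queries, dim)
    query_vec = np.ascontiguousarray(query_vec, dtype=np.float32)
    distances, doc_ids = index.search(query_vec, k)
    return [
        (int(doc_id), float(dist), docs[int(doc_id)][:max_chars])
        for dist, doc_id in zip(distances[0], doc_ids[0])
        if doc_id != -1  # FAISS pads with -1 when fewer than k vectors exist
    ]


# Usage with the notebook's variables (index built from the SBERT embeddings):
# results = semantic_search("space exploration", sbert_model,
#                           "all-MiniLM-L6-v2", index, newsgroupsdocs)
# for doc_id, dist, preview in results:
#     print(doc_id, dist, preview)
```

With `faiss.IndexFlatL2` the returned distances are squared Euclidean distances, so smaller values mean closer matches. To query the E5 embeddings, the index would need to be built from `corpus_embeddings_e5` and the query encoded with `E5_Model`.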