# Exercise 6: Dense Retrieval and Introduction to FAISS

## Lab objective
Generate embeddings with sentence-transformers (SBERT, E5) and index documents with FAISS.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Part 0: Loading the Corpus
### Activity

1. Load the 20 Newsgroups corpus with sklearn.datasets.fetch_20newsgroups.
2. Limit the corpus to the first 2000 documents to keep processing manageable.



In [2]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
newsgroupsdocs = newsgroups.data[:2000]    # keep only the first 2000 documents

In [3]:
import pandas as pd
df = pd.DataFrame(newsgroupsdocs)
df

|  | 0 |
|---|---|
| 0 | \n\nI am sure some bashers of Pens fans are pr... |
| 1 | My brother is in the market for a high-perform... |
| 2 | \n\n\n\n\tFinally you said what you dream abou... |
| 3 | \nThink!\n\nIt's the SCSI card doing the DMA t... |
| 4 | 1) I have an old Jasmine drive which I cann... |
| ... | ... |
| 1995 | Oakland, California, Sunday, April 25th, 1:05 ... |
| 1996 | \n\nNo matter how "absurd" it is to suggest th... |
| 1997 | Anyone here know if NCD is doing educational p... |
| 1998 | \ntoo bad he doesn't bring the ability to hit,... |


## Part 2: Embedding Generation
### Activity
1. Use two sentence-transformers models. You can use 'all-MiniLM-L6-v2' (SBERT) or 'intfloat/e5-base' (E5). When using E5, prepend "passage: " to each document before encoding.
2. Generate the embedding vectors for all documents with the selected model.
3. Store the embeddings in a NumPy array for later indexing.
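The E5 prefix rule in step 1 can be sketched as follows. This is a minimal illustration, not the notebook's own code: `add_prefix` and `encode_passages_e5` are hypothetical helper names, and the sample sentences are made up. The model is only loaded when `encode_passages_e5` is actually called, so the prefix logic can be checked without downloading anything.

```python
from typing import List

import numpy as np


def add_prefix(texts: List[str], prefix: str) -> List[str]:
    """Prepend an E5-style role prefix ("passage: " or "query: ") to each text."""
    return [prefix + text for text in texts]


def encode_passages_e5(docs: List[str], model_name: str = "intfloat/e5-base") -> np.ndarray:
    """Hypothetical helper: encode documents with an E5 model.

    Downloads the model on first use, so it is not called in this sketch.
    """
    # Imported lazily so add_prefix works even without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    passages = add_prefix(docs, "passage: ")
    # normalize_embeddings=True returns unit-length vectors,
    # which is convenient for cosine similarity later on.
    return np.asarray(model.encode(passages, normalize_embeddings=True))


# Made-up sample documents, only to show the prefixing step.
sample_docs = ["The shuttle launch was delayed.", "Change the oil every 5,000 miles."]
prefixed = add_prefix(sample_docs, "passage: ")
print(prefixed[0])
```

With 'all-MiniLM-L6-v2' no prefix is needed, so the documents can be passed to `encode` unchanged.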

In [4]:
# dependencies
!pip install sentence-transformers
!pip install scikit-learn



In [5]:
# Select the model: SBERT
from sentence_transformers import SentenceTransformer
import numpy as np

model_name = "all-MiniLM-L6-v2"

In [7]:
model = SentenceTransformer(model_name)

docs_to_encode = newsgroupsdocs

# Generate embeddings
embeddings = model.encode(docs_to_encode)

In [8]:
# Convert to a NumPy array (encode() already returns one by default, so this is only a safeguard)
embeddings = np.array(embeddings)


In [9]:
print("Embeddings shape:", embeddings.shape)

Embeddings shape: (2000, 384)
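Since the embeddings are meant for later indexing, one possible next step is to persist them and L2-normalize them, so that an inner-product index returns cosine similarities. The sketch below uses a small random array as a stand-in for the real `(2000, 384)` matrix; `l2_normalize`, `build_flat_ip_index`, and the file name are assumptions for illustration, and the FAISS function is defined but not called because it requires `faiss-cpu` (or `faiss-gpu`) to be installed separately.

```python
import os
import tempfile

import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Return a float32 copy with every row scaled to unit L2 norm."""
    vectors = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)  # guard against zero-length rows


def build_flat_ip_index(unit_vectors: np.ndarray):
    """Hypothetical helper: exact inner-product FAISS index over unit vectors."""
    import faiss  # imported lazily; assumed to be installed separately

    index = faiss.IndexFlatIP(unit_vectors.shape[1])
    index.add(np.ascontiguousarray(unit_vectors, dtype=np.float32))
    return index


# Toy stand-in for the real embeddings array.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(10, 384)).astype(np.float32)

# Round-trip through a .npy file (the path is a placeholder).
path = os.path.join(tempfile.mkdtemp(), "embeddings.npy")
np.save(path, toy_embeddings)
loaded = np.load(path)

unit = l2_normalize(loaded)
print(loaded.shape, unit.dtype)
```

On unit-length vectors, inner product and cosine similarity coincide, which is why a flat inner-product index is a natural exact baseline before trying approximate indexes.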


## Part 3: Query
### Activity
1. Write a natural-language query. Examples:
* "God, religion, and spirituality"
* "space exploration"
* "car maintenance"
2. Encode the query with the same embedding model. When using E5, prepend "query: " to the query.

3. Retrieve the 5 most relevant documents by cosine similarity.

4. Show the text of the retrieved documents (you may show only the first 500 characters of each).
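For reference, the cosine-similarity ranking in step 3 can be written directly in NumPy. This is a minimal sketch on made-up 2-D vectors rather than real embeddings, and `top_k_cosine` is a hypothetical helper name.

```python
import numpy as np


def top_k_cosine(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5):
    """Return (indices, scores) of the k documents most similar to the query."""
    # Scale the query and each document row to unit length.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    # The dot product of unit vectors is their cosine similarity.
    sims = d @ q
    # Sort ascending, reverse to descending, keep the first k.
    top = np.argsort(sims)[::-1][:k]
    return top, sims[top]


# Made-up toy vectors standing in for document and query embeddings.
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query_vec = np.array([1.0, 0.1])

top_idx, top_scores = top_k_cosine(query_vec, doc_vecs, k=2)
print(top_idx, top_scores)
```

Sorting all scores is fine for 2000 documents; for much larger collections, `np.argpartition` or a dedicated index avoids the full sort.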


In [10]:
from sklearn.metrics.pairwise import cosine_similarity

query = "space exploration"


In [11]:
query_to_encode = query

# Encode the query
query_embedding = model.encode([query_to_encode])

# Compute cosine similarity between the query and every document
scores = cosine_similarity(query_embedding, embeddings)[0]

# Get the top 5 (indices sorted by descending score)
top_k = 5
top_indices = scores.argsort()[::-1][:top_k]

In [13]:
print("Top 5 relevant documents")
for idx in top_indices:
    print(f"--- Document {idx} (score={scores[idx]:.4f}) ---")
    print(newsgroupsdocs[idx][:500].replace("\n"," "))
    print("\n")

Top 5 relevant documents
--- Document 495 (score=0.4991) ---
I am posting this for a friend without internet access. Please inquire to the phone number and address listed. ---------------------------------------------------------------------  "Space: Teaching's Newest Frontier" Sponsored by the Planetary Studies Foundation  The Planetary Studies Foundation is sponsoring a one week class for teachers called "Space: Teaching's Newest Frontier." The class will be held at the Sheraton Suites in Elk Grove, Illinois from June 14 through June 18. Participants wh


--- Document 1643 (score=0.4398) ---
 Well, here goes.  The first item of business is to establish the importance space life sciences in the whole of scheme of humankind.  I mean compared to football and baseball, the average joe schmoe doesn't seem interested or even curious about spaceflight.  I think that this forum can make a major change in that lack of insight and education.  All of us, in our own way, can contribute to a