# Exam - Information Retrieval System

**Name:** Josune Singaña
**Date:** 28 January 2025

This notebook implements a complete Information Retrieval system using:
- **Dataset:** arXiv Dataset. Scholarly articles, from the vast branches of physics to the many subdisciplines of computer science
- **Pipeline:** Preprocessing → Embeddings → Vector Search → Re-ranking
- **Evaluation:** Precision@k and Recall@k, computed against synthetically generated qrels


## 1. Installing Dependencies


In [5]:
# Install the required libraries
!pip install ir-datasets
!pip install sentence-transformers
!pip install faiss-cpu
!pip install nltk
!pip install pandas
!pip install numpy
!pip install tqdm
!pip install scikit-learn
!pip install kaggle

print("✓ Instalación completada")

✓ Instalación completada


In [2]:
# Import all libraries
import ir_datasets
import pandas as pd
import numpy as np
import nltk
import re
import faiss
from tqdm import tqdm
from collections import defaultdict
from sentence_transformers import SentenceTransformer, CrossEncoder
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.metrics.pairwise import cosine_similarity

# Download NLTK resources
nltk.download('stopwords')
nltk.download('punkt')

print("✓ Librerías importadas exitosamente")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


✓ Librerías importadas exitosamente


## 2. Downloading and Loading the arXiv Dataset

This section downloads and loads the **arXiv Metadata** dataset.

In [7]:
import os
# Configure credentials (upload your kaggle.json to Colab)
os.environ['KAGGLE_CONFIG_DIR'] = "/content"  # Or the directory that contains the JSON file
# Download through the Kaggle API
!kaggle datasets download -d Cornell-University/arxiv
# Unzip the archive
!unzip arxiv.zip

Dataset URL: https://www.kaggle.com/datasets/Cornell-University/arxiv
License(s): CC0-1.0
Downloading arxiv.zip to /content
 97% 1.51G/1.56G [00:10<00:01, 38.4MB/s]
100% 1.56G/1.56G [00:11<00:00, 152MB/s] 
Archive:  arxiv.zip
  inflating: arxiv-metadata-oai-snapshot.json  


In [13]:
# Load the arXiv dataset from arxiv-metadata-oai-snapshot.json
import json

# Parsed records, one dict per paper
arxiv_data = []

# The file holds one JSON object per line, so read it line by line
with open('arxiv-metadata-oai-snapshot.json', 'r') as f:
    for i, line in enumerate(f):
        if i >= 50000:  # Limit to the first 50k documents
            break
        arxiv_data.append(json.loads(line))

print(f"Dataset cargado: arXiv")
print(f"Total de documentos cargados: {len(arxiv_data)}")

Dataset cargado: arXiv
Total de documentos cargados: 50000


In [14]:
# Show sample documents
print("EJEMPLO DE DOCUMENTOS")

for i in range(min(3, len(arxiv_data))):
    doc = arxiv_data[i]
    print(f"Document ID: {doc['id']}")
    print(f"Título: {doc['title']}")
    print(f"Categorías: {doc['categories']}")
    abstract = doc['abstract'].replace('\n', ' ')
    print(f"Abstract (primeros 300 caracteres): {abstract[:300]}...")
    print("-" * 80 + "\n")

EJEMPLO DE DOCUMENTOS
Document ID: 0704.0001
Título: Calculation of prompt diphoton production cross sections at Tevatron and
  LHC energies
Categorías: hep-ph
Abstract (primeros 300 caracteres):   A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark, gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as a...
--------------------------------------------------------------------------------

Document ID: 0704.0002
Título: Sparsity-certifying Graph Decompositions
Categorías: math.CO cs.CG
Abstract (primeros 300 caracteres):   We describe a new algorithm, the $(k,\ell)$-pebble game with colors, and use it obtain a characterization of the family of $(k,\ell)$-sparse graphs and algorithmic solutions to a family of problems concerning tree decompositions of graphs. Special instances of sparse graphs appear 

## 3. Data Preprocessing

We apply the following preprocessing techniques:
1. **Tokenization:** Split the text into words
2. **Normalization:** Convert to lowercase
3. **Stopword removal:** Remove common words that carry little meaning
4. **Stemming:** Reduce words to their root form

In [17]:
# Initialize preprocessing tools
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def clean_text(text):
    """Lowercase the text and keep only letters and single spaces."""
    text = text.lower()
    # Replace special characters and digits with spaces
    text = re.sub(r'[^a-z\s]', ' ', text)
    # Collapse runs of whitespace
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

def tokenize(text):
    """Split the text into individual words on whitespace."""
    return text.split()

def remove_stopwords(tokens):
    """Drop stopwords and tokens of two characters or fewer."""
    return [word for word in tokens if word not in stop_words and len(word) > 2]

def apply_stemming(tokens):
    """Apply Porter stemming to each token."""
    return [stemmer.stem(token) for token in tokens]

def preprocess_text(text):
    """Run the full preprocessing pipeline and return a space-joined string."""
    text = clean_text(text)
    tokens = tokenize(text)
    tokens = remove_stopwords(tokens)
    tokens = apply_stemming(tokens)
    return ' '.join(tokens)

In [18]:
# Test the preprocessing on an example
sample_text = "The International Conflict in the Middle East has escalated dramatically!"
processed_text = preprocess_text(sample_text)

print("Texto original:")
print(sample_text)
print("\nTexto procesado:")
print(processed_text)

Texto original:
The International Conflict in the Middle East has escalated dramatically!

Texto procesado:
intern conflict middl east escal dramat


In [27]:
# Preprocess all arXiv documents
print("Preprocesando documentos arXiv...")

documents = []
doc_ids = []
processed_docs = []

for doc in tqdm(arxiv_data, desc="Preprocesando documentos"):
    doc_ids.append(doc['id'])
    # Combine title and abstract for better search
    full_text = f"{doc['title']} {doc['abstract']}"
    documents.append(full_text)
    processed_docs.append(preprocess_text(full_text))

print(f"\n {len(documents)} documentos preprocesados")
print(f"Ejemplo de documento original: {documents[0][:200]}...")
print(f"\nEjemplo de documento procesado: {processed_docs[0][:200]}...")

Preprocesando documentos arXiv...


Preprocesando documentos: 100%|██████████| 50000/50000 [01:20<00:00, 623.53it/s]


 50000 documentos preprocesados
Ejemplo de documento original: Calculation of prompt diphoton production cross sections at Tevatron and
  LHC energies   A fully differential calculation in perturbative quantum chromodynamics is
presented for the production of mas...

Ejemplo de documento procesado: calcul prompt diphoton product cross section tevatron lhc energi fulli differenti calcul perturb quantum chromodynam present product massiv photon pair hadron collid next lead order perturb contribut ...





In [29]:
# Create synthetic queries based on arXiv categories
print("CREANDO QUERIES SINTÉTICAS")

class Query:
    def __init__(self, query_id, title, description, narrative):
        self.query_id = query_id
        self.title = title
        self.description = description
        self.narrative = narrative

queries_list = [
    Query("Q1", "machine learning algorithms",
          "Papers about machine learning and deep learning algorithms",
          "Relevant documents discuss machine learning techniques, neural networks, and AI algorithms."),
    Query("Q2", "quantum computing",
          "Research on quantum computing and quantum information",
          "Relevant papers cover quantum algorithms, quantum mechanics applications in computing."),
    Query("Q3", "computer vision",
          "Papers about image processing and computer vision",
          "Documents about image recognition, object detection, and visual analysis."),
    Query("Q4", "natural language processing",
          "NLP and text analysis research",
          "Papers covering text mining, language models, and computational linguistics."),
    Query("Q5", "astrophysics cosmology",
          "Research in astrophysics and cosmology",
          "Documents about stars, galaxies, dark matter, and universe structure."),
    Query("Q6", "molecular biology genetics",
          "Papers on molecular biology and genetics",
          "Research on DNA, RNA, proteins, and genetic mechanisms."),
    Query("Q7", "climate change models",
          "Climate modeling and environmental science",
          "Papers about climate prediction, global warming, and environmental impacts."),
    Query("Q8", "cryptography security",
          "Cryptographic methods and security protocols",
          "Documents covering encryption, security protocols, and data protection."),
    Query("Q9", "robotics automation",
          "Research on robotics and autonomous systems",
          "Papers about robot control, autonomous navigation, and mechatronics."),
    Query("Q10", "renewable energy",
          "Studies on renewable energy sources",
          "Documents about solar, wind, and sustainable energy technologies.")
]

# Show examples
for i, query in enumerate(queries_list[:5]):
    print(f"Query ID: {query.query_id}")
    print(f"Title: {query.title}")
    print(f"Description: {query.description}")
    print(f"Narrative: {query.narrative}")
    print("-" * 80 + "\n")

print(f"✓ Total de queries creadas: {len(queries_list)}\n")

CREANDO QUERIES SINTÉTICAS
Query ID: Q1
Title: machine learning algorithms
Description: Papers about machine learning and deep learning algorithms
Narrative: Relevant documents discuss machine learning techniques, neural networks, and AI algorithms.
--------------------------------------------------------------------------------

Query ID: Q2
Title: quantum computing
Description: Research on quantum computing and quantum information
Narrative: Relevant papers cover quantum algorithms, quantum mechanics applications in computing.
--------------------------------------------------------------------------------

Query ID: Q3
Title: computer vision
Description: Papers about image processing and computer vision
Narrative: Documents about image recognition, object detection, and visual analysis.
--------------------------------------------------------------------------------

Query ID: Q4
Title: natural language processing
Description: NLP and text analysis research
Narrative: Papers coverin

In [33]:
# Create synthetic qrels based on categories
from collections import defaultdict

print("GENERANDO QRELS SINTÉTICOS")

qrels_dict = defaultdict(dict)

# Map each query to its relevant arXiv categories
query_category_map = {
    "Q1": ["cs.LG", "cs.AI", "stat.ML"],          # machine learning
    "Q2": ["quant-ph", "cs.ET"],                   # quantum computing
    "Q3": ["cs.CV"],                               # computer vision
    "Q4": ["cs.CL", "cs.AI"],                      # NLP
    "Q5": ["astro-ph"],                            # astrophysics
    "Q6": ["q-bio.GN", "q-bio.MN"],               # genetics
    "Q7": ["physics.ao-ph", "physics.geo-ph"],    # climate
    "Q8": ["cs.CR"],                               # cryptography
    "Q9": ["cs.RO"],                               # robotics
    "Q10": ["physics.soc-ph", "eess.SY"],         # renewable energy
}

# Assign relevance based on categories
print("Asignando relevancia a documentos...")
for doc in tqdm(arxiv_data, desc="Generando qrels"):
    doc_categories = doc['categories'].split()

    for query_id, relevant_cats in query_category_map.items():
        relevance = 0
        for cat in doc_categories:
            # Category starts with a listed category = highly relevant
            if any(cat.startswith(rel_cat) for rel_cat in relevant_cats):
                relevance = 2
                break
            # Same top-level archive (text before the first '.') = partially relevant
            elif any(cat.split('.')[0] == rel_cat.split('.')[0] for rel_cat in relevant_cats):
                relevance = 1

        if relevance > 0:
            qrels_dict[query_id][doc['id']] = relevance

# Show statistics
print("ESTADÍSTICAS DE QRELS")
for query_id in sorted(qrels_dict.keys()):
    num_relevant = len(qrels_dict[query_id])
    num_highly_relevant = sum(1 for rel in qrels_dict[query_id].values() if rel == 2)
    print(f"{query_id}: {num_relevant} docs relevantes ({num_highly_relevant} altamente relevantes)")

print(f"\nTotal de pares query-doc relevantes: {sum(len(docs) for docs in qrels_dict.values())}")
print(f"Queries con documentos relevantes: {len(qrels_dict)}")

# Show a few examples
print("EJEMPLOS DE QRELS")

count = 0
for query_id, docs in qrels_dict.items():
    for doc_id, relevance in list(docs.items())[:2]:
        print(f"Query: {query_id} | Doc: {doc_id} | Relevance: {relevance}")
        count += 1
        if count >= 10:
            break
    if count >= 10:
        break

print("\nQrels generados exitosamente\n")

GENERANDO QRELS SINTÉTICOS
Asignando relevancia a documentos...


Generando qrels: 100%|██████████| 50000/50000 [00:02<00:00, 20488.98it/s]

ESTADÍSTICAS DE QRELS
Q1: 3474 docs relevantes (297 altamente relevantes)
Q10: 4007 docs relevantes (523 altamente relevantes)
Q2: 5797 docs relevantes (3155 altamente relevantes)
Q3: 2689 docs relevantes (58 altamente relevantes)
Q4: 2689 docs relevantes (236 altamente relevantes)
Q5: 10272 docs relevantes (10272 altamente relevantes)
Q6: 846 docs relevantes (192 altamente relevantes)
Q7: 4007 docs relevantes (219 altamente relevantes)
Q8: 2689 docs relevantes (145 altamente relevantes)
Q9: 2689 docs relevantes (97 altamente relevantes)

Total de pares query-doc relevantes: 39159
Queries con documentos relevantes: 10
EJEMPLOS DE QRELS
Query: Q1 | Doc: 0704.0002 | Relevance: 1
Query: Q1 | Doc: 0704.0046 | Relevance: 1
Query: Q2 | Doc: 0704.0002 | Relevance: 1
Query: Q2 | Doc: 0704.0034 | Relevance: 2
Query: Q3 | Doc: 0704.0002 | Relevance: 1
Query: Q3 | Doc: 0704.0046 | Relevance: 1
Query: Q4 | Doc: 0704.0002 | Relevance: 1
Query: Q4 | Doc: 0704.0046 | Relevance: 1
Query: Q8 | Doc: 070
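
In the statistics above, Q3, Q4, Q8 and Q9 each report 2689 relevant documents, and Q7 and Q10 each report 4007. That pattern is consistent with the archive-prefix rule (`cat.split('.')[0]`) marking every `cs.*` or `physics.*` paper as partially relevant to any query whose categories share that prefix. For example, document 0704.0002 (`math.CO cs.CG`) receives relevance 1 for the computer vision query. The sketch below shows a stricter alternative that drops the prefix fallback. It is an assumption about one possible fix, not the notebook's method, and `strict_relevance` is a hypothetical helper name.

```python
def strict_relevance(doc_categories, relevant_cats):
    """Hypothetical helper: return 2 only when a document category equals a
    listed category or is a sub-category of it (e.g. 'astro-ph.CO' under
    'astro-ph'); return 0 otherwise. No archive-prefix fallback is applied."""
    for cat in doc_categories:
        for rel_cat in relevant_cats:
            if cat == rel_cat or cat.startswith(rel_cat + "."):
                return 2
    return 0


# A math/computational-geometry paper is no longer tied to computer vision.
print(strict_relevance("math.CO cs.CG".split(), ["cs.CV"]))
# An exact category match is still graded as highly relevant.
print(strict_relevance("cs.CV cs.LG".split(), ["cs.CV"]))
# Sub-categories of a listed archive still match.
print(strict_relevance("astro-ph.CO".split(), ["astro-ph"]))
```

With a rule like this the qrels would become binary, so the graded levels would need a different source, such as primary versus cross-listed category.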




In [35]:
# Preprocess the queries
print("Preprocesando queries...")

query_ids = []
query_texts = []
processed_queries = []

for query in queries_list:
    query_ids.append(query.query_id)
    # Combine title and description for better retrieval
    full_query = f"{query.title} {query.description}"
    query_texts.append(full_query)
    processed_queries.append(preprocess_text(full_query))

print(f"\n{len(query_texts)} queries preprocesadas")
print(f"\nEjemplo de query original: {query_texts[0]}")
print(f"Ejemplo de query procesada: {processed_queries[0]}")

Preprocesando queries...

10 queries preprocesadas

Ejemplo de query original: machine learning algorithms Papers about machine learning and deep learning algorithms
Ejemplo de query procesada: machin learn algorithm paper machin learn deep learn algorithm


## 4. Embedding Generation

We use Sentence-BERT (a pre-trained model) to generate semantic embeddings of:
- The corpus documents
- The search queries

Model used: all-MiniLM-L6-v2 (fast and efficient for embeddings)

In [36]:
# Load the embedding model
print("Cargando modelo de embeddings...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
print(f"✓ Modelo cargado: {embedding_model}")
print(f"Dimensión de embeddings: {embedding_model.get_sentence_embedding_dimension()}")

Cargando modelo de embeddings...


✓ Modelo cargado: SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
Dimensión de embeddings: 384
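
The module list printed above shows mean pooling (`pooling_mode_mean_tokens: True`) followed by a `Normalize()` step. As a rough illustration of what those two stages do to per-token vectors, here is a minimal pure-Python sketch. The function name `mean_pool_and_normalize` and the toy vectors are hypothetical, and the sketch ignores padding masks, which a real pooling layer has to handle.

```python
import math


def mean_pool_and_normalize(token_vectors):
    """Average the token vectors component-wise, then scale to unit L2 norm.

    Hypothetical illustration only: assumes at least one token and a
    non-zero mean vector.
    """
    dim = len(token_vectors[0])
    count = len(token_vectors)
    mean = [sum(tok[i] for tok in token_vectors) / count for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]


# Three toy 2-dimensional "token embeddings"
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
sentence_embedding = mean_pool_and_normalize(tokens)
print(sentence_embedding)
```

Because the output has unit length, the inner product of two such embeddings equals their cosine similarity.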


In [37]:
# Generate embeddings for the documents
print("Generando embeddings para documentos...")
# Encode the original texts, not the stemmed ones, for better semantic embeddings
batch_size = 32
doc_embeddings = embedding_model.encode(
    documents,
    batch_size=batch_size,
    show_progress_bar=True,
    convert_to_numpy=True
)

print(f"\nEmbeddings de documentos generados")
print(f"Shape: {doc_embeddings.shape}")
print(f"Tipo de datos: {doc_embeddings.dtype}")

Generando embeddings para documentos...




Embeddings de documentos generados
Shape: (50000, 384)
Tipo de datos: float32


In [39]:
# Generate embeddings for the queries
print("Generando embeddings para queries...")

query_embeddings = embedding_model.encode(
    query_texts,
    show_progress_bar=True,
    convert_to_numpy=True
)

print(f"\n Embeddings de queries generados")
print(f"Shape: {query_embeddings.shape}")

Generando embeddings para queries...




 Embeddings de queries generados
Shape: (10, 384)


## 5. Building the FAISS Index

We use **FAISS** (Facebook AI Similarity Search) to build an efficient vector index that supports:
- Fast nearest-neighbor search
- Retrieval of the top-k candidate documents

In [41]:
# Build the FAISS index
print("Construyendo índice FAISS...")

# Embedding dimension
dimension = doc_embeddings.shape[1]

# Flat index: exact brute-force search (accurate but slower)
# For larger datasets, IndexIVFFlat trades some accuracy for speed
index = faiss.IndexFlatIP(dimension)  # IP = inner product

# L2-normalize the embeddings in place so that inner product equals cosine similarity
faiss.normalize_L2(doc_embeddings)

# Add the vectors to the index
index.add(doc_embeddings)

print(f"\nÍndice FAISS construido")
print(f"Total de vectores en el índice: {index.ntotal}")
print(f"Dimensión: {dimension}")

Construyendo índice FAISS...

Índice FAISS construido
Total de vectores en el índice: 50000
Dimensión: 384
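
The index above relies on the fact that, once vectors are scaled to unit L2 norm, their inner product equals their cosine similarity. The following pure-Python check illustrates that identity on two toy vectors. The helper names (`l2_normalize`, `dot`, `cosine`) are hypothetical and are not part of FAISS.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Return a copy of vec scaled to unit L2 norm (assumes vec is non-zero)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

ip_after_norm = dot(l2_normalize(a), l2_normalize(b))
# The two values agree up to floating-point rounding.
print(ip_after_norm, cosine(a, b))
```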


## 6. Initial Retrieval (First-Stage Retrieval)

We implement vector search to retrieve the **top-k candidate documents** for each query.

In [42]:
# Initial retrieval function
def retrieve_initial_candidates(query_embedding, k=100):
    """Return the scores and index positions of the top-k documents most similar to the query."""
    # Reshape to a (1, d) float32 array and L2-normalize it
    query_embedding = query_embedding.reshape(1, -1).astype('float32')
    faiss.normalize_L2(query_embedding)

    # Search for the k nearest neighbors
    distances, indices = index.search(query_embedding, k)

    return distances[0], indices[0]

print("Función de recuperación inicial definida")

Función de recuperación inicial definida


In [43]:
# Test the initial retrieval with an example query
test_query_idx = 0
test_query = query_texts[test_query_idx]
test_query_id = query_ids[test_query_idx]

print(f"Query de prueba: '{test_query}'")
print(f"Query ID: {test_query_id}")

# Retrieve the top-10 documents
scores, retrieved_indices = retrieve_initial_candidates(query_embeddings[test_query_idx], k=10)

print("\nTOP 10 DOCUMENTOS RECUPERADOS (Recuperación Inicial):")
print("="*80)

for rank, (idx, score) in enumerate(zip(retrieved_indices, scores), 1):
    doc_id = doc_ids[idx]
    doc_text = documents[idx][:200]
    print(f"\nRank {rank}: Score={score:.4f}")
    print(f"Doc ID: {doc_id}")
    print(f"Texto: {doc_text}...")
    print("-" * 80)

Query de prueba: 'machine learning algorithms Papers about machine learning and deep learning algorithms'
Query ID: Q1

TOP 10 DOCUMENTOS RECUPERADOS (Recuperación Inicial):

Rank 1: Score=0.4051
Doc ID: 0708.2321
Texto: Fast learning rates for plug-in classifiers   It has been recently shown that, under the margin (or low noise) assumption,
there exist classifiers attaining fast rates of convergence of the excess Bay...
--------------------------------------------------------------------------------

Rank 2: Score=0.3910
Doc ID: 0707.0303
Texto: Learning from dependent observations   In most papers establishing consistency for learning algorithms it is assumed
that the observations used for training are realizations of an i.i.d. process.
In t...
--------------------------------------------------------------------------------

Rank 3: Score=0.3900
Doc ID: 0709.1201
Texto: On the Proof Complexity of Deep Inference   We obtain two results about the proof complexity of deep inference: 1)


In [44]:
# Retrieve candidates for all queries
K_CANDIDATES = 100  # Number of candidates to retrieve per query
initial_results = {}

for query_id, query_emb in tqdm(zip(query_ids, query_embeddings), total=len(query_ids)):
    scores, indices = retrieve_initial_candidates(query_emb, k=K_CANDIDATES)

    # Store the results as (doc_id, score) pairs
    initial_results[query_id] = [
        (doc_ids[idx], float(score))
        for idx, score in zip(indices, scores)
    ]

print(f"\nRecuperación inicial completada para {len(initial_results)} queries")
print(f"Promedio de documentos por query: {np.mean([len(docs) for docs in initial_results.values()]):.1f}")

100%|██████████| 10/10 [00:00<00:00, 52.61it/s]


Recuperación inicial completada para 10 queries
Promedio de documentos por query: 100.0





## 7. Re-ranking the Results

We use a **cross-encoder** to re-rank the candidate documents.

Cross-encoders are generally more accurate than bi-encoders because they process the query and the document together, but they are slower, which is why we apply them only to the top-k candidates.
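
The retrieve-then-rerank pattern can be sketched independently of any model. In the toy example below, a cheap scorer selects candidates from the whole corpus and a costlier scorer reorders only those candidates. Both scorers (`word_overlap`, `bigram_overlap`) and the function `two_stage_search` are hypothetical stand-ins chosen for illustration. They are not the bi-encoder or cross-encoder used in this notebook.

```python
def two_stage_search(query, corpus, cheap_score, expensive_score, n_candidates, top_k):
    """Rank with a cheap scorer first, then rescore only the best candidates."""
    # Stage 1: score every document cheaply and keep the best n_candidates.
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:n_candidates]
    # Stage 2: apply the expensive scorer to the candidates only.
    reranked = sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)
    return reranked[:top_k]


def word_overlap(query, doc):
    """Toy first-stage scorer: number of shared words, ignoring order."""
    return len(set(query.split()) & set(doc.split()))


def bigram_overlap(query, doc):
    """Toy second-stage scorer: number of query bigrams that occur in the document."""
    q, d = query.split(), doc.split()
    doc_bigrams = set(zip(d, d[1:]))
    return sum(1 for bigram in zip(q, q[1:]) if bigram in doc_bigrams)


corpus = [
    "learning machine parts catalog",
    "machine learning for physics",
    "galaxy formation models",
]

# The first two documents tie on word overlap; the bigram scorer breaks the tie.
print(two_stage_search("machine learning", corpus, word_overlap, bigram_overlap,
                       n_candidates=2, top_k=1))
```

The second stage can only reorder what the first stage returns, so a relevant document missed by the first stage cannot be recovered by re-ranking.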

In [45]:
# Load the re-ranking model
print("Cargando modelo de re-ranking (Cross-Encoder)...")
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
print(f"Modelo de re-ranking cargado: {reranker}")

Cargando modelo de re-ranking (Cross-Encoder)...


Modelo de re-ranking cargado: CrossEncoder(
  (model): BertForSequenceClassification(
    (bert): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(30522, 384, padding_idx=0)
        (position_embeddings): Embedding(512, 384)
        (token_type_embeddings): Embedding(2, 384)
        (LayerNorm): LayerNorm((384,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0-5): 6 x BertLayer(
            (attention): BertAttention(
              (self): BertSdpaSelfAttention(
                (query): Linear(in_features=384, out_features=384, bias=True)
                (key): Linear(in_features=384, out_features=384, bias=True)
                (value): Linear(in_features=384, out_features=384, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_f

In [46]:
# Re-ranking function
def rerank_documents(query_text, candidate_doc_ids, top_k=10):
    """Re-rank the candidates with the cross-encoder and return the top_k (doc_id, score) pairs."""
    # Look up the text of each candidate document
    doc_id_to_idx = {doc_id: idx for idx, doc_id in enumerate(doc_ids)}
    candidate_texts = [documents[doc_id_to_idx[doc_id]] for doc_id in candidate_doc_ids]

    # Build (query, document) pairs for the cross-encoder
    pairs = [[query_text, doc_text] for doc_text in candidate_texts]

    # Compute relevance scores
    scores = reranker.predict(pairs)

    # Sort by descending score
    ranked_results = sorted(
        zip(candidate_doc_ids, scores),
        key=lambda x: x[1],
        reverse=True
    )

    return ranked_results[:top_k]

print("Función de re-ranking definida")

Función de re-ranking definida


In [47]:
# Test re-ranking with the same example query
print(f"Query de prueba: '{test_query}'")

# Take the initial candidates (top 50 for re-ranking)
initial_candidate_ids = [doc_id for doc_id, _ in initial_results[test_query_id][:50]]

# Re-rank
print("Aplicando re-ranking...")
reranked_results = rerank_documents(test_query, initial_candidate_ids, top_k=10)
print("\nTOP 10 DOCUMENTOS DESPUÉS DE RE-RANKING:")

doc_id_to_idx = {doc_id: idx for idx, doc_id in enumerate(doc_ids)}
for rank, (doc_id, score) in enumerate(reranked_results, 1):
    idx = doc_id_to_idx[doc_id]
    doc_text = documents[idx][:200]
    print(f"\nRank {rank}: Score={score:.4f}")
    print(f"Doc ID: {doc_id}")
    print(f"Texto: {doc_text}...")
    print("-" * 80)

Query de prueba: 'machine learning algorithms Papers about machine learning and deep learning algorithms'
Aplicando re-ranking...

TOP 10 DOCUMENTOS DESPUÉS DE RE-RANKING:

Rank 1: Score=2.3719
Doc ID: 0712.4126
Texto: TRUST-TECH based Methods for Optimization and Learning   Many problems that arise in machine learning domain deal with nonlinearity
and quite often demand users to obtain global optimal solutions rath...
--------------------------------------------------------------------------------

Rank 2: Score=1.4214
Doc ID: 0707.0303
Texto: Learning from dependent observations   In most papers establishing consistency for learning algorithms it is assumed
that the observations used for training are realizations of an i.i.d. process.
In t...
--------------------------------------------------------------------------------

Rank 3: Score=1.1026
Doc ID: 0712.1027
Texto: Kernels and Ensembles: Perspectives on Statistical Learning   Since their emergence in the 1990's, the support vector

In [48]:
# Re-rank for all queries
print("Aplicando re-ranking a todas las queries...")

RERANK_TOP_K = 50  # How many candidates to re-rank per query
FINAL_TOP_K = 10   # How many final documents to return

reranked_results = {}

for query_id, query_text in tqdm(zip(query_ids, query_texts), total=len(query_ids), desc="Re-ranking"):
    # Take the top initial candidates
    initial_candidates = [doc_id for doc_id, _ in initial_results[query_id][:RERANK_TOP_K]]

    # Re-rank them
    reranked = rerank_documents(query_text, initial_candidates, top_k=FINAL_TOP_K)
    reranked_results[query_id] = reranked

print(f"\nRe-ranking completado para {len(reranked_results)} queries")

Aplicando re-ranking a todas las queries...


Re-ranking: 100%|██████████| 10/10 [00:02<00:00,  4.32it/s]


Re-ranking completado para 10 queries





## 8. Query Simulation

We run multiple queries and show the results before and after re-ranking.

In [51]:
# Function to display the results of a query
def display_query_results(query_id, query_text, initial_results_list, reranked_results_list, top_n=5):
    print("\n" + "="*100)
    print(f"QUERY ID: {query_id}")
    print(f"QUERY TEXT: {query_text}")
    print("="*100)

    doc_id_to_idx = {doc_id: idx for idx, doc_id in enumerate(doc_ids)}

    # Initial results
    print(f"\n{'RECUPERACIÓN INICIAL (Top-' + str(top_n) + ')':^100}")
    print("-" * 100)
    for rank, (doc_id, score) in enumerate(initial_results_list[:top_n], 1):
        if doc_id in doc_id_to_idx:
            idx = doc_id_to_idx[doc_id]
            doc_snippet = documents[idx][:150]
            print(f"\n[{rank}] Score: {score:.4f} | Doc ID: {doc_id}")
            print(f"    {doc_snippet}...")

    # Re-ranked results
    print(f"\n\n{'DESPUÉS DE RE-RANKING (Top-' + str(top_n) + ')':^100}")
    print("-" * 100)
    for rank, (doc_id, score) in enumerate(reranked_results_list[:top_n], 1):
        if doc_id in doc_id_to_idx:
            idx = doc_id_to_idx[doc_id]
            doc_snippet = documents[idx][:150]
            print(f"\n[{rank}] Score: {score:.4f} | Doc ID: {doc_id}")
            print(f"    {doc_snippet}...")



print("Función de visualización definida")

Función de visualización definida


In [52]:
# Show results for the first 3 queries
print("VISUALIZACIÓN DE RESULTADOS PARA QUERIES DE MUESTRA")

for i in range(min(3, len(query_ids))):
    qid = query_ids[i]
    qtext = query_texts[i]
    initial = initial_results[qid]
    reranked = reranked_results[qid]

    display_query_results(qid, qtext, initial, reranked, top_n=5)

VISUALIZACIÓN DE RESULTADOS PARA QUERIES DE MUESTRA

QUERY ID: Q1
QUERY TEXT: machine learning algorithms Papers about machine learning and deep learning algorithms

                                    RECUPERACIÓN INICIAL (Top-5)                                    
----------------------------------------------------------------------------------------------------

[1] Score: 0.4051 | Doc ID: 0708.2321
    Fast learning rates for plug-in classifiers   It has been recently shown that, under the margin (or low noise) assumption,
there exist classifiers att...

[2] Score: 0.3910 | Doc ID: 0707.0303
    Learning from dependent observations   In most papers establishing consistency for learning algorithms it is assumed
that the observations used for tr...

[3] Score: 0.3900 | Doc ID: 0709.1201
    On the Proof Complexity of Deep Inference   We obtain two results about the proof complexity of deep inference: 1)
deep-inference proof systems are as...

[4] Score: 0.3865 | Doc ID: 0712.4126
  

## 9. System Evaluation

We evaluate the quality of the system using:
- **Precision@k:** Fraction of the top-k results that are relevant
- **Recall@k:** Fraction of all relevant documents that appear in the top-k results

We compare performance before and after re-ranking.
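
For reference, the two metrics can be written as follows, where $R_k$ denotes the top-k retrieved documents for a query and $\mathrm{Rel}$ the set of documents judged relevant in the qrels. Both symbols are notation introduced here for illustration and do not appear in the notebook's code.

```latex
\mathrm{Precision@}k = \frac{|R_k \cap \mathrm{Rel}|}{k},
\qquad
\mathrm{Recall@}k = \frac{|R_k \cap \mathrm{Rel}|}{|\mathrm{Rel}|}
```

For instance, if 4 of the top 5 results are relevant and the query has 2689 relevant documents, then Precision@5 = 4/5 = 0.8 while Recall@5 = 4/2689 ≈ 0.0015. With relevant sets in the hundreds or thousands, Recall@k stays small for any small k, however good the ranking is.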

In [54]:
# Evaluation functions
def precision_at_k(retrieved_docs, relevant_docs, k):
    """Fraction of the top-k retrieved documents that are relevant (always divides by k)."""
    if k == 0 or len(retrieved_docs) == 0:
        return 0.0

    top_k = retrieved_docs[:k]
    relevant_retrieved = sum(1 for doc_id in top_k if doc_id in relevant_docs)

    return relevant_retrieved / k

def recall_at_k(retrieved_docs, relevant_docs, k):
    """Fraction of all relevant documents that appear in the top-k retrieved documents."""
    if len(relevant_docs) == 0:
        return 0.0

    top_k = retrieved_docs[:k]
    relevant_retrieved = sum(1 for doc_id in top_k if doc_id in relevant_docs)

    return relevant_retrieved / len(relevant_docs)

def evaluate_results(results_dict, qrels_dict, k_values=(5, 10, 20)):
    """Average Precision@k and Recall@k over all queries that have relevant documents."""
    metrics = {k: {'precision': [], 'recall': []} for k in k_values}

    for query_id in results_dict:
        if query_id not in qrels_dict:
            continue

        # Retrieved document IDs, in rank order
        retrieved = [doc_id for doc_id, _ in results_dict[query_id]]

        # Relevant documents (relevance > 0)
        relevant = set([doc_id for doc_id, rel in qrels_dict[query_id].items() if rel > 0])

        if len(relevant) == 0:
            continue

        # Compute the metrics for each k
        for k in k_values:
            prec = precision_at_k(retrieved, relevant, k)
            rec = recall_at_k(retrieved, relevant, k)
            metrics[k]['precision'].append(prec)
            metrics[k]['recall'].append(rec)

    # Compute the averages
    avg_metrics = {}
    for k in k_values:
        avg_metrics[k] = {
            'precision': np.mean(metrics[k]['precision']) if metrics[k]['precision'] else 0.0,
            'recall': np.mean(metrics[k]['recall']) if metrics[k]['recall'] else 0.0
        }

    return avg_metrics

print("Funciones de evaluación definidas")

Funciones de evaluación definidas


In [55]:
# Evaluate the initial retrieval results
k_values = [5, 10, 20]
initial_metrics = evaluate_results(initial_results, qrels_dict, k_values)

print("METRICS - INITIAL RETRIEVAL")

for k in k_values:
    print(f"\nk = {k}:")
    print(f"  Precision@{k}: {initial_metrics[k]['precision']:.4f}")
    print(f"  Recall@{k}:    {initial_metrics[k]['recall']:.4f}")

METRICS - INITIAL RETRIEVAL

k = 5:
  Precision@5: 0.7600
  Recall@5:    0.0013

k = 10:
  Precision@10: 0.7500
  Recall@10:    0.0027

k = 20:
  Precision@20: 0.7850
  Recall@20:    0.0059


In [56]:
# Evaluate the results after re-ranking

reranked_metrics = evaluate_results(reranked_results, qrels_dict, k_values)

print("METRICS - AFTER RE-RANKING")

for k in k_values:
    print(f"\nk = {k}:")
    print(f"  Precision@{k}: {reranked_metrics[k]['precision']:.4f}")
    print(f"  Recall@{k}:    {reranked_metrics[k]['recall']:.4f}")

METRICS - AFTER RE-RANKING

k = 5:
  Precision@5: 0.8200
  Recall@5:    0.0016

k = 10:
  Precision@10: 0.7800
  Recall@10:    0.0030

k = 20:
  Precision@20: 0.3900
  Recall@20:    0.0030


In [57]:
# Side-by-side comparison
print("COMPARISON: INITIAL RETRIEVAL vs RE-RANKING")

# Collect one row per metric and build the DataFrame once
rows = []
for k in k_values:
    for metric in ['precision', 'recall']:
        initial_val = initial_metrics[k][metric]
        reranked_val = reranked_metrics[k][metric]
        # Relative change with respect to the initial retrieval
        improvement = ((reranked_val - initial_val) / initial_val * 100) if initial_val > 0 else 0

        rows.append({
            'Metric': f"{metric.capitalize()}@{k}",
            'Initial': f"{initial_val:.4f}",
            'Re-ranking': f"{reranked_val:.4f}",
            'Change (%)': f"{improvement:+.2f}%"
        })

comparison_df = pd.DataFrame(rows, columns=['Metric', 'Initial', 'Re-ranking', 'Change (%)'])

print("\n")
print(comparison_df.to_string(index=False))

COMPARISON: INITIAL RETRIEVAL vs RE-RANKING


      Metric Initial Re-ranking Change (%)
 Precision@5  0.7600     0.8200     +7.89%
    Recall@5  0.0013     0.0016    +29.72%
Precision@10  0.7500     0.7800     +4.00%
   Recall@10  0.0027     0.0030    +10.50%
Precision@20  0.7850     0.3900    -50.32%
   Recall@20  0.0059     0.0030    -48.48%


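The comparison shows that re-ranking improves the top of the ranking, which matters most to the user: Precision@5 rises from 0.7600 to 0.8200 and Precision@10 from 0.7500 to 0.7800, so the re-ranking model places more relevant documents in the first positions. At k = 20, however, both metrics drop by about half. Precision@20 (0.3900) is exactly half of Precision@10 (0.7800) and Recall@20 equals Recall@10 (0.0030), which is the pattern expected if the re-ranked lists contain only 10 documents per query; in that case the k = 20 figures reflect list truncation rather than ranking quality.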
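
The effect of list length on Precision@k can be checked with a small sketch. It assumes, purely for illustration, a ranked list that holds only 10 documents; the document IDs are hypothetical and the helper simply mirrors the divide-by-k convention of `precision_at_k` in this notebook.

```python
def precision_at_k(retrieved_docs, relevant_docs, k):
    """Precision@k that always divides by k, as in the notebook."""
    if k == 0 or len(retrieved_docs) == 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved_docs[:k] if doc_id in relevant_docs)
    return hits / k

# Hypothetical ranked list of only 10 documents, 8 of them relevant
retrieved = [f"d{i}" for i in range(10)]
relevant = {f"d{i}" for i in range(8)}

p10 = precision_at_k(retrieved, relevant, 10)
p20 = precision_at_k(retrieved, relevant, 20)

# Positions 11-20 do not exist, so they count as misses
print(f"Precision@10 = {p10:.4f}, Precision@20 = {p20:.4f}")
```

With a 10-document list, Precision@20 can never exceed 0.5, so a fair comparison at k = 20 would need at least 20 re-ranked candidates per query.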

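#### QUALITY OF THE INITIAL RETRIEVAL:


The initial retrieval, based on semantic embeddings (Sentence-BERT), provides a first selection of candidate documents using vector similarity. Precision is high while recall is very low in absolute terms (Precision@10 of 0.7500 against Recall@10 of 0.0027), which implies that the qrels mark far more documents as relevant per query than a top-10 list can hold.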
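
As a minimal sketch of what similarity-based candidate selection involves, the snippet below ranks documents by cosine similarity with NumPy. The random vectors stand in for Sentence-BERT embeddings, and `top_k_by_cosine` is a hypothetical helper, not the notebook's actual search code.

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_matrix, k):
    """Return (indices, scores) of the k documents most similar to the query."""
    # Normalize so that the dot product equals the cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    # Sort by descending similarity and keep the first k
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Placeholder data: 100 random "document embeddings" of dimension 16
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(100, 16))
# A query that is a slightly perturbed copy of document 42
query_embedding = doc_embeddings[42] + 0.01 * rng.normal(size=16)

indices, scores = top_k_by_cosine(query_embedding, doc_embeddings, k=5)
print(indices, scores)
```

Normalizing both sides first means the same ranking could also be obtained from an inner-product index over unit-length vectors.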

In [58]:
print(f"- Initial Precision@10: {initial_metrics[10]['precision']:.4f}")
print(f"- Initial Recall@10:    {initial_metrics[10]['recall']:.4f}")


- Initial Precision@10: 0.7500
- Initial Recall@10:    0.0027


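#### IMPACT OF RE-RANKING:

Re-ranking with a Cross-Encoder improves the top-10 results modestly (+4.00% in Precision@10 and +10.50% in Recall@10, both relative to the initial retrieval). Cross-encoders process the query and the document jointly, which lets them capture more complex interactions and semantic nuances than a comparison of independently computed embeddings.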
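
A minimal sketch of the re-ranking step, under stated assumptions: `rerank` and `overlap_scorer` are hypothetical helpers, and the scorer is a toy stand-in with the same call shape as `CrossEncoder.predict` (a list of (query, text) pairs in, one score per pair out). The sketch keeps every candidate in the output, so metrics at larger k stay comparable with the initial retrieval.

```python
def rerank(query, candidates, score_pairs):
    """Re-order (doc_id, text) candidates by a pairwise relevance scorer."""
    # The scorer sees query and document together, one pair per candidate
    pairs = [(query, text) for _, text in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    # Same (doc_id, score) shape that evaluate_results expects
    return [(doc_id, float(score)) for (doc_id, _), score in ranked]

def overlap_scorer(pairs):
    """Toy scorer (assumption): number of words shared by query and document."""
    return [len(set(q.lower().split()) & set(t.lower().split())) for q, t in pairs]

# Hypothetical candidates from a first-stage retrieval
candidates = [
    ("a", "proof complexity of deep inference"),
    ("b", "fast learning rates for plug-in classifiers"),
    ("c", "learning from dependent observations"),
]

reranked = rerank("fast learning rates", candidates, overlap_scorer)
print(reranked)
```

With a real model, `score_pairs` would be the `predict` method of a loaded `CrossEncoder` instead of the toy scorer.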


In [59]:
precision_improvement = ((reranked_metrics[10]['precision'] - initial_metrics[10]['precision']) /
                         initial_metrics[10]['precision'] * 100) if initial_metrics[10]['precision'] > 0 else 0
recall_improvement = ((reranked_metrics[10]['recall'] - initial_metrics[10]['recall']) /
                      initial_metrics[10]['recall'] * 100) if initial_metrics[10]['recall'] > 0 else 0

print(f"- Improvement in Precision@10: {precision_improvement:+.2f}%")
print(f"- Improvement in Recall@10:    {recall_improvement:+.2f}%")

- Improvement in Precision@10: +4.00%
- Improvement in Recall@10:    +10.50%


## Conclusion
- Embedding-based pipeline
- Balance between efficiency and precision
- Measurable improvement from re-ranking at k = 5 and k = 10
- Scalable to large corpora


