# Final Exam - Information Retrieval System
## ICCD753 Information Retrieval 2025-B
### TREC Robust 2004 Collection

*Student:* Fabian Simbaña

# Section 1: Data Processing

In [11]:
import json
import pandas as pd
import itertools

DATA_PATH = "/kaggle/input/arxiv/arxiv-metadata-oai-snapshot.json"

def load_arxiv_sample(filepath, num_samples=10000):
    docs = []
    # Read the file line by line to avoid loading it all into memory
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in itertools.islice(f, num_samples):
            paper = json.loads(line)
            docs.append({
                'doc_id': paper['id'],
                'title': paper['title'],
                'abstract': paper['abstract'],
                # Concatenate title and abstract for more context
                'text': f"{paper['title']}. {paper['abstract']}"
            })
    return pd.DataFrame(docs)

# Load the data
print("Loading arXiv dataset sample...")
df_docs = load_arxiv_sample(DATA_PATH, num_samples=50000)
print(f"Documents loaded: {len(df_docs)}")
display(df_docs.head(2))

Loading arXiv dataset sample...
Documents loaded: 50000


| | doc_id | title | abstract | text |
|---|---|---|---|---|
| 0 | 704.0001 | Calculation of prompt diphoton production cros... | A fully differential calculation in perturba... | Calculation of prompt diphoton production cros... |
| 1 | 704.0002 | Sparsity-certifying Graph Decompositions | We describe a new algorithm, the $(k,\ell)$-... | Sparsity-certifying Graph Decompositions. We... |
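The loading cell relies on `itertools.islice` to stop after a fixed number of lines of a JSON Lines file. The sketch below isolates that pattern with the standard library only. The helper name `read_jsonl_head` and the three in-memory records are hypothetical and stand in for the real snapshot file.

```python
import io
import itertools
import json

def read_jsonl_head(handle, num_records):
    """Parse at most `num_records` lines of an open JSON Lines file."""
    records = []
    for line in itertools.islice(handle, num_records):
        line = line.strip()
        if not line:  # tolerate blank lines
            continue
        records.append(json.loads(line))
    return records

# Hypothetical three-record stand-in for the snapshot file
fake_file = io.StringIO(
    '{"id": "a1", "title": "First", "abstract": "Alpha."}\n'
    '{"id": "a2", "title": "Second", "abstract": "Beta."}\n'
    '{"id": "a3", "title": "Third", "abstract": "Gamma."}\n'
)
sample = read_jsonl_head(fake_file, 2)
remaining = fake_file.readline()  # the third line was never consumed
```

Because `islice` pulls lines lazily, only the requested prefix of the file is read and parsed, which is what keeps memory usage bounded on a multi-gigabyte snapshot.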


In [12]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import re

nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')

def preprocess(text):
    # Normalization: lowercase everything
    text = str(text).lower()
    # Tokenization and cleanup: keep only alphabetic tokens
    tokens = re.findall(r'\b[a-z]+\b', text)
    # Stopword removal and stemming
    clean_tokens = [stemmer.stem(t) for t in tokens if t not in stop_words]
    return " ".join(clean_tokens)

df_docs['clean_text'] = df_docs['text'].apply(preprocess)
print("Preprocessing completed.")

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Preprocessing completed.
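To see what the regex `\b[a-z]+\b` keeps and discards, the following sketch applies the same pattern to a few strings. It is a simplified illustration, not the notebook's pipeline: `toy_tokenize` and `TOY_STOPWORDS` are hypothetical names, the stopword list is a tiny hand-written one instead of NLTK's, and stemming is omitted.

```python
import re

# Hypothetical, deliberately tiny stopword list; the notebook uses NLTK's English list
TOY_STOPWORDS = {"a", "the", "for", "of", "and", "we"}

def toy_tokenize(text):
    """Lowercase, keep alphabetic runs delimited by word boundaries, drop stopwords."""
    tokens = re.findall(r"\b[a-z]+\b", str(text).lower())
    return [t for t in tokens if t not in TOY_STOPWORDS]

covid_tokens = toy_tokenize("COVID-19 vaccine effectiveness")
object_tokens = toy_tokenize("3D Object Recognition")
latex_tokens = toy_tokenize(r"the $(k,\ell)$-sparse graphs")

print(covid_tokens)   # digits are dropped, so "19" disappears
print(object_tokens)  # "3d" has no word boundary before "d", so it vanishes entirely
print(latex_tokens)   # LaTeX markup leaves fragments such as "k" and "ell"
```

Tokens that mix letters and digits are lost, and LaTeX commands in abstracts survive as stray fragments, which is worth keeping in mind when reading the retrieval results.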


# Section 2: Representation with Embeddings

In [13]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load a lightweight model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings (a GPU is used automatically when available)
print("Generating embeddings...")
doc_embeddings = model.encode(df_docs['clean_text'].tolist(), show_progress_bar=True)

# Build the FAISS index (exact L2 search)
d = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(d)
index.add(doc_embeddings)

print(f"FAISS index created with {index.ntotal} vectors.")


Generating embeddings...


FAISS index created with 50000 vectors.
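A flat L2 index performs an exhaustive comparison of the query against every stored vector and reports squared Euclidean distances. The pure-Python sketch below mimics that behavior on toy data to make the computation concrete. The function `flat_l2_search` and the 2-D vectors are hypothetical and not part of FAISS.

```python
def flat_l2_search(vectors, query, k):
    """Exhaustive search: return (squared_distances, ids) of the k nearest vectors."""
    scored = []
    for doc_id, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, doc_id))
    scored.sort()  # ascending distance; ties broken by lower id
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Hypothetical 2-D "embeddings"
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, ids = flat_l2_search(toy_vectors, query=[0.9, 0.1], k=2)
```

The cost grows linearly with the number of stored vectors, which is acceptable for 50000 documents but is the reason approximate indexes exist for larger collections.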


# Section 3: Initial Retrieval (First-Stage Retrieval)

In [14]:
# Initial search function
def search_initial(query, k=50):
    query_clean = preprocess(query)
    query_emb = model.encode([query_clean])
    _, I = index.search(query_emb, k)

    results = []
    for idx in I[0]:
        # FAISS pads with -1 when fewer than k neighbors exist; skip those
        if 0 <= idx < len(df_docs):
            item = df_docs.iloc[idx].to_dict()
            results.append(item)
    return results

# Section 4: Re-ranking Results

In [15]:
from sentence_transformers import CrossEncoder

# Model trained specifically to score query-document relevance
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query, initial_results):
    if not initial_results:
        return []

    # Build (query, document) pairs
    pairs = [[query, doc['text']] for doc in initial_results]

    # Predict relevance scores
    scores = cross_encoder.predict(pairs)

    # Attach each score to its document (this mutates the input dicts)
    for doc, score in zip(initial_results, scores):
        doc['score'] = float(score)

    # Sort in descending order of score
    ranked_results = sorted(initial_results, key=lambda x: x['score'], reverse=True)
    return ranked_results

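The re-ranking step follows a general pattern: score each (query, document) pair with a more expensive function, then sort the candidate pool by that score. The sketch below shows the pattern with a pluggable scorer so it runs without any model. `rerank` and `overlap_score` are hypothetical names, and the word-overlap scorer is only a stand-in for the cross-encoder, not an approximation of it.

```python
def overlap_score(query, text):
    """Hypothetical stand-in scorer: fraction of query words present in the text."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)

def rerank(query, candidates, score_fn, top_n=None):
    """Return new dicts sorted by score_fn(query, text), highest first."""
    scored = [dict(doc, score=score_fn(query, doc["text"])) for doc in candidates]
    scored.sort(key=lambda doc: doc["score"], reverse=True)
    return scored if top_n is None else scored[:top_n]

candidates = [
    {"doc_id": "d1", "text": "graph decompositions and sparsity"},
    {"doc_id": "d2", "text": "quantum algorithms for search"},
    {"doc_id": "d3", "text": "quantum computing algorithms survey"},
]
reranked = rerank("quantum computing algorithms", candidates, overlap_score)
```

Unlike the notebook's function, this variant copies each candidate before attaching the score, so the first-stage list stays untouched and can be compared against the re-ranked one afterwards.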

# Section 5: Query Simulation

In [16]:
# Define typical queries for scientific papers
topics = [
    {'qid': '1', 'query': 'quantum computing algorithms', 'keywords': ['quantum', 'algorithm']},
    {'qid': '2', 'query': 'deep learning for image recognition', 'keywords': ['deep learning', 'image', 'vision', 'cnn']},
    {'qid': '3', 'query': 'dark matter detection', 'keywords': ['dark matter', 'detect']},
    {'qid': '4', 'query': 'covid-19 vaccine effectiveness', 'keywords': ['covid', 'vaccine', 'sars-cov-2']},
    {'qid': '5', 'query': 'reinforcement learning robotics', 'keywords': ['reinforcement', 'robot']}
]

# Function to simulate "relevance judgments"
def get_ground_truth(doc_df, keywords):
    relevant_ids = set()
    for idx, row in doc_df.iterrows():
        # A document counts as relevant if any keyword appears in its title
        if any(k in row['title'].lower() for k in keywords):
            relevant_ids.add(row['doc_id'])
    return relevant_ids
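Keyword matching on titles is a substring test, so it can both miss on-topic documents and admit off-topic ones. The sketch below reproduces the same rule on three invented titles to show each case. `keyword_ground_truth` and `toy_docs` are hypothetical and work on plain dicts instead of a DataFrame.

```python
def keyword_ground_truth(docs, keywords):
    """Return ids of docs whose lowercased title contains any keyword as a substring."""
    return {
        doc["doc_id"]
        for doc in docs
        if any(kw in doc["title"].lower() for kw in keywords)
    }

# Hypothetical titles
toy_docs = [
    {"doc_id": "t1", "title": "Convolutional Networks for Object Classification"},
    {"doc_id": "t2", "title": "Image Compression with Wavelets"},
    {"doc_id": "t3", "title": "Deep Learning for Visual Recognition"},
]
relevant = keyword_ground_truth(toy_docs, ["deep learning", "image", "vision", "cnn"])
```

Here `t1` is arguably on topic but matches no keyword, while `t2` is labeled relevant only because its title contains "image". Both effects distort precision and recall computed against such labels.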

# Section 6: System Evaluation

In [19]:
# Updated evaluation block (fully covers requirement 6)

print(f"{'QUERY':<35} | {'P@10 (Ini)':<10} {'R@10 (Ini)':<10} | {'P@10 (Rnk)':<10} {'R@10 (Rnk)':<10}")
print("-" * 95)

metrics_data = []

for topic in topics:
    query = topic['query']

    # Ground truth
    relevant_ids = get_ground_truth(df_docs, topic['keywords'])
    total_relevantes = len(relevant_ids)

    # Skip topics with no relevant documents in the sample
    if total_relevantes == 0:
        continue

    # First-stage retrieval (embeddings)
    res_initial = search_initial(query, k=50)
    top10_initial = [doc['doc_id'] for doc in res_initial[:10]]

    # Re-ranking
    res_rerank = rerank_results(query, res_initial)
    top10_rerank = [doc['doc_id'] for doc in res_rerank[:10]]

    # --- METRIC COMPUTATION ---

    # Hits
    hits_initial = len(set(top10_initial) & relevant_ids)
    hits_rerank = len(set(top10_rerank) & relevant_ids)

    # Precision@10
    p10_initial = hits_initial / 10.0
    p10_rerank = hits_rerank / 10.0

    # Recall@10
    r10_initial = hits_initial / total_relevantes
    r10_rerank = hits_rerank / total_relevantes
    print(f"{query[:33]:<35} | {p10_initial:<10.2f} {r10_initial:<10.2f} | {p10_rerank:<10.2f} {r10_rerank:<10.2f}")
    
    metrics_data.append({
        'query': query,
        'p10_initial': p10_initial, 'r10_initial': r10_initial,
        'p10_rerank': p10_rerank, 'r10_rerank': r10_rerank
    })

QUERY                               | P@10 (Ini) R@10 (Ini) | P@10 (Rnk) R@10 (Rnk)
-----------------------------------------------------------------------------------------------
quantum computing algorithms        | 1.00       0.00       | 1.00       0.00      
deep learning for image recogniti   | 0.00       0.00       | 0.00       0.00      
dark matter detection               | 0.90       0.01       | 0.90       0.01      
reinforcement learning robotics     | 0.60       0.21       | 0.30       0.10      
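The two metrics in the table can be factored into small helpers, sketched below with the same definitions as the evaluation loop (hits divided by k, and hits divided by the number of relevant documents). The function names and the toy ids are hypothetical. The count of 29 relevant documents is an assumption chosen because it is consistent with the rounded values in the "reinforcement learning robotics" row; the notebook does not report the actual count.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k positions occupied by relevant documents."""
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / k

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

# Hypothetical judgment set: 29 relevant ids, 6 of which are in the top 10
relevant = {f"rel{i}" for i in range(29)}
ranking = [f"rel{i}" for i in range(6)] + [f"other{i}" for i in range(4)]

p10 = precision_at_k(ranking, relevant)
r10 = recall_at_k(ranking, relevant)
print(f"{p10:.2f} {r10:.2f}")  # prints "0.60 0.21"
```

The example also explains the "quantum computing algorithms" row: when the keyword rule labels a very large number of documents as relevant, even a perfect top 10 yields a recall that rounds to 0.00.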


# Section 7: Results Analysis

In [20]:
print("\n--- DETAILED ANALYSIS: 'Deep Learning for Image Recognition' ---\n")

# Re-run both stages for the example query
example_query = 'deep learning for image recognition'
initial = search_initial(example_query, k=50)
reranked = rerank_results(example_query, initial)

print("Top 3 - Initial Search (Embeddings):")
for i, doc in enumerate(initial[:3]):
    print(f"{i+1}. {doc['title']} (ID: {doc['doc_id']})")

print("\nTop 3 - After Re-ranking (Cross-Encoder):")
for i, doc in enumerate(reranked[:3]):
    print(f"{i+1}. {doc['title']} (Score: {doc['score']:.4f})")


--- DETAILED ANALYSIS: 'Deep Learning for Image Recognition' ---

Top 3 - Initial Search (Embeddings):
1. Multi-Dimensional Recurrent Neural Networks (ID: 0705.2011)
2. Learning to Bluff (ID: 0705.0693)
3. Learning Similarity for Character Recognition and 3D Object Recognition (ID: 0712.0131)

Top 3 - After Re-ranking (Cross-Encoder):
1. Learning View Generalization Functions (Score: 0.5854)
2. Learning Similarity for Character Recognition and 3D Object Recognition (Score: -0.7003)
3. Comparing Robustness of Pairwise and Multiclass Neural-Network Systems
  for Face Recognition (Score: -1.9789)
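In the output above, one of the three initial titles survives re-ranking and two are replaced. A comparison like this can be automated with a small helper, sketched below on placeholder ids. `rank_changes` is a hypothetical function and the ids are invented, not taken from the run.

```python
def rank_changes(before_ids, after_ids, k=3):
    """Summarize how the top-k set changed between two rankings of the same pool."""
    before_top, after_top = before_ids[:k], after_ids[:k]
    return {
        "kept": [d for d in after_top if d in before_top],
        "promoted": [d for d in after_top if d not in before_top],
        "demoted": [d for d in before_top if d not in after_top],
    }

# Hypothetical ids standing in for a first-stage order and its re-ranked order
before = ["A", "B", "C", "D", "E"]
after = ["D", "C", "E", "A", "B"]
changes = rank_changes(before, after, k=3)
```

Running it per query would show whether the cross-encoder mostly reshuffles the same documents or pulls new ones up from deeper in the 50-candidate pool.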


# Final Conclusion
The implemented pipeline follows the design of a modern two-stage system (retrieval + re-ranking). The results show that:

- Embeddings are robust at finding general topical similarity, even when the terminology varies slightly.

- Re-ranking adds a layer of "understanding", but its effectiveness depends on the data domain being aligned with the model's training data. For "reinforcement learning robotics", P@10 dropped from 0.60 to 0.30 after re-ranking.

- Evaluation without official qrels is difficult. Treating keywords as absolute truth tends to underestimate semantic models, which produces the mismatch observed: "deep learning for image recognition" scores a P@10 of 0.00 even though its top results concern neural networks and object recognition.