## Setting up the Colab environment

In [1]:
# Install the required libraries
!pip install faiss-cpu sentence-transformers nltk --quiet

import json
import pandas as pd
import random
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Loading the arXiv dataset (Kaggle to Colab)

In [2]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("Cornell-University/arxiv")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'arxiv' dataset.
Path to dataset files: /kaggle/input/arxiv


In [3]:
import json
import pandas as pd

# Load the dataset (using the path you got)
file_path = '/kaggle/input/arxiv/arxiv-metadata-oai-snapshot.json'

# Generator to read the file line by line to avoid memory issues
def get_metadata():
    with open(file_path, 'r') as f:
        for line in f:
            yield json.loads(line)

# Limit the number of documents to reduce run time (the full dataset has 2,938,427 records; we use the first 100,000)
metadata = get_metadata()
docs = []
for i, paper in enumerate(metadata):
    if i >= 100000: break
    docs.append({
        'id': paper['id'],
        'title': paper['title'],
        'abstract': paper['abstract'],
        'categories': paper['categories']
    })

df = pd.DataFrame(docs)
print(f"Loaded {len(df)} documents")
df.head()

Loaded 100000 documents


| | id | title | abstract | categories |
|---|---|---|---|---|
| 0 | 0704.0001 | Calculation of prompt diphoton production cros... | A fully differential calculation in perturba... | hep-ph |
| 1 | 0704.0002 | Sparsity-certifying Graph Decompositions | We describe a new algorithm, the $(k,\ell)$-... | math.CO cs.CG |
| 2 | 0704.0003 | The evolution of the Earth-Moon system based o... | The evolution of Earth-Moon system is descri... | physics.gen-ph |
| 3 | 0704.0004 | A determinant of Stirling cycle numbers counts... | We show that a determinant of Stirling cycle... | math.CO |
| 4 | 0704.0005 | From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a... | In this paper we show how to compute the $\L... | math.CA math.FA |
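
The snapshot is a JSON Lines file (one JSON object per line), which is why it can be streamed instead of loaded whole. As a minimal sketch, the same truncated read can be written with `itertools.islice`. The helper `read_jsonl` and the small temporary file are hypothetical and exist only so the example runs without the real snapshot.

```python
import itertools
import json
import os
import tempfile

def read_jsonl(path, limit=None):
    """Yield one parsed record per line, stopping after `limit` records."""
    with open(path, "r", encoding="utf-8") as f:
        lines = (line for line in f if line.strip())  # skip blank lines
        for line in itertools.islice(lines, limit):
            yield json.loads(line)

# Tiny stand-in file (invented records) in place of the arXiv snapshot.
sample = [{"id": f"0704.000{i}", "title": f"Paper {i}"} for i in range(1, 6)]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    for record in sample:
        tmp.write(json.dumps(record) + "\n")

first_three = list(read_jsonl(tmp.name, limit=3))
os.remove(tmp.name)
print([r["id"] for r in first_three])
```

Only the requested number of lines is parsed, so memory use stays proportional to `limit` rather than to the file size.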


### Text preprocessing

In [4]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Tokenization
    tokens = nltk.word_tokenize(text.lower())
    # Remove stopwords and non-alphabetic tokens, then lemmatize
    tokens = [lemmatizer.lemmatize(t) for t in tokens if t.isalpha() and t not in stop_words]
    return " ".join(tokens)

# Apply only to the abstract for this example
df['processed'] = df['abstract'].fillna("").apply(preprocess)
df.head()



| | id | title | abstract | categories | processed |
|---|---|---|---|---|---|
| 0 | 0704.0001 | Calculation of prompt diphoton production cros... | A fully differential calculation in perturba... | hep-ph | fully differential calculation perturbative qu... |
| 1 | 0704.0002 | Sparsity-certifying Graph Decompositions | We describe a new algorithm, the $(k,\ell)$-... | math.CO cs.CG | describe new algorithm k game color use obtain... |
| 2 | 0704.0003 | The evolution of the Earth-Moon system based o... | The evolution of Earth-Moon system is descri... | physics.gen-ph | evolution system described dark matter field f... |
| 3 | 0704.0004 | A determinant of Stirling cycle numbers counts... | We show that a determinant of Stirling cycle... | math.CO | show determinant stirling cycle number count u... |
| 4 | 0704.0005 | From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a... | In this paper we show how to compute the $\L... | math.CA math.FA | paper show compute norm using dyadic grid resu... |
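
One side effect of the `t.isalpha()` filter is visible in row 2 of the output: the hyphenated token "Earth-Moon" is missing from the processed text. The dependency-free sketch below reproduces the effect. The whitespace tokenizer and the four-word stopword set are simplifying assumptions standing in for `nltk.word_tokenize` and the NLTK stopword list, and lemmatization is omitted.

```python
STOP_WORDS = {"the", "of", "is", "by"}  # tiny illustrative subset, not the NLTK list

def simple_preprocess(text):
    """Lowercase, split on whitespace, keep purely alphabetic non-stopword tokens."""
    tokens = text.lower().split()
    kept = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
    return " ".join(kept)

result = simple_preprocess("The evolution of Earth-Moon system is described by dark matter")
print(result)
# "earth-moon" is dropped because str.isalpha() is False for strings containing "-"
```

If hyphenated terms matter for retrieval, splitting on hyphens before filtering is one option to consider.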


### Generating embeddings

In [5]:
model = SentenceTransformer('all-MiniLM-L6-v2')

# Document embeddings
doc_embeddings = model.encode(df['processed'].tolist(), show_progress_bar=True, convert_to_numpy=True)
print("Document embeddings generated:", doc_embeddings.shape)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Batches:   0%|          | 0/3125 [00:00<?, ?it/s]

Document embeddings generated: (100000, 384)
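
The numbers in this output can be cross-checked with a little arithmetic. The sketch assumes the default `batch_size=32` of `SentenceTransformer.encode` and float32 embeddings (4 bytes per value). Both are assumptions about this run, not something the output states directly.

```python
import math

n_docs, dim = 100_000, 384
batch_size = 32          # assumed default of SentenceTransformer.encode
bytes_per_value = 4      # assumed float32

n_batches = math.ceil(n_docs / batch_size)
size_mb = n_docs * dim * bytes_per_value / 1e6

print(n_batches)  # 3125, matching the total shown in the progress bar
print(size_mb)    # 153.6 (megabytes occupied by the embedding matrix)
```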


### Building the vector index with FAISS

In [6]:
dim = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(doc_embeddings)
print(f"FAISS index created with {index.ntotal} documents")



FAISS index created with 100000 documents
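
`IndexFlatL2` is an exact (brute-force) index: a search compares the query against every stored vector and returns the k smallest squared Euclidean distances. Below is a pure-Python sketch of that computation. The helper `flat_l2_search` is hypothetical, and toy 2-D vectors stand in for the 384-dimensional embeddings.

```python
def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k nearest vectors by squared L2 distance."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first; ties broken by index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = flat_l2_search(toy_vectors, query=[1.0, 1.0], k=2)
print(distances, indices)  # [1.0, 2.0] [1, 0]
```

The linear scan is what makes the flat index exact. FAISS performs the same comparison with optimized native code.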


### Generating queries (simulation)

Because the arXiv dataset does not come with predefined queries or relevance judgments (qrels), we use paper titles as queries.

In [7]:
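# DataFrame.sample uses NumPy's random generator, so random.seed() would not affect it;
# random_state makes the selection reproducible.
query_samples = df.sample(5, random_state=42)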

queries = query_samples['title'].tolist()
query_ids = query_samples['id'].tolist()

print("Selected queries:")
for q in queries:
    print("-", q)


Selected queries:
- Programming an interpreter using molecular dynamics
- High frequency intrinsic modes in El Ni\~no Southern Oscillation Index
- Isometry theorem for the Segal-Bargmann transform on noncompact
  symmetric spaces of the complex type
- Regular subalgebras of affine Kac-Moody algebras
- EuSpRIG 2006 Commercial Spreadsheet Review


### Initial retrieval (first-stage retrieval)

In [8]:
top_k = 10  # Number of candidates per query
query_embeddings = model.encode(queries, convert_to_numpy=True)

D, I = index.search(query_embeddings, top_k)  # Distances and indices


Displaying the results:

In [9]:
for q_idx, q in enumerate(queries):
    print(f"\nQuery: {q}")
    for rank, doc_idx in enumerate(I[q_idx]):
        print(f"{rank+1}. {df.iloc[doc_idx]['title']} ({df.iloc[doc_idx]['id']})")



Query: Programming an interpreter using molecular dynamics
1. Python Unleashed on Systems Biology (0704.3259)
2. Understanding Life with Molecular Dynamics and Thermodynamics: Comment
  on Nature 451, 240-243 (2008) (0802.2244)
3. Peptide Folding Kinetics from Replica Exchange Molecular Dynamics (0710.5533)
4. Unifying thermodynamic and kinetic descriptions of single-molecule
  processes: RNA unfolding under tension (0709.2609)
5. Thermodynamics of a model for RNA folding (0804.0221)
6. A modified Ehrenfest formalism for efficient large-scale ab initio
  molecular dynamics (0812.2801)
7. Molecular Systems with Infinite and Finite Degrees of Freedom. Part II:
  Deterministic Dynamics and Examples (0802.4279)
8. Programming an interpreter using molecular dynamics (0801.2226)
9. A Bell-Evans-Polanyi principle for molecular dynamics trajectories and
  its implications for global optimization (0705.0838)
10. Molecular Energy Relations From Chemical Kinetics (0706.0552)

Query: High frequen

### Simulating qrels for evaluation

In [10]:
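def simulate_qrels(queries_ids):
    """Treat every document sharing at least one category with the query paper as relevant."""
    qrels = {}
    for qid in queries_ids:
        # Categories of the paper whose title is used as the query
        query_cats = df[df['id']==qid]['categories'].values[0].split()
        # All documents with at least one category in common (this includes the query paper itself)
        relevant_docs = df[df['categories'].apply(lambda x: any(cat in x.split() for cat in query_cats))]['id'].tolist()
        qrels[qid] = relevant_docs
    return qrels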

qrels = simulate_qrels(query_ids)
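
To make the relevance rule concrete: a document counts as relevant when it shares at least one category with the query paper, which also makes the query paper relevant to itself. Below is a toy sketch of the same rule using set intersection. The `toy_papers` records and the helper `shared_category_ids` are invented for illustration.

```python
toy_papers = {
    "A": "math.CO cs.CG",
    "B": "math.CO",
    "C": "hep-ph",
    "D": "cs.CG cs.DS",
}

def shared_category_ids(query_id, papers):
    """Return the ids of all papers that share at least one category with the query paper."""
    query_cats = set(papers[query_id].split())
    return [pid for pid, cats in papers.items() if query_cats & set(cats.split())]

relevant_for_a = shared_category_ids("A", toy_papers)
relevant_for_c = shared_category_ids("C", toy_papers)
print(relevant_for_a)  # ['A', 'B', 'D']
print(relevant_for_c)  # ['C']
```

Papers with several categories, or with a popular category, get large relevant sets under this rule, which matters when interpreting recall.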


### Metrics: Precision@k and Recall@k

In [11]:
def precision_at_k(retrieved_ids, relevant_ids, k):
    retrieved_k = retrieved_ids[:k]
    return len(set(retrieved_k) & set(relevant_ids)) / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    retrieved_k = retrieved_ids[:k]
    return len(set(retrieved_k) & set(relevant_ids)) / len(relevant_ids)

# Example with the top 10
for q_idx, qid in enumerate(query_ids):
    retrieved_ids = df.iloc[I[q_idx]]['id'].tolist()
    relevant_ids = qrels[qid]
    print(f"\nQuery: {queries[q_idx]}")
    print(f"Precision@10: {precision_at_k(retrieved_ids, relevant_ids, 10):.2f}")
    print(f"Recall@10: {recall_at_k(retrieved_ids, relevant_ids, 10):.2f}")



Query: Programming an interpreter using molecular dynamics
Precision@10: 0.10
Recall@10: 0.01

Query: High frequency intrinsic modes in El Ni\~no Southern Oscillation Index
Precision@10: 0.00
Recall@10: 0.00

Query: Isometry theorem for the Segal-Bargmann transform on noncompact
  symmetric spaces of the complex type
Precision@10: 0.40
Recall@10: 0.00

Query: Regular subalgebras of affine Kac-Moody algebras
Precision@10: 0.70
Recall@10: 0.00

Query: EuSpRIG 2006 Commercial Spreadsheet Review
Precision@10: 0.20
Recall@10: 0.01
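
For reference, the two functions above compute the usual set-based definitions, where $R_k$ is the set of the top-$k$ retrieved documents and $G$ is the set of relevant documents for the query (both symbols are introduced here for illustration):

```latex
\mathrm{Precision@}k = \frac{|R_k \cap G|}{k},
\qquad
\mathrm{Recall@}k = \frac{|R_k \cap G|}{|G|}
```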


### Re-ranking with embeddings

In [12]:
from sklearn.metrics.pairwise import cosine_similarity

# Re-rank by cosine similarity
final_ranking = []
for q_idx, q_emb in enumerate(query_embeddings):
    candidate_embs = doc_embeddings[I[q_idx]]  # Embeddings of the top-k candidates
    sims = cosine_similarity([q_emb], candidate_embs)[0]
    ranked_idx = np.argsort(-sims)
    final_ranking.append([I[q_idx][i] for i in ranked_idx])

# Show the final ranking
for q_idx, ranks in enumerate(final_ranking):
    print(f"\nQuery: {queries[q_idx]} (Re-ranked)")
    for rank, doc_idx in enumerate(ranks):
        print(f"{rank+1}. {df.iloc[doc_idx]['title']} ({df.iloc[doc_idx]['id']})")



Query: Programming an interpreter using molecular dynamics (Re-ranked)
1. Python Unleashed on Systems Biology (0704.3259)
2. Understanding Life with Molecular Dynamics and Thermodynamics: Comment
  on Nature 451, 240-243 (2008) (0802.2244)
3. Peptide Folding Kinetics from Replica Exchange Molecular Dynamics (0710.5533)
4. Unifying thermodynamic and kinetic descriptions of single-molecule
  processes: RNA unfolding under tension (0709.2609)
5. Thermodynamics of a model for RNA folding (0804.0221)
6. A modified Ehrenfest formalism for efficient large-scale ab initio
  molecular dynamics (0812.2801)
7. Molecular Systems with Infinite and Finite Degrees of Freedom. Part II:
  Deterministic Dynamics and Examples (0802.4279)
8. Programming an interpreter using molecular dynamics (0801.2226)
9. A Bell-Evans-Polanyi principle for molecular dynamics trajectories and
  its implications for global optimization (0705.0838)
10. Molecular Energy Relations From Chemical Kinetics (0706.0552)

Query: 

### Analysis of results

#### Discussion of the quality of the results
The embedding-based retrieval system over the arXiv dataset identifies relevant documents in the top 10 with moderate precision for well-defined queries such as “Regular subalgebras of affine Kac-Moody algebras”, where Precision@10 was high (0.70). Across the five queries, Precision@10 ranged from 0.00 to 0.70.

However, recall is very low for every query (0.00 to 0.01), which means that only a small fraction of the relevant documents is retrieved. The most likely cause is that the simulated qrels include every document that shares a category with the query paper, so a top-10 list cannot cover them.
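
The ceiling on recall follows directly from the definition: with only k results, Recall@k cannot exceed k divided by the number of relevant documents. Below is a short sketch of that bound. The relevant-set sizes are hypothetical, since the notebook does not print the size of each simulated qrels list.

```python
def max_recall_at_k(k, n_relevant):
    """Best achievable Recall@k, reached when every retrieved document is relevant."""
    return min(k, n_relevant) / n_relevant

bounds = {n: max_recall_at_k(10, n) for n in (10, 1_000, 10_000)}  # hypothetical qrels sizes
for n_relevant, bound in bounds.items():
    print(n_relevant, bound)
```

By the same arithmetic, a Precision@10 of 0.70 (7 relevant hits) together with a Recall@10 that prints as 0.00 implies a relevant set of more than 1,400 documents for that query.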

#### Comparison between the initial retrieval and the final ranking
The re-ranking produced little or no change relative to the initial retrieval; for the query shown above, the two top-10 lists are identical. This is probably because the first-stage retrieval (FAISS over the embeddings) already captured semantic similarity fairly well. In addition, the re-ranking function only applied cosine similarity to the same embeddings, so it added no new information. Still, I think what affected the results most was using queries and qrels derived from the dataset itself: the documents are very specific, highly technical scientific titles, which a general-purpose embedding model may not distinguish well.
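
One likely reason the two rankings coincide: if the embeddings are unit-length, then $\|q - d\|^2 = 2 - 2\cos(q, d)$, so sorting by L2 distance and sorting by cosine similarity give the same order. The `all-MiniLM-L6-v2` pipeline is understood to end in a normalization step, but treat that as an assumption and check the vector norms before relying on it. A dependency-free sketch with toy vectors:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    # For unit vectors the dot product equals the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

query = normalize([1.0, 2.0, 0.5])
raw_docs = ([2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [-1.0, 0.5, 3.0], [0.9, 2.1, 0.4])
docs = [normalize(v) for v in raw_docs]

by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
by_cosine = sorted(range(len(docs)), key=lambda i: -cosine(query, docs[i]))

print(by_l2 == by_cosine)  # True: both orderings agree
```

Under that assumption, a second stage only changes the order if it uses a different signal, for example a cross-encoder or a lexical score.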

It is also worth noting that the preprocessing needs improvement: some outputs still contain special characters (LaTeX markup) that should not appear in either the qrels or the search results, for example \$gl(\infty)$, $sl(\infty)$, $so(\infty)$, and $sp(\infty)$.

(P.S. I added a backslash so that it does not render like this: $gl(\infty)$, $sl(\infty)$, $so(\infty)$, and $sp(\infty)$.)
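
One possible way to handle this, offered as a rough sketch rather than a complete LaTeX parser, is to remove inline math delimited by `$...$` before tokenizing. The helper name `strip_inline_math` and the regular expression are assumptions for illustration and do not cover every construct (for example `$$...$$` blocks or escaped dollar signs).

```python
import re

INLINE_MATH = re.compile(r"\$[^$]*\$")

def strip_inline_math(text):
    """Replace $...$ spans with a space, then collapse repeated whitespace."""
    without_math = INLINE_MATH.sub(" ", text)
    return re.sub(r"\s+", " ", without_math).strip()

cleaned = strip_inline_math(r"Subalgebras of $gl(\infty)$, $sl(\infty)$ and $sp(\infty)$ are studied")
print(cleaned)  # Subalgebras of , and are studied
```

The leftover punctuation would then be discarded by a later alphabetic-token filter such as the `t.isalpha()` check in `preprocess`.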