# Exercise 10: Re-ranking

**Objective:** Implement and evaluate a two-stage Information Retrieval pipeline, and analyze the impact of re-ranking on ranking quality.

## Part 1. Corpus preparation

* Load the corpus (documents/passages).
* Load the queries.
* Load the qrels (relevance judgments).

In [1]:
!pip install beir

Successfully installed beir-2.2.0 pytrec-eval-terrier-0.5.10


In [2]:
from beir import util
from beir.datasets.data_loader import GenericDataLoader
import pandas as pd



In [3]:
DATASET_NAME = "scifact"
DATA_DIR = "../data/beir_datasets"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{DATASET_NAME}.zip"
util.download_and_unzip(url, DATA_DIR)


'../data/beir_datasets/scifact'

In [4]:
dataset_path = DATA_DIR + "/" + DATASET_NAME
corpus, queries, qrels = GenericDataLoader(dataset_path).load(split="test")


In [5]:
df_corpus = (
    pd.DataFrame.from_dict(corpus, orient="index")
      .reset_index()
      .rename(columns={"index": "doc_id"})
)

df_corpus

| | doc_id | text | title |
|---|---|---|---|
| 0 | 4983 | Alterations of the architecture of cerebral wh... | Microstructural development of human newborn c... |
| 1 | 5836 | Myelodysplastic syndromes (MDS) are age-depend... | Induction of myelodysplasia by myeloid-derived... |
| 2 | 7912 | ID elements are short interspersed elements (S... | BC1 RNA, the transcript from a master gene for... |
| 3 | 18670 | DNA methylation plays an important role in bio... | The DNA Methylome of Human Peripheral Blood Mo... |
| 4 | 19238 | Two human Golli (for gene expressed in the oli... | The human myelin basic protein gene is include... |
| ... | ... | ... | ... |
| 5178 | 195689316 | BACKGROUND The main associations of body-mass ... | Body-mass index and cause-specific mortality i... |
| 5179 | 195689757 | A key aberrant biological difference between t... | Targeting metabolic remodeling in glioblastoma... |
| 5180 | 196664003 | A signaling pathway transmits information from... | Signaling architectures that transmit unidirec... |
| 5181 | 198133135 | AIMS Trabecular bone score (TBS) is a surrogat... | Association between pre-diabetes, type 2 diabe... |


In [6]:
df_queries = (
    pd.DataFrame.from_dict(queries, orient="index", columns=["query"])
      .reset_index()
      .rename(columns={"index": "query_id"})
)

df_queries

| | query_id | query |
|---|---|---|
| 0 | 1 | 0-dimensional biomaterials show inductive prop... |
| 1 | 3 | 1,000 genomes project enables mapping of genet... |
| 2 | 5 | 1/2000 in UK have abnormal PrP positivity. |
| 3 | 13 | 5% of perinatal mortality is due to low birth ... |
| 4 | 36 | A deficiency of vitamin B12 increases blood le... |
| ... | ... | ... |
| 295 | 1379 | Women with a higher birth weight are more like... |
| 296 | 1382 | aPKCz causes tumour enhancement by affecting g... |
| 297 | 1385 | cSMAC formation enhances weak ligand signalling. |
| 298 | 1389 | mTORC2 regulates intracellular cysteine levels... |


In [7]:
rows = []
for qid, docs in qrels.items():
    for doc_id, rel in docs.items():
        rows.append({
            "query_id": qid,
            "doc_id": doc_id,
            "relevance": rel
        })

df_qrels = pd.DataFrame(rows)
df_qrels

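| | query_id | doc_id | relevance |
|---|---|---|---|
| 0 | 1 | 31715818 | 1 |
| 1 | 3 | 14717500 | 1 |
| 2 | 5 | 13734012 | 1 |
| 3 | 13 | 1606628 | 1 |
| 4 | 36 | 5152028 | 1 |
| ... | ... | ... | ... |
| 334 | 1379 | 17450673 | 1 |
| 335 | 1382 | 17755060 | 1 |
| 336 | 1385 | 306006 | 1 |
| 337 | 1389 | 23895668 | 1 |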


In [8]:
# Pick a query that has several relevant documents
qid = "133"

print("Query:")
print(df_queries.loc[df_queries["query_id"] == qid, "query"].values[0])

print("\nRelevant documents for this query:")
df_qrels[(df_qrels["query_id"] == qid) & (df_qrels["relevance"] > 0)]

Query:
Assembly of invadopodia is triggered by focal generation of phosphatidylinositol-3,4-biphosphate and the activation of the nonreceptor tyrosine kinase Src.

Relevant documents for this query:

| | query_id | doc_id | relevance |
|---|---|---|---|
| 31 | 133 | 38485364 | 1 |
| 32 | 133 | 6969753 | 1 |
| 33 | 133 | 17934082 | 1 |
| 34 | 133 | 16280642 | 1 |
| 35 | 133 | 12640810 | 1 |


## Part 2. Initial retrieval (baseline)

* Implement the initial retrieval with BM25.
* Compute the metrics Recall@10 and nDCG@10.
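
For reference, the two metrics can be written as follows, using the same exponential gain and log2 discount as the `dcg_at_k` function defined later in this part. The symbols $R_q$ (set of relevant documents for query $q$), $D_q^{k}$ (top-$k$ retrieved documents) and $rel_i$ (relevance label of the document at rank $i$) are notation introduced here for illustration only.

```latex
\mathrm{Recall@}k = \frac{|R_q \cap D_q^{k}|}{|R_q|}, \qquad
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad
\mathrm{nDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
```

Here $\mathrm{IDCG@}k$ is the $\mathrm{DCG@}k$ of an ideal ordering, in which documents are sorted by decreasing relevance.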

In [26]:
!pip install rank-bm25




In [27]:
from rank_bm25 import BM25Okapi
import numpy as np

# Build the list of documents (only the "text" field is indexed, not the title)
documents = df_corpus["text"].tolist()

# Simple tokenization: lowercase and split on whitespace
tokenized_docs = [doc.lower().split() for doc in documents]

# Map positional index -> doc_id
doc_ids = df_corpus["doc_id"].tolist()

# Build the BM25 index
bm25 = BM25Okapi(tokenized_docs)
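
Whitespace splitting leaves punctuation attached to tokens, so a query term such as `Src.` will not match `Src` in a document. A minimal sketch of a slightly stricter tokenizer is shown below; `TOKEN_RE` and `tokenize` are hypothetical names that are not part of the notebook, and whether this improves BM25 on this corpus would have to be measured.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Keeps runs of letters/digits, allowing inner hyphens or apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[-'][a-z0-9]+)*")

def tokenize(text: str) -> list[str]:
    """Lowercase the text and return alphanumeric tokens without surrounding punctuation."""
    return TOKEN_RE.findall(text.lower())

# Compare with the whitespace tokenization used above.
sample = "activation of the nonreceptor tyrosine kinase Src."
print(sample.lower().split()[-1])
print(tokenize(sample)[-1])
```

If this tokenizer is used, it has to be applied to both the documents and the queries so that the two sides stay comparable.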


In [28]:
# Retrieval for a single query
# Selected query
query_text = df_queries.loc[df_queries["query_id"] == qid, "query"].values[0]
tokenized_query = query_text.lower().split()

# BM25 scores
scores = bm25.get_scores(tokenized_query)

# Top-10 documents
top_k = 10
top_indices = np.argsort(scores)[::-1][:top_k]

bm25_results = [
    {"doc_id": doc_ids[i], "score": scores[i]}
    for i in top_indices
]

bm25_results



[{'doc_id': '26688294', 'score': np.float64(55.1964401863664)},
 {'doc_id': '37964706', 'score': np.float64(50.04011691148892)},
 {'doc_id': '9507605', 'score': np.float64(50.03740998262752)},
 {'doc_id': '5270265', 'score': np.float64(45.70320340871325)},
 {'doc_id': '12785130', 'score': np.float64(45.06239524984214)},
 {'doc_id': '45764440', 'score': np.float64(45.044936360734006)},
 {'doc_id': '86694016', 'score': np.float64(44.899837173478296)},
 {'doc_id': '12640810', 'score': np.float64(44.68953632297576)},
 {'doc_id': '5821617', 'score': np.float64(44.451933317595774)},
 {'doc_id': '17934082', 'score': np.float64(44.43642301215712)}]

In [29]:
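# Baseline evaluation (Recall@10 and nDCG@10)
def dcg_at_k(relevances, k):
    # DCG with exponential gain (2^rel - 1) and a log2(rank + 1) discount
    relevances = np.array(relevances)[:k]
    return np.sum((2**relevances - 1) / np.log2(np.arange(2, len(relevances) + 2)))

def ndcg_at_k(relevances, k):
    dcg = dcg_at_k(relevances, k)
    # Note: the ideal ranking is built from the retrieved list only, so relevant
    # documents that are missing from the list do not lower the score
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg / ideal if ideal > 0 else 0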
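
The `ndcg_at_k` function above normalizes by the best ordering of the retrieved list. A common alternative builds the ideal ranking from every judged document for the query, so that relevant documents missed by the retriever are penalized. The sketch below illustrates that variant under the assumption that the qrels for one query are a `doc_id -> relevance` dictionary, as in the BEIR `qrels` object; `dcg` and `ndcg_at_k_full` are hypothetical names, and the ids are made up.

```python
import math

def dcg(gains):
    """DCG with exponential gain and a log2(rank + 1) discount."""
    return sum((2**g - 1) / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k_full(retrieved_ids, qrels_for_query, k=10):
    """nDCG@k whose ideal ranking uses every judged document for the query.

    retrieved_ids: ranked list of doc ids.
    qrels_for_query: dict doc_id -> graded relevance (assumed format).
    """
    gains = [qrels_for_query.get(d, 0) for d in retrieved_ids[:k]]
    ideal_gains = sorted(qrels_for_query.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Toy example with made-up ids: 3 relevant docs, only one retrieved, at rank 2.
toy_qrels = {"a": 1, "b": 1, "c": 1}
toy_ranking = ["x", "a", "y"]
print(round(ndcg_at_k_full(toy_ranking, toy_qrels, k=3), 4))
```

With this normalization the score of a list depends on how many relevant documents it contains as well as on their order, so values are generally lower than those of the retrieved-list variant.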



In [30]:
# Metric computation
# Ground-truth relevant documents for the query
relevant_docs = df_qrels[
    (df_qrels["query_id"] == qid) & (df_qrels["relevance"] > 0)
]["doc_id"].tolist()

retrieved_docs = [r["doc_id"] for r in bm25_results]

# Binary relevance labels in ranking order
relevances = [1 if doc in relevant_docs else 0 for doc in retrieved_docs]

recall_10 = sum(relevances) / len(relevant_docs)
ndcg_10 = ndcg_at_k(relevances, 10)

recall_10, ndcg_10


(0.4, np.float64(0.3706656904013186))
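
The values above describe a single query. A baseline figure for the whole test set would average the metric over all queries. The sketch below shows one way to do that for Recall@k; `recall_at_k`, `mean_recall_at_k` and the `search_fn` interface are assumptions introduced for illustration, and the toy dictionaries only mimic the shape of the BEIR `queries` and `qrels` objects.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for d in retrieved_ids[:k] if d in relevant_ids)
    return hits / len(relevant_ids)

def mean_recall_at_k(queries, qrels, search_fn, k=10):
    """Average Recall@k over every query that has at least one relevant document.

    search_fn(query_text) -> ranked list of doc ids (hypothetical retriever interface).
    """
    values = []
    for qid, text in queries.items():
        relevant = {d for d, rel in qrels.get(qid, {}).items() if rel > 0}
        if relevant:
            values.append(recall_at_k(search_fn(text), relevant, k))
    return sum(values) / len(values) if values else 0.0

# Toy data standing in for the BEIR dictionaries (made-up ids).
toy_queries = {"q1": "alpha", "q2": "beta"}
toy_qrels = {"q1": {"d1": 1, "d2": 1}, "q2": {"d3": 1}}
toy_index = {"alpha": ["d1", "d9"], "beta": ["d3", "d4"]}
print(mean_recall_at_k(toy_queries, toy_qrels, toy_index.get, k=10))
```

In the notebook, `search_fn` would wrap the BM25 steps of the previous cells: tokenize the query, call `bm25.get_scores`, and map the top indices to `doc_ids`.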

## Part 3. Re-ranking with a _cross-encoder_

* Re-rank the top-k candidates for each query.
* Identify which documents change position in the top 10.

In [31]:
!pip install sentence-transformers




In [32]:
from sentence_transformers import CrossEncoder
# Initialize the cross-encoder
cross_encoder = CrossEncoder(
    "cross-encoder/ms-marco-MiniLM-L-6-v2"
)


In [33]:
# Re-rank the BM25 top-k
# Build (query, document) pairs
pairs = [
    (query_text, df_corpus.loc[df_corpus["doc_id"] == r["doc_id"], "text"].values[0])
    for r in bm25_results
]

# Cross-encoder scores
ce_scores = cross_encoder.predict(pairs)

# New ranking
reranked_ce = sorted(
    zip(bm25_results, ce_scores),
    key=lambda x: x[1],
    reverse=True
)

reranked_ce



[({'doc_id': '12640810', 'score': np.float64(44.68953632297576)},
  np.float32(0.57647496)),
 ({'doc_id': '9507605', 'score': np.float64(50.03740998262752)},
  np.float32(-2.2167883)),
 ({'doc_id': '17934082', 'score': np.float64(44.43642301215712)},
  np.float32(-2.3683429)),
 ({'doc_id': '86694016', 'score': np.float64(44.899837173478296)},
  np.float32(-2.4351656)),
 ({'doc_id': '37964706', 'score': np.float64(50.04011691148892)},
  np.float32(-6.994311)),
 ({'doc_id': '5821617', 'score': np.float64(44.451933317595774)},
  np.float32(-9.367951)),
 ({'doc_id': '45764440', 'score': np.float64(45.044936360734006)},
  np.float32(-9.44344)),
 ({'doc_id': '12785130', 'score': np.float64(45.06239524984214)},
  np.float32(-9.5805645)),
 ({'doc_id': '26688294', 'score': np.float64(55.1964401863664)},
  np.float32(-9.794056)),
 ({'doc_id': '5270265', 'score': np.float64(45.70320340871325)},
  np.float32(-10.447477))]
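
Because the cell above re-ranks exactly the 10 documents that BM25 returned, the set of documents in the top 10 cannot change, only their order. Two-stage pipelines usually re-rank a larger candidate pool (for example the BM25 top 100) and then keep the best 10, which lets the second stage bring in relevant documents that BM25 ranked lower. The sketch below illustrates that pattern with a toy scorer; `rerank`, `overlap_score` and the pool contents are hypothetical, and the `score_fn` interface is an assumption modeled on how `cross_encoder.predict(pairs)` is called in this notebook.

```python
def rerank(query, candidates, score_fn, k_out=10):
    """Score every (query, text) pair and return the k_out best doc ids.

    candidates: list of (doc_id, text) from the first stage, e.g. the BM25 top 100.
    score_fn: callable that takes a list of (query, text) pairs and returns
              one score per pair (hypothetical interface).
    """
    scores = score_fn([(query, text) for _, text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc_id for (doc_id, _), _ in ranked[:k_out]]

# Toy scorer for illustration only: counts query words present in the text.
def overlap_score(pairs):
    return [len(set(q.split()) & set(t.split())) for q, t in pairs]

pool = [("d1", "cats sleep"), ("d2", "dogs bark loudly"), ("d3", "loud dogs bark at cats")]
print(rerank("dogs bark", pool, overlap_score, k_out=2))
```

The pool size trades quality against cost, since the second-stage model scores every candidate in the pool for every query.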

In [34]:
# Documents that changed position
print("BM25 ranking:")
for i, r in enumerate(bm25_results):
    print(i+1, r["doc_id"])

print("\nCross-encoder ranking:")
for i, (r, s) in enumerate(reranked_ce):
    print(i+1, r["doc_id"])


BM25 ranking:
1 26688294
2 37964706
3 9507605
4 5270265
5 12785130
6 45764440
7 86694016
8 12640810
9 5821617
10 17934082

Cross-encoder ranking:
1 12640810
2 9507605
3 17934082
4 86694016
5 37964706
6 5821617
7 45764440
8 12785130
9 26688294
10 5270265
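
Comparing the two printed lists by eye works for one query but does not scale. The sketch below computes the position change of each document directly; `rank_changes` is a hypothetical helper, and the two lists are copied from the output above.

```python
def rank_changes(before, after):
    """Return (doc_id, old_rank, new_rank, delta) for every doc whose position changed.

    Ranks are 1-based; a positive delta means the document moved up.
    """
    old_rank = {doc: i for i, doc in enumerate(before, start=1)}
    changes = []
    for new, doc in enumerate(after, start=1):
        old = old_rank.get(doc)
        if old is not None and old != new:
            changes.append((doc, old, new, old - new))
    return changes

# The two rankings printed above for query 133.
bm25_order = ["26688294", "37964706", "9507605", "5270265", "12785130",
              "45764440", "86694016", "12640810", "5821617", "17934082"]
ce_order = ["12640810", "9507605", "17934082", "86694016", "37964706",
            "5821617", "45764440", "12785130", "26688294", "5270265"]

for doc, old, new, delta in rank_changes(bm25_order, ce_order):
    print(f"{doc}: {old} -> {new} ({delta:+d})")
```

For this query every document changes position; the two relevant documents in the list, 12640810 and 17934082, both move up by 7 places.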


## Part 4. Re-ranking with _LTR_

* Re-rank the top-k candidates for each query.
* Identify which documents change position in the top 10.

In [36]:
from sklearn.linear_model import LogisticRegression


In [37]:
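# Feature matrix and labels, one row per candidate, in cross-encoder order
X = []
y = []

for (bm25_r, ce_score) in reranked_ce:
    doc_id = bm25_r["doc_id"]

    # Features: BM25 score and cross-encoder score
    X.append([bm25_r["score"], ce_score])
    # Label: 1 if the qrels mark the document as relevant for this query
    y.append(1 if doc_id in relevant_docs else 0)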

X = np.array(X)
y = np.array(y)


In [38]:
# Train the LTR model (pointwise logistic regression fitted on the 10 candidates of this query)
ltr_model = LogisticRegression()
ltr_model.fit(X, y)
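
The model above is fitted on the 10 candidates of one query and later scored on those same candidates, so its metrics do not indicate how it would behave on unseen queries. One common remedy is to split the queries, build training rows from one group and evaluate on the other. The sketch below shows only the splitting step; `split_query_ids`, the 20% test fraction and the seed are illustrative assumptions.

```python
import random

def split_query_ids(query_ids, test_fraction=0.2, seed=0):
    """Shuffle query ids reproducibly and split them into (train, test) lists."""
    ids = sorted(query_ids)
    random.Random(seed).shuffle(ids)
    n_test = max(1, int(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]

# Example with the first query ids shown in Part 1.
train_qids, test_qids = split_query_ids(["1", "3", "5", "13", "36"], test_fraction=0.2)
print(len(train_qids), len(test_qids))
```

Splitting by query rather than by row keeps all candidates of a query on the same side, which avoids leaking information about a test query into training.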


In [39]:
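# LTR re-ranking
ltr_scores = ltr_model.predict_proba(X)[:, 1]

# Caveat: ltr_scores follows the row order of X (cross-encoder order), while
# bm25_results is in BM25 order, so this zip pairs each score with a different
# document than the one it was computed for
reranked_ltr = sorted(
    zip(bm25_results, ltr_scores),
    key=lambda x: x[1],
    reverse=True
)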

reranked_ltr


[({'doc_id': '26688294', 'score': np.float64(55.1964401863664)},
  np.float64(0.8767080224864796)),
 ({'doc_id': '9507605', 'score': np.float64(50.03740998262752)},
  np.float64(0.550983480079952)),
 ({'doc_id': '5270265', 'score': np.float64(45.70320340871325)},
  np.float64(0.47674632103639114)),
 ({'doc_id': '37964706', 'score': np.float64(50.04011691148892)},
  np.float64(0.05862466359711247)),
 ({'doc_id': '45764440', 'score': np.float64(45.044936360734006)},
  np.float64(0.013245796215048167)),
 ({'doc_id': '86694016', 'score': np.float64(44.899837173478296)},
  np.float64(0.009145802593325461)),
 ({'doc_id': '12640810', 'score': np.float64(44.68953632297576)},
  np.float64(0.008300053471848635)),
 ({'doc_id': '17934082', 'score': np.float64(44.43642301215712)},
  np.float64(0.0033563421765213557)),
 ({'doc_id': '12785130', 'score': np.float64(45.06239524984214)},
  np.float64(0.0028607873795236318)),
 ({'doc_id': '5821617', 'score': np.float64(44.451933317595774)},
  np.float64(
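
In the cell above, the rows of `X` were built by iterating over `reranked_ce`, but the predicted scores are zipped with `bm25_results`, which is in a different order. A ranking is only meaningful when `scores[i]` belongs to `candidates[i]`. The sketch below illustrates the aligned pairing with made-up ids and scores; `rank_by_scores` is a hypothetical helper. In this notebook the candidate list that matches `X` would be `[r for r, _ in reranked_ce]`.

```python
def rank_by_scores(candidates, scores):
    """Sort candidates by score, assuming scores[i] was computed for candidates[i]."""
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    return sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)

# Toy example (made-up ids and scores): the features were built in this order,
# so model_scores[i] belongs to feature_order[i].
feature_order = ["d3", "d1", "d2"]
model_scores = [0.9, 0.2, 0.5]
print([doc for doc, _ in rank_by_scores(feature_order, model_scores)])
```

Keeping the document id next to its feature row, for example in a DataFrame with a `doc_id` column, is another way to make this kind of misalignment harder to introduce.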

## Part 5. Evaluation after re-ranking

Compute the metrics:
* nDCG@10
* MAP
* Recall@10

In [40]:
# Metrics for the cross-encoder
ce_docs = [r["doc_id"] for r, _ in reranked_ce]
ce_rels = [1 if d in relevant_docs else 0 for d in ce_docs]

ce_recall_10 = sum(ce_rels) / len(relevant_docs)
ce_ndcg_10 = ndcg_at_k(ce_rels, 10)

ce_recall_10, ce_ndcg_10


(0.4, np.float64(0.9197207891481876))

In [41]:
# Metrics for LTR
ltr_docs = [r["doc_id"] for r, _ in reranked_ltr]
ltr_rels = [1 if d in relevant_docs else 0 for d in ltr_docs]

ltr_recall_10 = sum(ltr_rels) / len(relevant_docs)
ltr_ndcg_10 = ndcg_at_k(ltr_rels, 10)

ltr_recall_10, ltr_ndcg_10


(0.4, np.float64(0.39780880120575696))

In [42]:
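# Average precision (AP) for a single query; MAP is the mean of AP over all queries
def average_precision(relevances):
    score = 0.0
    hits = 0
    for i, rel in enumerate(relevances):
        if rel == 1:
            hits += 1
            score += hits / (i + 1)  # precision at this rank
    # Note: normalizes by the relevant documents present in the list, not by all relevant documents
    return score / sum(relevances) if sum(relevances) > 0 else 0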
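
The function above yields one AP value per ranked list and divides by the number of relevant documents found in that list. The sketch below illustrates the more usual definition, which divides by the total number of relevant documents for the query, together with the mean over queries that gives MAP. `average_precision_full`, `mean_average_precision` and the input formats are assumptions introduced for illustration, and the toy ids are made up.

```python
def average_precision_full(retrieved_ids, relevant_ids, k=None):
    """AP normalized by the total number of relevant documents for the query."""
    if not relevant_ids:
        return 0.0
    ranked = retrieved_ids if k is None else retrieved_ids[:k]
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant_ids:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant_ids)

def mean_average_precision(runs, qrels):
    """MAP: mean of AP over the queries in runs.

    runs: dict query_id -> ranked list of doc ids.
    qrels: dict query_id -> set of relevant doc ids (assumed already binarized).
    """
    aps = [average_precision_full(docs, qrels.get(qid, set())) for qid, docs in runs.items()]
    return sum(aps) / len(aps) if aps else 0.0

toy_runs = {"q1": ["a", "x", "b"], "q2": ["y", "c"]}
toy_rel = {"q1": {"a", "b", "z"}, "q2": {"c"}}
print(mean_average_precision(toy_runs, toy_rel))
```

Under this definition a query with five relevant documents, of which only two are retrieved, can reach an AP of at most 0.4.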


In [43]:
ap_bm25 = average_precision(relevances)
ap_ce = average_precision(ce_rels)
ap_ltr = average_precision(ltr_rels)

ap_bm25, ap_ce, ap_ltr


(0.1625, 0.8333333333333333, 0.19642857142857142)

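For query 133, cross-encoder re-ranking clearly improves on BM25: average precision rises from 0.16 to 0.83 and nDCG@10 from about 0.37 to about 0.92, because the two relevant documents that BM25 retrieved move from positions 8 and 10 to positions 1 and 3. The LTR ranking shows only a small change (nDCG@10 ≈ 0.40, AP ≈ 0.20), but that figure is not a reliable measure of the approach: the model is trained and evaluated on the same 10 candidates of a single query, and its predicted scores are paired with the documents in BM25 order rather than in the order used to build the features. All three rankings have Recall@10 = 0.4, which is expected because both re-rankers only reorder the 10 documents returned by BM25. These values come from a single query, so the AP figures are not yet a MAP, and both nDCG and AP are normalized over the retrieved relevant documents only.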