# Exercise 10: Re-ranking

## Darlin Anacicha    Course: GR1CC

**Objective:** Implement and evaluate a two-stage Information Retrieval pipeline, and analyze the impact of re-ranking on ranking quality.

## Part 1. Corpus preparation

* Load the corpus (documents/passages).
* Load the queries.
* Load the qrels (relevance judgments).
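
The three objects loaded in this part are plain dictionaries. The sketch below illustrates their shape with hypothetical toy IDs and texts (not taken from SciFact), assuming the layout implied by how the notebook indexes them: `corpus[doc_id]["text"]`, `queries[qid]`, and `qrels[qid][doc_id]`. The helper `relevant_docs` is an illustrative name, not part of any library.

```python
# Toy stand-ins for the loaded objects; all IDs and texts are invented.
corpus = {
    "d1": {"title": "Toy title A", "text": "toy passage about myelin"},
    "d2": {"title": "Toy title B", "text": "toy passage about methylation"},
}
queries = {"q1": "toy claim about myelin"}
qrels = {"q1": {"d1": 1}}  # query_id -> {doc_id: relevance grade}


def relevant_docs(qrels, query_id):
    """Return the set of doc_ids judged relevant (grade > 0) for a query."""
    return {doc_id for doc_id, rel in qrels.get(query_id, {}).items() if rel > 0}


print(relevant_docs(qrels, "q1"))
print(corpus["d1"]["text"])
```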

In [35]:
!pip install beir




In [36]:
import numpy as np  # used later for the LTR feature arrays
import pandas as pd
from beir import util
from beir.datasets.data_loader import GenericDataLoader

In [37]:
DATASET_NAME = "scifact"
DATA_DIR = "../data/beir_datasets"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{DATASET_NAME}.zip"
util.download_and_unzip(url, DATA_DIR)

'../data/beir_datasets/scifact'

In [38]:
dataset_path = DATA_DIR + "/" + DATASET_NAME
corpus, queries, qrels = GenericDataLoader(dataset_path).load(split="test")


In [39]:
df_corpus = (
    pd.DataFrame.from_dict(corpus, orient="index")
      .reset_index()
      .rename(columns={"index": "doc_id"})
)

df_corpus

| | doc_id | text | title |
|---|---|---|---|
| 0 | 4983 | Alterations of the architecture of cerebral wh... | Microstructural development of human newborn c... |
| 1 | 5836 | Myelodysplastic syndromes (MDS) are age-depend... | Induction of myelodysplasia by myeloid-derived... |
| 2 | 7912 | ID elements are short interspersed elements (S... | BC1 RNA, the transcript from a master gene for... |
| 3 | 18670 | DNA methylation plays an important role in bio... | The DNA Methylome of Human Peripheral Blood Mo... |
| 4 | 19238 | Two human Golli (for gene expressed in the oli... | The human myelin basic protein gene is include... |
| ... | ... | ... | ... |
| 5178 | 195689316 | BACKGROUND The main associations of body-mass ... | Body-mass index and cause-specific mortality i... |
| 5179 | 195689757 | A key aberrant biological difference between t... | Targeting metabolic remodeling in glioblastoma... |
| 5180 | 196664003 | A signaling pathway transmits information from... | Signaling architectures that transmit unidirec... |
| 5181 | 198133135 | AIMS Trabecular bone score (TBS) is a surrogat... | Association between pre-diabetes, type 2 diabe... |


In [40]:
df_queries = (
    pd.DataFrame.from_dict(queries, orient="index", columns=["query"])
      .reset_index()
      .rename(columns={"index": "query_id"})
)

df_queries

| | query_id | query |
|---|---|---|
| 0 | 1 | 0-dimensional biomaterials show inductive prop... |
| 1 | 3 | 1,000 genomes project enables mapping of genet... |
| 2 | 5 | 1/2000 in UK have abnormal PrP positivity. |
| 3 | 13 | 5% of perinatal mortality is due to low birth ... |
| 4 | 36 | A deficiency of vitamin B12 increases blood le... |
| ... | ... | ... |
| 295 | 1379 | Women with a higher birth weight are more like... |
| 296 | 1382 | aPKCz causes tumour enhancement by affecting g... |
| 297 | 1385 | cSMAC formation enhances weak ligand signalling. |
| 298 | 1389 | mTORC2 regulates intracellular cysteine levels... |


In [41]:
rows = []
for qid, docs in qrels.items():
    for doc_id, rel in docs.items():
        rows.append({
            "query_id": qid,
            "doc_id": doc_id,
            "relevance": rel
        })

df_qrels = pd.DataFrame(rows)
df_qrels

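| | query_id | doc_id | relevance |
|---|---|---|---|
| 0 | 1 | 31715818 | 1 |
| 1 | 3 | 14717500 | 1 |
| 2 | 5 | 13734012 | 1 |
| 3 | 13 | 1606628 | 1 |
| 4 | 36 | 5152028 | 1 |
| ... | ... | ... | ... |
| 334 | 1379 | 17450673 | 1 |
| 335 | 1382 | 17755060 | 1 |
| 336 | 1385 | 306006 | 1 |
| 337 | 1389 | 23895668 | 1 |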


In [42]:
# Pick a query that has several relevant documents
qid = "133"

print("Query:")
print(df_queries.loc[df_queries["query_id"] == qid, "query"].values[0])

print("\nRelevant documents for this query:")
df_qrels[(df_qrels["query_id"] == qid) & (df_qrels["relevance"] > 0)]

Query:
Assembly of invadopodia is triggered by focal generation of phosphatidylinositol-3,4-biphosphate and the activation of the nonreceptor tyrosine kinase Src.

Relevant documents for this query:


| | query_id | doc_id | relevance |
|---|---|---|---|
| 31 | 133 | 38485364 | 1 |
| 32 | 133 | 6969753 | 1 |
| 33 | 133 | 17934082 | 1 |
| 34 | 133 | 16280642 | 1 |
| 35 | 133 | 12640810 | 1 |


## Part 2. Initial retrieval (baseline)

* Implement the initial retrieval stage with BM25.
* Compute metrics: Recall@10 and nDCG@10.
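
As a reference for what the two metrics measure, here is a minimal sketch of Recall@k and nDCG@k in plain Python. It assumes linear gain and a log2 rank discount for nDCG, which is one common convention; it is not the evaluator's own implementation, and the function names and toy data are hypothetical.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k with linear gain and log2 discount; relevance maps doc_id -> grade."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    # Ideal DCG: the judged grades sorted from best to worst
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical toy ranking: two relevant documents, one retrieved at rank 2
ranked = ["a", "b", "c"]
rels = {"b": 1, "z": 1}
print(recall_at_k(ranked, set(rels), k=3))  # 0.5
print(ndcg_at_k(ranked, rels, k=3))
```

Recall@k only asks whether relevant documents made it into the top k, while nDCG@k also rewards placing them near the top. A re-ranker that only reorders a fixed top 10 can therefore change nDCG@10 but not Recall@10.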

In [43]:
!pip install rank-bm25

from rank_bm25 import BM25Okapi
import nltk
nltk.download("punkt")
from nltk.tokenize import word_tokenize




[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [44]:
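# Tokenize the corpus (lowercased body text only; titles are not indexed)
tokenized_corpus = [
    word_tokenize(doc["text"].lower())
    for doc in corpus.values()
]

bm25 = BM25Okapi(tokenized_corpus)
# Same iteration order as tokenized_corpus, so index i maps back to a doc_id
doc_ids = list(corpus.keys())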


In [45]:
def bm25_retrieve(query, top_k=10):
    """Return the top_k (doc_id, BM25 score) pairs for a query, best first."""
    tokenized_query = word_tokenize(query.lower())
    scores = bm25.get_scores(tokenized_query)
    top_idx = scores.argsort()[::-1][:top_k]
    return [(doc_ids[i], scores[i]) for i in top_idx]


In [46]:
query_text = queries[qid]
bm25_results = bm25_retrieve(query_text, top_k=10)
bm25_results


[('5270265', np.float64(59.853134662962574)),
 ('26688294', np.float64(58.67574493391684)),
 ('45764440', np.float64(58.14946033570541)),
 ('12785130', np.float64(55.52989817971354)),
 ('37964706', np.float64(55.02827881096887)),
 ('9507605', np.float64(53.13107970005058)),
 ('35884026', np.float64(52.360529973361274)),
 ('5914739', np.float64(51.94921374614491)),
 ('10991183', np.float64(50.82883355478484)),
 ('86694016', np.float64(49.19138716813143))]

In [50]:
# Keep only queries that have qrels
valid_queries = {qid: queries[qid] for qid in qrels.keys()}

print("Total queries:", len(queries))
print("Queries with qrels:", len(valid_queries))


Total queries: 300
Queries with qrels: 300


In [51]:
retrieval_results = {}

for qid_iter, query_text in valid_queries.items():
    bm25_results = bm25_retrieve(query_text, top_k=10)
    retrieval_results[qid_iter] = {
        doc_id: float(score) for doc_id, score in bm25_results
    }


In [52]:
from beir.retrieval.evaluation import EvaluateRetrieval

evaluator = EvaluateRetrieval()
ndcg, _map, recall, precision = evaluator.evaluate(
    qrels,
    retrieval_results,
    k_values=[10]
)

print("NDCG@10:", ndcg["NDCG@10"])
print("MAP@10:", _map["MAP@10"])
print("Recall@10:", recall["Recall@10"])
print("P@10:", precision["P@10"])


NDCG@10: 0.62006
MAP@10: 0.57898
Recall@10: 0.73417
P@10: 0.08


## Part 3. Implementing _cross-encoder_ re-ranking

* Re-rank the top-k candidates for each query.
* Identify which documents change position in the top 10.
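
The second stage can be sketched independently of any particular model: take the first-stage candidates, score each (query, passage) pair, and sort by the new score. In the sketch below, `rerank` and `overlap_scorer` are hypothetical names, and the scorer is a deliberately trivial stand-in so the example runs without downloading a model.

```python
def rerank(query, candidates, texts, score_fn):
    """Re-order first-stage candidates by a second-stage score, best first.

    candidates: list of (doc_id, first_stage_score) pairs
    texts: dict doc_id -> passage text
    score_fn: callable taking a list of (query, passage) pairs and
              returning one score per pair
    """
    pairs = [(query, texts[doc_id]) for doc_id, _ in candidates]
    scores = score_fn(pairs)
    reranked = sorted(
        zip((doc_id for doc_id, _ in candidates), scores),
        key=lambda item: item[1],
        reverse=True,
    )
    return [(doc_id, float(score)) for doc_id, score in reranked]


def overlap_scorer(pairs):
    """Hypothetical stand-in scorer: counts query words found in the passage."""
    return [
        sum(word in passage.lower().split() for word in query.lower().split())
        for query, passage in pairs
    ]


# Toy data, invented for illustration
texts = {"d1": "cats sleep", "d2": "dogs bark loudly", "d3": "dogs bark"}
candidates = [("d1", 3.0), ("d2", 2.0), ("d3", 1.0)]
print(rerank("dogs bark", candidates, texts, overlap_scorer))
```

In this notebook's setting, `score_fn` could be the `predict` method of a `CrossEncoder` model and `texts` a mapping from each `doc_id` to `corpus[doc_id]["text"]`. Applying it to every query in `valid_queries` would yield results that can be evaluated the same way as `retrieval_results`.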

In [53]:
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


In [54]:
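# Note: query_text and bm25_results are whatever the loop in In [51] left
# behind (its last query), so this cell re-ranks that query's candidates,
# not those of qid "133" shown earlier.
pairs = [
    (query_text, corpus[doc_id]["text"])
    for doc_id, _ in bm25_results
]

# Score every (query, document) pair jointly with the cross-encoder
scores = cross_encoder.predict(pairs)

# Sort the candidates by cross-encoder score, best first
reranked = sorted(
    zip([doc_id for doc_id, _ in bm25_results], scores),
    key=lambda x: x[1],
    reverse=True
)

reranked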


[('18218379', np.float32(-4.296707)),
 ('13923069', np.float32(-5.0027514)),
 ('3113630', np.float32(-5.2604175)),
 ('928281', np.float32(-6.2111826)),
 ('16630060', np.float32(-7.3663197)),
 ('7711685', np.float32(-7.4259653)),
 ('4387484', np.float32(-7.469968)),
 ('40987633', np.float32(-8.129323)),
 ('26079071', np.float32(-9.052057)),
 ('11569583', np.float32(-9.458267))]

In [55]:
print("BM25 order:")
print([doc for doc, _ in bm25_results])

print("\nCross-Encoder order:")
print([doc for doc, _ in reranked])


BM25 order:
['13923069', '928281', '4387484', '18218379', '16630060', '11569583', '26079071', '40987633', '3113630', '7711685']

Cross-Encoder order:
['18218379', '13923069', '3113630', '928281', '16630060', '7711685', '4387484', '40987633', '26079071', '11569583']
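
To see exactly which documents moved, the two orderings printed above can be compared position by position. This is a small sketch; `rank_changes` is a hypothetical helper name, and the two lists are copied from the output above.

```python
def rank_changes(before, after):
    """Map each doc_id to (old_rank, new_rank, shift); positive shift = moved up."""
    new_rank = {doc_id: rank for rank, doc_id in enumerate(after, start=1)}
    changes = {}
    for old, doc_id in enumerate(before, start=1):
        new = new_rank.get(doc_id)
        if new is not None and new != old:
            changes[doc_id] = (old, new, old - new)
    return changes


# The two orderings printed above
bm25_order = ["13923069", "928281", "4387484", "18218379", "16630060",
              "11569583", "26079071", "40987633", "3113630", "7711685"]
ce_order = ["18218379", "13923069", "3113630", "928281", "16630060",
            "7711685", "4387484", "40987633", "26079071", "11569583"]

for doc_id, (old, new, shift) in rank_changes(bm25_order, ce_order).items():
    print(f"{doc_id}: {old} -> {new} ({shift:+d})")
```

On these two lists, eight of the ten documents change position; `16630060` and `40987633` keep ranks 5 and 8. The largest move is `3113630`, which rises from rank 9 to rank 3.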


## Part 4. Implementing _LTR_ re-ranking

* Re-rank the top-k candidates for each query.
* Identify which documents change position in the top 10.

In [65]:
X = []
y = []

# Build one training row per (query, candidate) pair from the BM25 top 10.
# The only feature is the BM25 score itself.
for qid_iter, query_text in valid_queries.items():
    bm25_results = bm25_retrieve(query_text, top_k=10)

    for doc_id, score in bm25_results:
        X.append([score])
        # Label 1 if the document is judged relevant for this query, else 0
        y.append(1 if qrels[qid_iter].get(doc_id, 0) > 0 else 0)

X = np.array(X)
y = np.array(y)

np.unique(y, return_counts=True)


(array([0, 1]), array([2760,  240]))

In [66]:
from sklearn.linear_model import LogisticRegression

ltr = LogisticRegression(max_iter=1000)
ltr.fit(X, y)


In [67]:
bm25_results = bm25_retrieve(queries[qid], top_k=10)

X_test = np.array([[score] for _, score in bm25_results])
ltr_scores = ltr.predict_proba(X_test)[:, 1]

ltr_reranked = sorted(
    zip([doc_id for doc_id, _ in bm25_results], ltr_scores),
    key=lambda x: x[1],
    reverse=True
)

ltr_reranked


[('5270265', np.float64(0.2250428164498337)),
 ('26688294', np.float64(0.21450517800559976)),
 ('45764440', np.float64(0.20991241262601038)),
 ('12785130', np.float64(0.18813082950124704)),
 ('37964706', np.float64(0.18416388503035236)),
 ('9507605', np.float64(0.16974630498929472)),
 ('35884026', np.float64(0.16415239601331882)),
 ('5914739', np.float64(0.16122753155848205)),
 ('10991183', np.float64(0.15347400503596376)),
 ('86694016', np.float64(0.14269444770716566))]

In [68]:
bm25_order = [doc_id for doc_id, _ in bm25_results]
ltr_order = [doc_id for doc_id, _ in ltr_reranked]

print("Ranking BM25:")
for i, doc in enumerate(bm25_order, 1):
    print(f"{i:2d}. {doc}")

print("\nRanking LTR:")
for i, doc in enumerate(ltr_order, 1):
    print(f"{i:2d}. {doc}")


Ranking BM25:
 1. 5270265
 2. 26688294
 3. 45764440
 4. 12785130
 5. 37964706
 6. 9507605
 7. 35884026
 8. 5914739
 9. 10991183
10. 86694016

Ranking LTR:
 1. 5270265
 2. 26688294
 3. 45764440
 4. 12785130
 5. 37964706
 6. 9507605
 7. 35884026
 8. 5914739
 9. 10991183
10. 86694016
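
The LTR ranking matches the BM25 ranking because the model sees a single feature, the BM25 score. Logistic regression maps a score $s$ to $\sigma(w s + b)$, which is strictly increasing in $s$ whenever $w > 0$, so sorting by predicted probability reproduces the BM25 order. The probabilities above do decrease with the BM25 score, which is consistent with a positive weight. A small sketch, with hypothetical values for `w` and `b` (not the fitted ones):

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rerank_single_feature(results, w, b):
    """Re-rank (doc_id, score) pairs by sigmoid(w * score + b), best first."""
    scored = [(doc_id, sigmoid(w * score + b)) for doc_id, score in results]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# First three BM25 results for the example query, scores rounded
bm25_results = [("5270265", 59.85), ("26688294", 58.68), ("45764440", 58.15)]

# w and b are hypothetical; any w > 0 gives the same ordering
reranked = rerank_single_feature(bm25_results, w=0.05, b=-4.0)
print([doc_id for doc_id, _ in reranked])
```

A re-ranker of this kind can only change the order if it is given at least one feature beyond the first-stage score.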


## Part 5. Evaluation after re-ranking

Compute metrics:
* nDCG@10
* MAP
* Recall@10
In [72]:
ltr_results_all = {}

# Use qid_iter as the loop variable so the example qid defined earlier is not overwritten
for qid_iter, query_text in valid_queries.items():
    bm25_results = bm25_retrieve(query_text, top_k=10)

    X_test = np.array([[score] for _, score in bm25_results])
    ltr_scores = ltr.predict_proba(X_test)[:, 1]

    ltr_results_all[qid_iter] = {
        doc_id: float(score)
        for (doc_id, _), score in zip(bm25_results, ltr_scores)
    }


In [73]:
ndcg_ltr, map_ltr, recall_ltr, p_ltr = evaluator.evaluate(
    qrels,
    ltr_results_all,
    k_values=[10]
)

print("LTR NDCG@10:", ndcg_ltr["NDCG@10"])
print("LTR MAP@10:", map_ltr["MAP@10"])
print("LTR Recall@10:", recall_ltr["Recall@10"])


LTR NDCG@10: 0.62006
LTR MAP@10: 0.57898
LTR Recall@10: 0.73417


In [74]:
qid_example = list(valid_queries.keys())[0]

bm25_example = bm25_retrieve(valid_queries[qid_example], top_k=10)
X_ex = np.array([[score] for _, score in bm25_example])
ltr_scores_ex = ltr.predict_proba(X_ex)[:, 1]

ltr_example = sorted(
    zip([doc_id for doc_id, _ in bm25_example], ltr_scores_ex),
    key=lambda x: x[1],
    reverse=True
)

print("BM25:", [d for d, _ in bm25_example])
print("LTR :", [d for d, _ in ltr_example])


BM25: ['10931595', '10906636', '12824568', '13231899', '825728', '19651306', '24998637', '42240424', '17518195', '4465608']
LTR : ['10931595', '10906636', '12824568', '13231899', '825728', '19651306', '24998637', '42240424', '17518195', '4465608']
