# Exercise 10: Re-ranking

Objective: Implement and evaluate a two-stage Information Retrieval pipeline, and analyze the impact of re-ranking on ranking quality.

## Part 1. Corpus preparation
- Load the corpus (documents/passages).
- Load the queries.
- Load the qrels (relevance judgments).

In [2]:
pip install beir



In [3]:
from beir import util
from beir.datasets.data_loader import GenericDataLoader
import pandas as pd



In [4]:
DATASET_NAME = "scifact"
DATA_DIR = "../data/beir_datasets"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{DATASET_NAME}.zip"
util.download_and_unzip(url, DATA_DIR)

'../data/beir_datasets/scifact'

In [5]:
dataset_path = DATA_DIR + "/" + DATASET_NAME
corpus, queries, qrels = GenericDataLoader(dataset_path).load(split="test")

In [6]:
df_corpus = (
    pd.DataFrame.from_dict(corpus, orient="index")
      .reset_index()
      .rename(columns={"index": "doc_id"})
)

df_corpus

Unnamed: 0,doc_id,text,title
0,4983,Alterations of the architecture of cerebral wh...,Microstructural development of human newborn c...
1,5836,Myelodysplastic syndromes (MDS) are age-depend...,Induction of myelodysplasia by myeloid-derived...
2,7912,ID elements are short interspersed elements (S...,"BC1 RNA, the transcript from a master gene for..."
3,18670,DNA methylation plays an important role in bio...,The DNA Methylome of Human Peripheral Blood Mo...
4,19238,Two human Golli (for gene expressed in the oli...,The human myelin basic protein gene is include...
...,...,...,...
5178,195689316,BACKGROUND The main associations of body-mass ...,Body-mass index and cause-specific mortality i...
5179,195689757,A key aberrant biological difference between t...,Targeting metabolic remodeling in glioblastoma...
5180,196664003,A signaling pathway transmits information from...,Signaling architectures that transmit unidirec...
5181,198133135,AIMS Trabecular bone score (TBS) is a surrogat...,"Association between pre-diabetes, type 2 diabe..."


In [7]:
df_queries = (
    pd.DataFrame.from_dict(queries, orient="index", columns=["query"])
      .reset_index()
      .rename(columns={"index": "query_id"})
)

df_queries

Unnamed: 0,query_id,query
0,1,0-dimensional biomaterials show inductive prop...
1,3,"1,000 genomes project enables mapping of genet..."
2,5,1/2000 in UK have abnormal PrP positivity.
3,13,5% of perinatal mortality is due to low birth ...
4,36,A deficiency of vitamin B12 increases blood le...
...,...,...
295,1379,Women with a higher birth weight are more like...
296,1382,aPKCz causes tumour enhancement by affecting g...
297,1385,cSMAC formation enhances weak ligand signalling.
298,1389,mTORC2 regulates intracellular cysteine levels...


In [8]:
rows = []
for qid, docs in qrels.items():
    for doc_id, rel in docs.items():
        rows.append({
            "query_id": qid,
            "doc_id": doc_id,
            "relevance": rel
        })

df_qrels = pd.DataFrame(rows)
df_qrels

Unnamed: 0,query_id,doc_id,relevance
0,1,31715818,1
1,3,14717500,1
2,5,13734012,1
3,13,1606628,1
4,36,5152028,1
...,...,...,...
334,1379,17450673,1
335,1382,17755060,1
336,1385,306006,1
337,1389,23895668,1


In [9]:
# Pick an arbitrary query that has several relevant documents
qid = "133"

print("Query:")
print(df_queries.loc[df_queries["query_id"] == qid, "query"].values[0])

print("\nRelevant documents for this query:")
df_qrels[(df_qrels["query_id"] == qid) & (df_qrels["relevance"] > 0)]

Query:
Assembly of invadopodia is triggered by focal generation of phosphatidylinositol-3,4-biphosphate and the activation of the nonreceptor tyrosine kinase Src.

Relevant documents for this query:


Unnamed: 0,query_id,doc_id,relevance
31,133,38485364,1
32,133,6969753,1
33,133,17934082,1
34,133,16280642,1
35,133,12640810,1


## Part 2. Initial retrieval (baseline)
- Implement the initial retrieval with BM25.
- Compute the metrics Recall@10 and nDCG@10.

In [10]:
# Install the library for BM25
!pip install rank_bm25




In [11]:
# Import libraries
from rank_bm25 import BM25Okapi
import numpy as np

### Prepare the corpus for BM25

In [13]:
# Tokenize each document (whitespace split: case-sensitive, punctuation kept)
tokenized_corpus = [doc.split() for doc in df_corpus['text']]

# Build the BM25 index
bm25 = BM25Okapi(tokenized_corpus)
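
The whitespace tokenization above treats `Src.` and `src` as different terms. A minimal sketch of a normalizing tokenizer follows; `simple_tokenize` is a hypothetical helper that is not part of the original notebook, and whether it changes the scores on SciFact is not measured here.

```python
import re

# Hypothetical helper: lowercase the text and keep only alphanumeric runs.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")

def simple_tokenize(text):
    """Return lowercase alphanumeric tokens, dropping punctuation."""
    return TOKEN_PATTERN.findall(text.lower())

print("Kinase Src.".split())            # ['Kinase', 'Src.']
print(simple_tokenize("Kinase Src."))   # ['kinase', 'src']
```

If a tokenizer like this were used for the corpus, the same function would also have to be applied to the queries so that both sides share one vocabulary.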

### Retrieval for each query

In [14]:
def retrieve_top_k(query_text, k=10):
    query_tokens = query_text.split()
    doc_scores = bm25.get_scores(query_tokens)
    top_indices = np.argsort(doc_scores)[::-1][:k]  # indices of the top-k documents
    top_doc_ids = df_corpus.iloc[top_indices]['doc_id'].values
    top_scores = doc_scores[top_indices]
    return list(zip(top_doc_ids, top_scores))


### Evaluation

In [15]:
def compute_metrics(df_qrels, df_queries, k=10):
    recalls = []
    ndcgs = []

    for _, row in df_queries.iterrows():
        qid = row['query_id']
        query_text = row['query']

        # Relevant documents for this query
        relevant_docs = df_qrels[(df_qrels['query_id'] == qid) & (df_qrels['relevance'] > 0)]['doc_id'].tolist()
        if not relevant_docs:
            continue

        # Retrieve the top-k documents with BM25
        retrieved_docs = [doc_id for doc_id, _ in retrieve_top_k(query_text, k=k)]

        # Recall@k
        num_relevant_retrieved = len(set(retrieved_docs) & set(relevant_docs))
        recall = num_relevant_retrieved / len(relevant_docs)
        recalls.append(recall)

        # nDCG@k with binary relevance
        dcg = 0.0
        for i, doc_id in enumerate(retrieved_docs):
            if doc_id in relevant_docs:
                dcg += 1 / np.log2(i + 2)  # i + 2 because the rank is i + 1 and log2(1) = 0
        # Ideal DCG: all relevant documents ranked first
        ideal_dcg = sum(1 / np.log2(i + 2) for i in range(min(len(relevant_docs), k)))
        ndcg = dcg / ideal_dcg if ideal_dcg > 0 else 0
        ndcgs.append(ndcg)

    return np.mean(recalls), np.mean(ndcgs)
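
To make the two metrics concrete, here is a small worked example on a toy ranking. The document ids and the helper names `recall_at_k` and `ndcg_at_k` are illustrative assumptions; the formulas follow the binary-relevance definitions used in `compute_metrics`.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG@k with a log2(rank + 1) discount."""
    relevant = set(relevant)
    dcg = sum(1 / math.log2(i + 2)
              for i, doc_id in enumerate(retrieved[:k]) if doc_id in relevant)
    idcg = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0.0

# Toy data: two relevant documents, only one of them inside the top 3.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = ["d1", "d2"]

print(recall_at_k(retrieved, relevant, k=3))          # 0.5
print(round(ndcg_at_k(retrieved, relevant, k=3), 4))  # 0.3869
```

The only hit sits at rank 2, so DCG is 1/log2(3), while the ideal ranking would place both relevant documents at ranks 1 and 2.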

### Compute Recall@10 and nDCG@10

In [16]:
recall_10, ndcg_10 = compute_metrics(df_qrels, df_queries, k=10)
print(f"Recall@10: {recall_10:.4f}")
print(f"nDCG@10: {ndcg_10:.4f}")


Recall@10: 0.6281
nDCG@10: 0.5070


## Part 3. Cross-encoder re-ranking
- Re-rank the top-k candidates for each query.
- Identify which documents change position in the top 10.

In [17]:
!pip install sentence-transformers



In [18]:
# Import libraries
from sentence_transformers import CrossEncoder
import pandas as pd
import numpy as np
from tqdm import tqdm



### Load the pre-trained cross-encoder model

In [19]:
# This model scores the relevance of a (query, document) pair
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')  # fast model


### Function to re-rank the top-k documents

In [20]:
def rerank_top_k(query_text, top_docs, k=10):

    if not top_docs:
        return []

    # Build the (query, document text) pairs
    doc_texts = [df_corpus.loc[df_corpus['doc_id'] == doc_id, 'text'].values[0] for doc_id, _ in top_docs]
    pairs = [(query_text, doc_text) for doc_text in doc_texts]

    # Score every pair with the cross-encoder
    cross_scores = cross_encoder.predict(pairs)

    # Re-rank by cross-encoder score, highest first
    reranked = sorted(zip([doc_id for doc_id, _ in top_docs], cross_scores), key=lambda x: x[1], reverse=True)

    # Keep only the top-k
    return reranked[:k]

### Re-rank all queries in the dataset

In [21]:
k = 10
bm25_topk_dict = {}

for _, row in tqdm(df_queries.iterrows(), total=len(df_queries)):
    qid = row['query_id']
    query_text = row['query']

    # Retrieve the top-k with BM25
    top_k_bm25 = retrieve_top_k(query_text, k=k)
    bm25_topk_dict[qid] = top_k_bm25

# Apply re-ranking
reranked_topk_dict = {}
changes_dict = {}

for qid, top_docs in tqdm(bm25_topk_dict.items()):
    reranked_docs = rerank_top_k(df_queries.loc[df_queries['query_id'] == qid, 'query'].values[0], top_docs, k=k)
    reranked_topk_dict[qid] = reranked_docs

    # Identify the documents whose position changed
    original_order = [doc_id for doc_id, _ in top_docs]
    new_order = [doc_id for doc_id, _ in reranked_docs]
    changes = [doc_id for pos, doc_id in enumerate(original_order) if doc_id in new_order and new_order.index(doc_id) != pos]
    changes_dict[qid] = changes




100%|██████████| 300/300 [00:10<00:00, 27.76it/s]
100%|██████████| 300/300 [12:11<00:00,  2.44s/it]


### Results

In [22]:
example_qid = "133"
print("Query:", df_queries.loc[df_queries["query_id"] == example_qid, "query"].values[0])
print("\nTop-10 BM25:")
print([doc_id for doc_id, _ in bm25_topk_dict[example_qid]])
print("\nTop-10 Re-rank Cross-Encoder:")
print([doc_id for doc_id, _ in reranked_topk_dict[example_qid]])
print("\nDocuments that changed position in the top-10:")
print(changes_dict[example_qid])

Query: Assembly of invadopodia is triggered by focal generation of phosphatidylinositol-3,4-biphosphate and the activation of the nonreceptor tyrosine kinase Src.

Top-10 BM25:
['26688294', '37964706', '5270265', '12785130', '45764440', '9507605', '5821617', '23076291', '29073751', '4399311']

Top-10 Re-rank Cross-Encoder:
['9507605', '4399311', '23076291', '37964706', '29073751', '5821617', '45764440', '12785130', '26688294', '5270265']

Documents that changed position in the top-10:
['26688294', '37964706', '5270265', '12785130', '45764440', '9507605', '5821617', '23076291', '29073751', '4399311']
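
Since every one of the ten documents moved, the list of changed ids says little on its own. A possible complement is the signed rank change per document. The sketch below reuses the two orderings printed for query 133; `rank_deltas` is a hypothetical helper, not part of the original notebook.

```python
# Orderings copied from the output for query "133".
bm25_order = ['26688294', '37964706', '5270265', '12785130', '45764440',
              '9507605', '5821617', '23076291', '29073751', '4399311']
ce_order = ['9507605', '4399311', '23076291', '37964706', '29073751',
            '5821617', '45764440', '12785130', '26688294', '5270265']

def rank_deltas(before, after):
    """Map doc_id -> (old_rank, new_rank, old_rank - new_rank), with 1-based ranks.

    A positive delta means the document moved up after re-ranking.
    """
    new_rank = {doc_id: pos for pos, doc_id in enumerate(after, start=1)}
    return {doc_id: (old, new_rank[doc_id], old - new_rank[doc_id])
            for old, doc_id in enumerate(before, start=1)
            if doc_id in new_rank}

deltas = rank_deltas(bm25_order, ce_order)
print(deltas['9507605'])   # (6, 1, 5): promoted from rank 6 to rank 1
print(deltas['4399311'])   # (10, 2, 8)
print(deltas['26688294'])  # (1, 9, -8): demoted from rank 1 to rank 9
```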


## Part 4. LTR re-ranking
- Re-rank the top-k candidates for each query.
- Identify which documents change position in the top 10.

In [23]:
!pip install lightgbm



In [24]:
# Import libraries
import lightgbm as lgb
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import pandas as pd
from tqdm import tqdm

### Feature engineering for LTR

In [25]:
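# BM25 score of a single (query, document) pair.
# Note: this helper is not used in the rest of the notebook; the LTR features
# reuse the first-stage BM25 scores stored in bm25_topk_dict.
def bm25_score(query_text, doc_text):
    query_tokens = query_text.split()
    doc_tokens = doc_text.split()
    return BM25Okapi([doc_tokens]).get_scores(query_tokens)[0]

# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df_corpus['text'])

def tfidf_cosine_sim(query_text, doc_idx):
    query_vec = tfidf_vectorizer.transform([query_text])
    doc_vec = tfidf_matrix[doc_idx]
    # TfidfVectorizer L2-normalizes rows by default, so the dot product is the cosine similarity
    return (query_vec @ doc_vec.T).toarray()[0][0]

# Document length (number of whitespace-separated tokens)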
df_corpus['doc_length'] = df_corpus['text'].apply(lambda x: len(x.split()))

### Prepare the training data for LTR

In [26]:
rows = []
for qid, docs in tqdm(bm25_topk_dict.items()):
    query_text = df_queries.loc[df_queries['query_id'] == qid, 'query'].values[0]
    relevant_docs = df_qrels[(df_qrels['query_id'] == qid) & (df_qrels['relevance'] > 0)]['doc_id'].tolist()

    for rank, (doc_id, bm25_s) in enumerate(docs):
        doc_idx = df_corpus.index[df_corpus['doc_id'] == doc_id][0]
        rows.append({
            'query_id': qid,
            'doc_id': doc_id,
            'bm25': bm25_s,
            'tfidf': tfidf_cosine_sim(query_text, doc_idx),
            'doc_length': df_corpus.loc[doc_idx, 'doc_length'],
            'relevance': 1 if doc_id in relevant_docs else 0
        })

df_ltr = pd.DataFrame(rows)

100%|██████████| 300/300 [00:06<00:00, 47.61it/s]


### Train an LTR model

In [27]:
features = ['bm25', 'tfidf', 'doc_length']
X = df_ltr[features]
y = df_ltr['relevance']

# Scale the features for LightGBM
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# LightGBM dataset; sort=False keeps the group sizes in the same order as the rows of df_ltr
train_data = lgb.Dataset(X_scaled, label=y, group=df_ltr.groupby('query_id', sort=False).size().to_list())

params = {
    'objective': 'lambdarank',
    'metric': 'ndcg',
    'ndcg_eval_at': [10],
    'learning_rate': 0.1,
    'num_leaves': 31,
    'min_data_in_leaf': 1
}

ltr_model = lgb.train(params, train_data, num_boost_round=100)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000303 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 761
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 3
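
The model above is trained on relevance labels for the same 300 test queries that are later used for evaluation, so the LTR metrics in Part 5 are likely optimistic. One way to reduce this leakage is to split at the query level, so that every query's candidates fall entirely into either the training or the evaluation set. The following is only a sketch: `split_query_ids`, the 20% test fraction, and the seed are assumptions, not part of the original notebook.

```python
import random

def split_query_ids(query_ids, test_fraction=0.2, seed=42):
    """Shuffle the unique query ids and split them into (train_ids, test_ids)."""
    ids = sorted(set(query_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_test = max(1, int(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]

# Stand-in ids; in the notebook these would come from df_ltr['query_id'].
train_qids, test_qids = split_query_ids([str(i) for i in range(300)])
print(len(train_qids), len(test_qids))  # 240 60
```

The rows of `df_ltr` whose `query_id` is in `train_qids` would then feed `lgb.Dataset`, and the metrics would be reported only on the queries in `test_qids`. If the dataset ships a separate training split, loading it with `split="train"` would be another option.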


### Re-rank the top-k for each query

In [28]:
ltr_topk_dict = {}
ltr_changes_dict = {}

for qid, top_docs in tqdm(bm25_topk_dict.items()):
    query_text = df_queries.loc[df_queries['query_id'] == qid, 'query'].values[0]

    # Compute the features of each document
    rows_features = []
    for doc_id, bm25_s in top_docs:
        doc_idx = df_corpus.index[df_corpus['doc_id'] == doc_id][0]
        rows_features.append([
            bm25_s,
            tfidf_cosine_sim(query_text, doc_idx),
            df_corpus.loc[doc_idx, 'doc_length']
        ])

    X_topk_scaled = scaler.transform(rows_features)
    ltr_scores = ltr_model.predict(X_topk_scaled)

    reranked = sorted(zip([doc_id for doc_id, _ in top_docs], ltr_scores), key=lambda x: x[1], reverse=True)
    ltr_topk_dict[qid] = reranked

    # Identify the documents whose position changed
    original_order = [doc_id for doc_id, _ in top_docs]
    new_order = [doc_id for doc_id, _ in reranked]
    changes = [doc_id for doc_id in original_order if doc_id in new_order and original_order.index(doc_id) != new_order.index(doc_id)]
    ltr_changes_dict[qid] = changes

100%|██████████| 300/300 [00:16<00:00, 17.99it/s]


### Results

In [29]:
example_qid = "133"
print("Query:", df_queries.loc[df_queries["query_id"] == example_qid, "query"].values[0])
print("\nTop-10 BM25:")
print([doc_id for doc_id, _ in bm25_topk_dict[example_qid]])
print("\nTop-10 LTR Re-rank:")
print([doc_id for doc_id, _ in ltr_topk_dict[example_qid]])
print("\nDocuments that changed position in the top-10:")
print(ltr_changes_dict[example_qid])


Query: Assembly of invadopodia is triggered by focal generation of phosphatidylinositol-3,4-biphosphate and the activation of the nonreceptor tyrosine kinase Src.

Top-10 BM25:
['26688294', '37964706', '5270265', '12785130', '45764440', '9507605', '5821617', '23076291', '29073751', '4399311']

Top-10 LTR Re-rank:
['9507605', '26688294', '12785130', '45764440', '5270265', '29073751', '4399311', '5821617', '23076291', '37964706']

Documents that changed position in the top-10:
['26688294', '37964706', '5270265', '12785130', '45764440', '9507605', '5821617', '23076291', '29073751', '4399311']


## Part 5. Post re-ranking evaluation

Compute the metrics:
- nDCG@10
- MAP
- Recall@10

### Evaluation function after re-ranking

In [30]:
import numpy as np
from sklearn.metrics import average_precision_score

def evaluate_ranking(topk_dict, df_qrels, df_queries, k=10):
    recalls = []
    ndcgs = []
    aps = []

    for _, row in df_queries.iterrows():
        qid = row['query_id']

        # Relevant documents
        relevant_docs = df_qrels[(df_qrels['query_id'] == qid) & (df_qrels['relevance'] > 0)]['doc_id'].tolist()
        if not relevant_docs:
            continue

        # Retrieved documents
        retrieved_docs = [doc_id for doc_id, _ in topk_dict[qid][:k]]

        # Recall@k
        num_relevant_retrieved = len(set(retrieved_docs) & set(relevant_docs))
        recall = num_relevant_retrieved / len(relevant_docs)
        recalls.append(recall)

        # nDCG@k
        dcg = 0.0
        for i, doc_id in enumerate(retrieved_docs):
            if doc_id in relevant_docs:
                dcg += 1 / np.log2(i + 2)
        ideal_dcg = sum(1 / np.log2(i + 2) for i in range(min(len(relevant_docs), k)))
        ndcg = dcg / ideal_dcg if ideal_dcg > 0 else 0
        ndcgs.append(ndcg)

        # MAP: average precision computed over the top-k list only.
        # Note: sklearn normalizes by the number of relevant documents found in the
        # top-k, not by all relevant documents for the query.
        y_true = [1 if doc_id in relevant_docs else 0 for doc_id in retrieved_docs]
        y_score = np.arange(len(retrieved_docs), 0, -1)  # artificial decreasing scores that encode the rank
        aps.append(average_precision_score(y_true, y_score))

    return np.mean(ndcgs), np.mean(aps), np.mean(recalls)
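
The MAP value above is computed with `average_precision_score` on the ten retrieved documents only, so relevant documents that never reach the top-10 do not lower the score. For comparison, a minimal sketch of average precision normalized by the total number of relevant documents is shown below. `average_precision_at_k` is a hypothetical helper, and using it would give values different from those reported in this notebook.

```python
def average_precision_at_k(retrieved, relevant, k):
    """AP over the top-k ranking, divided by the total number of relevant documents."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this hit's rank
    return precision_sum / len(relevant)

# Toy data: three relevant documents, one of which ("d9") is never retrieved.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = ["d1", "d2", "d9"]

# Hits at ranks 2 and 4 give (1/2 + 2/4) / 3.
print(round(average_precision_at_k(retrieved, relevant, k=4), 4))  # 0.3333
```

Dividing by the two retrieved relevant documents instead of all three would give 0.5 for the same ranking, which illustrates how the two normalizations can diverge.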

### BM25 evaluation

In [31]:
ndcg_bm25, map_bm25, recall_bm25 = evaluate_ranking(bm25_topk_dict, df_qrels, df_queries, k=10)
print("BM25:")
print(f"nDCG@10: {ndcg_bm25:.4f}, MAP: {map_bm25:.4f}, Recall@10: {recall_bm25:.4f}\n")



BM25:
nDCG@10: 0.5070, MAP: 0.4743, Recall@10: 0.6281





### Cross-encoder evaluation

In [32]:
ndcg_ce, map_ce, recall_ce = evaluate_ranking(reranked_topk_dict, df_qrels, df_queries, k=10)
print("Cross-Encoder Re-ranking:")
print(f"nDCG@10: {ndcg_ce:.4f}, MAP: {map_ce:.4f}, Recall@10: {recall_ce:.4f}\n")



Cross-Encoder Re-ranking:
nDCG@10: 0.5649, MAP: 0.5492, Recall@10: 0.6281





### LTR evaluation

In [33]:
ndcg_ltr, map_ltr, recall_ltr = evaluate_ranking(ltr_topk_dict, df_qrels, df_queries, k=10)
print("LTR Re-ranking:")
print(f"nDCG@10: {ndcg_ltr:.4f}, MAP: {map_ltr:.4f}, Recall@10: {recall_ltr:.4f}\n")



LTR Re-ranking:
nDCG@10: 0.6316, MAP: 0.6433, Recall@10: 0.6281





### Comparison

In [34]:
print("Metrics summary:")
metrics_summary = pd.DataFrame({
    "Model": ["BM25", "Cross-Encoder", "LTR"],
    "nDCG@10": [ndcg_bm25, ndcg_ce, ndcg_ltr],
    "MAP": [map_bm25, map_ce, map_ltr],
    "Recall@10": [recall_bm25, recall_ce, recall_ltr]
})
print(metrics_summary)


Metrics summary:
           Model   nDCG@10       MAP  Recall@10
0           BM25  0.506984  0.474284   0.628056
1  Cross-Encoder  0.564935  0.549169   0.628056
2            LTR  0.631599  0.643333   0.628056
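
Recall@10 is identical for the three systems because both re-rankers only reorder the same ten BM25 candidates, so the set of retrieved documents never changes. A common variant of the two-stage setup is to retrieve a larger pool in the first stage and cut to the top-k only after re-ranking. The sketch below illustrates the idea with toy scores; `two_stage_top_k`, the pool size of 100, and the toy data are assumptions rather than part of the original notebook.

```python
def two_stage_top_k(candidates, rerank_score, pool_size=100, k=10):
    """Re-score the first `pool_size` candidates and return the best k.

    candidates: list of (doc_id, first_stage_score), sorted by descending score.
    rerank_score: callable mapping a doc_id to its second-stage score.
    """
    pool = candidates[:pool_size]
    rescored = [(doc_id, rerank_score(doc_id)) for doc_id, _ in pool]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:k]

# Toy data: "c" is ranked last by the first stage but best by the re-ranker.
candidates = [("a", 3.0), ("b", 2.0), ("c", 1.0)]
toy_scores = {"a": 0.1, "b": 0.2, "c": 0.9}

print(two_stage_top_k(candidates, toy_scores.get, pool_size=2, k=2))
# [('b', 0.2), ('a', 0.1)]  -> "c" can never enter when the pool equals k
print(two_stage_top_k(candidates, toy_scores.get, pool_size=3, k=2))
# [('c', 0.9), ('b', 0.2)]  -> a larger pool lets the re-ranker promote "c"
```

In the notebook this would correspond to calling `retrieve_top_k` with a larger `k` before re-ranking, at the cost of more cross-encoder or feature computations per query.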
