# Exercise 10: Re-ranking

**Objective:** Implement and evaluate a two-stage Information Retrieval pipeline, and analyze the impact of re-ranking on ranking quality.

## Part 1. Corpus preparation

* Load the corpus (documents/passages).
* Load the queries.
* Load the qrels (relevance judgments).
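
The cells below treat the three loaded objects as plain dictionaries keyed by string ids. A minimal sketch of that shape, using made-up ids and placeholder text (nothing here comes from the real dataset):

```python
# Toy stand-ins for the objects returned by GenericDataLoader(...).load(split="test").
# All ids and strings are invented for illustration.
corpus = {
    "d1": {"title": "Podosome formation", "text": "Src signalling triggers podosome assembly."},
    "d2": {"title": "Bone density", "text": "Trabecular bone score and diabetes."},
}
queries = {"q1": "Src activation triggers invadopodia assembly."}
qrels = {"q1": {"d1": 1}}  # query_id -> {doc_id: relevance grade}

# Number of relevant documents (grade > 0) per query.
n_relevantes = {
    qid: sum(1 for rel in docs.values() if rel > 0)
    for qid, docs in qrels.items()
}
print(n_relevantes)
```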

In [1]:
from beir import util
from beir.datasets.data_loader import GenericDataLoader
import pandas as pd


In [2]:
DATASET_NAME = "scifact"
DATA_DIR = "../data/beir_datasets"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{DATASET_NAME}.zip"
util.download_and_unzip(url, DATA_DIR)

'../data/beir_datasets\\scifact'

In [3]:
dataset_path = DATA_DIR + "/" + DATASET_NAME
corpus, queries, qrels = GenericDataLoader(dataset_path).load(split="test")

In [4]:
df_corpus = (
    pd.DataFrame.from_dict(corpus, orient="index")
      .reset_index()
      .rename(columns={"index": "doc_id"})
)

df_corpus

|  | doc_id | text | title |
|---|---|---|---|
| 0 | 4983 | Alterations of the architecture of cerebral wh... | Microstructural development of human newborn c... |
| 1 | 5836 | Myelodysplastic syndromes (MDS) are age-depend... | Induction of myelodysplasia by myeloid-derived... |
| 2 | 7912 | ID elements are short interspersed elements (S... | BC1 RNA, the transcript from a master gene for... |
| 3 | 18670 | DNA methylation plays an important role in bio... | The DNA Methylome of Human Peripheral Blood Mo... |
| 4 | 19238 | Two human Golli (for gene expressed in the oli... | The human myelin basic protein gene is include... |
| ... | ... | ... | ... |
| 5178 | 195689316 | BACKGROUND The main associations of body-mass ... | Body-mass index and cause-specific mortality i... |
| 5179 | 195689757 | A key aberrant biological difference between t... | Targeting metabolic remodeling in glioblastoma... |
| 5180 | 196664003 | A signaling pathway transmits information from... | Signaling architectures that transmit unidirec... |
| 5181 | 198133135 | AIMS Trabecular bone score (TBS) is a surrogat... | Association between pre-diabetes, type 2 diabe... |


In [5]:
df_queries = (
    pd.DataFrame.from_dict(queries, orient="index", columns=["query"])
      .reset_index()
      .rename(columns={"index": "query_id"})
)

df_queries

|  | query_id | query |
|---|---|---|
| 0 | 1 | 0-dimensional biomaterials show inductive prop... |
| 1 | 3 | 1,000 genomes project enables mapping of genet... |
| 2 | 5 | 1/2000 in UK have abnormal PrP positivity. |
| 3 | 13 | 5% of perinatal mortality is due to low birth ... |
| 4 | 36 | A deficiency of vitamin B12 increases blood le... |
| ... | ... | ... |
| 295 | 1379 | Women with a higher birth weight are more like... |
| 296 | 1382 | aPKCz causes tumour enhancement by affecting g... |
| 297 | 1385 | cSMAC formation enhances weak ligand signalling. |
| 298 | 1389 | mTORC2 regulates intracellular cysteine levels... |


In [6]:
rows = []
for qid, docs in qrels.items():
    for doc_id, rel in docs.items():
        rows.append({
            "query_id": qid,
            "doc_id": doc_id,
            "relevance": rel
        })

df_qrels = pd.DataFrame(rows)
df_qrels

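|  | query_id | doc_id | relevance |
|---|---|---|---|
| 0 | 1 | 31715818 | 1 |
| 1 | 3 | 14717500 | 1 |
| 2 | 5 | 13734012 | 1 |
| 3 | 13 | 1606628 | 1 |
| 4 | 36 | 5152028 | 1 |
| ... | ... | ... | ... |
| 334 | 1379 | 17450673 | 1 |
| 335 | 1382 | 17755060 | 1 |
| 336 | 1385 | 306006 | 1 |
| 337 | 1389 | 23895668 | 1 |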


In [7]:
# Pick a query that has several relevant documents
qid = "133"

print("Query:")
print(df_queries.loc[df_queries["query_id"] == qid, "query"].values[0])

print("\nRelevant documents for this query:")
df_qrels[(df_qrels["query_id"] == qid) & (df_qrels["relevance"] > 0)]

Query:
Assembly of invadopodia is triggered by focal generation of phosphatidylinositol-3,4-biphosphate and the activation of the nonreceptor tyrosine kinase Src.

Relevant documents for this query:

|  | query_id | doc_id | relevance |
|---|---|---|---|
| 31 | 133 | 38485364 | 1 |
| 32 | 133 | 6969753 | 1 |
| 33 | 133 | 17934082 | 1 |
| 34 | 133 | 16280642 | 1 |
| 35 | 133 | 12640810 | 1 |


## Part 2. Initial retrieval (baseline)

* Implement the initial retrieval stage with BM25.
* Compute the metrics Recall@10 and nDCG@10.
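
For reference, the two metrics can be written directly from their textbook definitions. The sketch below is illustrative only: the helper names `recall_at_k`, `dcg` and `ndcg_at_k` are assumptions, the gain is linear in the relevance grade, and the ideal ranking is built from all judged documents of the query rather than only the retrieved ones.

```python
import math

def recall_at_k(ranked_ids, relevance, k=10):
    """Fraction of the relevant documents that appear in the top k."""
    relevant = {d for d, rel in relevance.items() if rel > 0}
    if not relevant:
        return 0.0
    hits = sum(1 for d in ranked_ids[:k] if d in relevant)
    return hits / len(relevant)

def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(ranked_ids, relevance, k=10):
    """DCG of the ranking divided by the DCG of the best possible ranking."""
    gains = [relevance.get(d, 0) for d in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example (hypothetical ids): two relevant documents, one retrieved at rank 2.
relevance = {"d1": 1, "d2": 1}
ranking = ["d9", "d1", "d7"]
print(recall_at_k(ranking, relevance))
print(ndcg_at_k(ranking, relevance))
```

With this definition a query whose relevant documents are all missed scores 0 instead of being left out of the average.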

In [8]:
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import re
import numpy as np

def tokenizar(texto):
    # Lowercase the text and keep only word tokens (\w+: letters, digits, underscore)
    tokens = re.findall(r'\b\w+\b', texto.lower())
    # Remove English stopwords
    tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    return tokens

# Tokenize every document in the corpus (title + text)
corpus_tokenizado = []
for doc_id, doc_data in corpus.items():
    texto_completo = doc_data['title'] + " " + doc_data['text']
    corpus_tokenizado.append(tokenizar(texto_completo))

# Build the BM25 index
bm25 = BM25Okapi(corpus_tokenizado)
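
To make the scoring step less of a black box, here is a small pure-Python sketch of Okapi BM25 over tokenized documents. The function name `bm25_scores`, the IDF variant and the defaults for `k1` and `b` are assumptions of this sketch and need not match what `rank_bm25` does internally.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed, always-positive IDF (an assumption of this sketch).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    ["src", "kinase", "activation"],
    ["bone", "density", "score"],
    ["src", "src", "podosome", "formation"],
]
scores = bm25_scores(["src", "podosome"], docs)
print(scores)
```

The third toy document matches both query terms and scores highest, while the second shares no term with the query and scores zero.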

In [9]:
# Function that returns the BM25 ranking
def obtener_ranking_bm25(query_text, bm25, corpus, top_k=10):
    """
    Return the top_k (doc_id, score) pairs for a query using BM25
    """
    query_tokenizada = tokenizar(query_text)
    scores = bm25.get_scores(query_tokenizada)

    # Pair each doc_id with its score and sort by descending score
    # (relies on corpus.keys() having the same order used to build the index)
    doc_ids = list(corpus.keys())
    ranking = sorted(zip(doc_ids, scores), key=lambda x: x[1], reverse=True)

    return ranking[:top_k]

# Try it on the query chosen above
query_id = "133"
query_text = queries[query_id]

print(f"Query: {query_text}\n")

ranking_bm25 = obtener_ranking_bm25(query_text, bm25, corpus, top_k=10)

print("Top 10 documents according to BM25:")
for i, (doc_id, score) in enumerate(ranking_bm25, 1):
    print(f"{i}. Doc {doc_id} - Score: {score:.4f}")
    print(f"   Title: {corpus[doc_id]['title'][:80]}...")
    print()

Query: Assembly of invadopodia is triggered by focal generation of phosphatidylinositol-3,4-biphosphate and the activation of the nonreceptor tyrosine kinase Src.

Top 10 documents according to BM25:
1. Doc 16280642 - Score: 27.9590
   Title: Sequential signals toward podosome formation in NIH-src cells...

2. Doc 19752008 - Score: 27.0219
   Title: A specific inhibitor of phosphatidylinositol 3-kinase, 2-(4-morpholinyl)-8-pheny...

3. Doc 5270265 - Score: 26.9093
   Title: Combating trastuzumab resistance by targeting SRC, a common node downstream of m...

4. Doc 45764440 - Score: 25.5834
   Title: Inhibition of SRC expression and activity inhibits tumor progression and metasta...

5. Doc 26688294 - Score: 24.7242
   Title: Schizophrenia susceptibility pathway neuregulin 1–ErbB4 suppresses Src upregulat...

6. Doc 1782201 - Score: 22.8996
   Title: Integrin αvβ3/c-src “Oncogenic Unit” Promotes Anchorage-independence and Tumor P...

7. Doc 12785130 - Score: 22.4927
   Title: The regul

In [10]:
# Compute the baseline metrics
def calcular_metricas_baseline(queries, qrels, bm25, corpus, k=10):
    """
    Compute Recall@k and nDCG@k averaged over the queries
    """
    from sklearn.metrics import ndcg_score

    recalls = []
    ndcgs = []

    for query_id, query_text in queries.items():
        if query_id not in qrels:
            continue

        # Relevant documents for the query
        docs_relevantes = set(qrels[query_id].keys())

        # Get the BM25 ranking
        ranking = obtener_ranking_bm25(query_text, bm25, corpus, top_k=k)
        docs_recuperados = [doc_id for doc_id, _ in ranking]

        # Recall@k
        docs_relevantes_recuperados = set(docs_recuperados) & docs_relevantes
        recall = len(docs_relevantes_recuperados) / len(docs_relevantes) if docs_relevantes else 0
        recalls.append(recall)

        # Relevance labels of the retrieved documents, in rank order
        relevancia_real = [qrels[query_id].get(doc_id, 0) for doc_id in docs_recuperados]
        relevancia_ideal = sorted(relevancia_real, reverse=True)

        # Caveat: ndcg_score expects (y_true, y_score). Here the sorted labels are
        # passed as y_true and the observed labels as y_score, and queries with no
        # relevant document in the top k are skipped instead of scored as 0, so
        # this value is not a standard nDCG@k.
        if sum(relevancia_real) > 0:
            ndcg = ndcg_score([relevancia_ideal], [relevancia_real])
            ndcgs.append(ndcg)

    return {
        'recall@10': np.mean(recalls),
        'ndcg@10': np.mean(ndcgs)
    }

# Compute the baseline metrics
metricas_baseline = calcular_metricas_baseline(queries, qrels, bm25, corpus, k=10)

print("Baseline metrics (BM25):")
print(f"Recall@10: {metricas_baseline['recall@10']:.4f}")
print(f"nDCG@10: {metricas_baseline['ndcg@10']:.4f}")

Baseline metrics (BM25):
Recall@10: 0.7939
nDCG@10: 0.7991
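
The nDCG average above is taken only over queries with at least one relevant document in the top k, while Recall is averaged over every judged query. A small sketch, with made-up per-query scores, of how much that choice can move a mean:

```python
from statistics import mean

# Hypothetical per-query nDCG@10 values; 0.0 means no relevant document was retrieved.
ndcg_por_query = {"q1": 1.0, "q2": 0.5, "q3": 0.0, "q4": 0.0}

# Mean over every judged query (zeros included).
media_todas = mean(ndcg_por_query.values())

# Mean over the queries with a non-zero score only.
media_sin_ceros = mean(v for v in ndcg_por_query.values() if v > 0)

print(f"all queries: {media_todas:.3f}")
print(f"non-zero only: {media_sin_ceros:.3f}")
```

Averaging over all judged queries keeps the two metrics comparable and penalizes complete misses.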


## Part 3. Implementing _cross-encoder_ re-ranking

* Re-rank the top-k candidates for each query.
* Identify which documents change position in the top 10.
In [11]:
from sentence_transformers import CrossEncoder

# Load the cross-encoder model
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def reranking_cross_encoder(query_text, ranking_inicial, corpus, cross_encoder, top_k=10):
    """
    Re-rank the candidate documents with a cross-encoder
    """
    pares = []
    doc_ids = []

    for doc_id, _ in ranking_inicial:
        doc_text = corpus[doc_id]['title'] + " " + corpus[doc_id]['text']
        # Keep only the first 500 characters of each document
        doc_text = doc_text[:500]
        pares.append([query_text, doc_text])
        doc_ids.append(doc_id)

    # Score every (query, document) pair with the cross-encoder
    scores_ce = cross_encoder.predict(pares)

    # Build the new ranking, sorted by descending cross-encoder score
    ranking_reranked = sorted(zip(doc_ids, scores_ce), key=lambda x: x[1], reverse=True)

    return ranking_reranked[:top_k]

# Try it on the query chosen above
query_id = "133"
query_text = queries[query_id]

# Initial BM25 ranking (top-100 candidates)
ranking_bm25_inicial = obtener_ranking_bm25(query_text, bm25, corpus, top_k=100)

# Re-rank with the cross-encoder
ranking_reranked = reranking_cross_encoder(query_text, ranking_bm25_inicial, corpus, cross_encoder, top_k=10)

# Identify which documents change position in the top 10
print(f"Query: {query_text}\n")
print("Changes in the top 10:")
print(f"{'New pos':<10} {'Doc ID':<12} {'BM25 pos':<10} {'Change'}")
print("-"*50)

# Map each doc_id in the BM25 top 10 to its position
pos_bm25 = {doc_id: i+1 for i, (doc_id, _) in enumerate(ranking_bm25_inicial[:10])}

for i, (doc_id, score) in enumerate(ranking_reranked, 1):
    pos_anterior = pos_bm25.get(doc_id, ">10")
    if isinstance(pos_anterior, int):
        # Positive values mean the document moved up
        cambio = pos_anterior - i
        cambio_str = f"+{cambio}" if cambio > 0 else str(cambio) if cambio < 0 else "="
    else:
        cambio_str = "NEW"
    print(f"{i:<10} {doc_id:<12} {str(pos_anterior):<10} {cambio_str}")

Query: Assembly of invadopodia is triggered by focal generation of phosphatidylinositol-3,4-biphosphate and the activation of the nonreceptor tyrosine kinase Src.

Changes in the top 10:
New pos    Doc ID       BM25 pos   Change
--------------------------------------------------
1          12640810     8          +7
2          14328288     >10        NEW
3          35660758     >10        NEW
4          19752008     2          -2
5          86694016     >10        NEW
6          16280642     1          -5
7          5403286      >10        NEW
8          14819804     >10        NEW
9          9063688      >10        NEW
10         21295300     >10        NEW
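
The position comparison printed above is reused later for the LTR ranking, so it could be factored into a helper. A possible sketch, where `position_changes` is a hypothetical name and the rankings are plain lists of doc ids:

```python
def position_changes(before, after, top_n=10):
    """Compare two rankings (lists of doc ids) and describe each move in `after`."""
    old_pos = {doc_id: i for i, doc_id in enumerate(before[:top_n], start=1)}
    rows = []
    for new_pos, doc_id in enumerate(after[:top_n], start=1):
        prev = old_pos.get(doc_id)
        if prev is None:
            change = "NEW"                   # was outside the previous top_n
        elif prev == new_pos:
            change = "="
        else:
            change = f"{prev - new_pos:+d}"  # positive: moved up
        rows.append((new_pos, doc_id, prev, change))
    return rows

# Toy rankings with made-up ids.
before = ["a", "b", "c", "d"]
after = ["c", "a", "x", "b"]
rows = position_changes(before, after, top_n=4)
for row in rows:
    print(row)
```

Returning tuples instead of printing keeps the helper usable both for display and for counting how many documents entered or left the top 10.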


## Part 4. Implementing _LTR_ (learning-to-rank) re-ranking

* Re-rank the top-k candidates for each query.
* Identify which documents change position in the top 10.

In [12]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def obtener_caracteristicas(query_text, doc_text, bm25_score):
    """
    Extract the feature vector for the LTR model
    """
    query_tokens = set(tokenizar(query_text))
    doc_tokens = set(tokenizar(doc_text))

    features = []

    # Feature 1: BM25 score
    features.append(bm25_score)

    # Feature 2: number of distinct terms shared by query and document
    terminos_comunes = len(query_tokens & doc_tokens)
    features.append(terminos_comunes)

    # Feature 3: fraction of the query terms covered by the document
    if len(query_tokens) > 0:
        features.append(terminos_comunes / len(query_tokens))
    else:
        features.append(0)

    # Feature 4: log of the number of distinct document terms
    features.append(np.log(len(doc_tokens) + 1))

    # Feature 5: word count of doc_text capped at 20 (a rough stand-in for the
    # title length; doc_text is title + text, so the title is not isolated)
    features.append(len(doc_text.split()[:20]))

    return features



In [13]:
def preparar_datos_entrenamiento(queries, qrels, bm25, corpus, top_k=100):
    """
    Build the training data (features and labels) for LTR
    """
    X = []
    y = []

    for query_id, query_text in queries.items():
        if query_id not in qrels:
            continue

        # Get the BM25 ranking
        ranking = obtener_ranking_bm25(query_text, bm25, corpus, top_k=top_k)

        for doc_id, bm25_score in ranking:
            doc_text = corpus[doc_id]['title'] + " " + corpus[doc_id]['text']
            doc_text = doc_text[:500]

            features = obtener_caracteristicas(query_text, doc_text, bm25_score)
            X.append(features)

            # Label: relevance grade from the qrels (0 if the pair is not judged)
            relevancia = qrels[query_id].get(doc_id, 0)
            y.append(relevancia)

    return np.array(X), np.array(y)

# Build the training data
# Caveat: these are the same test queries and qrels used for evaluation in
# Part 5, so the LTR scores reported there are optimistic.
X_train, y_train = preparar_datos_entrenamiento(queries, qrels, bm25, corpus, top_k=100)

print(f"Training data: {X_train.shape}")
print(f"Relevance distribution: {np.bincount(y_train.astype(int))}")

# Train the LTR model
print("\nTraining LTR model...")
ltr_model = RandomForestRegressor(n_estimators=100, random_state=42)
ltr_model.fit(X_train, y_train)
print("Model trained")

Training data: (30000, 5)
Relevance distribution: [29699   301]

Training LTR model...
Model trained
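
The model above is fitted on features built from the test queries and their qrels, the same data later used to score it. One way to keep training and evaluation apart is to split at the query level. A minimal sketch, where `split_queries`, the 30% test fraction and the seed are illustrative assumptions:

```python
import random

def split_queries(query_ids, test_fraction=0.3, seed=42):
    """Shuffle query ids and split them so no query lands in both sets."""
    ids = sorted(query_ids)            # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]

# Hypothetical query ids; in the notebook these would be the keys of `qrels`.
all_ids = [str(i) for i in range(10)]
train_ids, test_ids = split_queries(all_ids)
print(len(train_ids), len(test_ids))
```

Training features would then be built only from `train_ids` and the metrics computed only on `test_ids`. If the dataset provides a separate training split, loading it with `load(split="train")` would serve the same purpose.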


In [14]:
def reranking_ltr(query_text, ranking_inicial, corpus, bm25, ltr_model, top_k=10):
    """
    Re-rank the candidate documents with the LTR model
    """
    doc_ids = []
    features_list = []

    for doc_id, bm25_score in ranking_inicial:
        doc_text = corpus[doc_id]['title'] + " " + corpus[doc_id]['text']
        doc_text = doc_text[:500]

        features = obtener_caracteristicas(query_text, doc_text, bm25_score)
        features_list.append(features)
        doc_ids.append(doc_id)

    # Predict a relevance score for each candidate with the LTR model
    X = np.array(features_list)
    scores_ltr = ltr_model.predict(X)

    # Build the new ranking, sorted by descending predicted score
    ranking_reranked = sorted(zip(doc_ids, scores_ltr), key=lambda x: x[1], reverse=True)

    return ranking_reranked[:top_k]

In [15]:
# Try it on the query chosen above
query_id = "133"
query_text = queries[query_id]

# Initial BM25 ranking
ranking_bm25_inicial = obtener_ranking_bm25(query_text, bm25, corpus, top_k=100)

# Re-rank with LTR
ranking_ltr = reranking_ltr(query_text, ranking_bm25_inicial, corpus, bm25, ltr_model, top_k=10)

# Identify which documents change position in the top 10
print(f"\nQuery: {query_text}\n")
print("Changes in the top 10 (LTR):")
print(f"{'New pos':<10} {'Doc ID':<12} {'BM25 pos':<10} {'Change'}")
print("-"*50)

pos_bm25 = {doc_id: i+1 for i, (doc_id, _) in enumerate(ranking_bm25_inicial[:10])}

for i, (doc_id, score) in enumerate(ranking_ltr, 1):
    pos_anterior = pos_bm25.get(doc_id, ">10")
    if isinstance(pos_anterior, int):
        # Positive values mean the document moved up
        cambio = pos_anterior - i
        cambio_str = f"+{cambio}" if cambio > 0 else str(cambio) if cambio < 0 else "="
    else:
        cambio_str = "NEW"
    print(f"{i:<10} {doc_id:<12} {str(pos_anterior):<10} {cambio_str}")


Query: Assembly of invadopodia is triggered by focal generation of phosphatidylinositol-3,4-biphosphate and the activation of the nonreceptor tyrosine kinase Src.

Changes in the top 10 (LTR):
New pos    Doc ID       BM25 pos   Change
--------------------------------------------------
1          16280642     1          =
2          12640810     8          +6
3          6969753      >10        NEW
4          17934082     >10        NEW
5          45764440     4          -1
6          5914739      10         +4
7          5821617      >10        NEW
8          26688294     5          -3
9          19752008     2          -7
10         5270265      3          -7


## Part 5. Evaluation after re-ranking

Compute the following metrics:
* nDCG@10
* MAP
* Recall@10
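
As a reference point for MAP, average precision is usually normalized by the total number of relevant documents for the query, so relevant documents that are never retrieved lower the score. A sketch under that definition (the helper names and the toy ids are assumptions):

```python
def average_precision(ranked_ids, relevant_ids, k=None):
    """Sum of precision@i at the ranks i holding a relevant document,
    divided by the total number of relevant documents."""
    if not relevant_ids:
        return 0.0
    ranked = ranked_ids if k is None else ranked_ids[:k]
    hits = 0
    precision_sum = 0.0
    for i, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / i
    return precision_sum / len(relevant_ids)

def mean_average_precision(rankings, relevant_by_query, k=None):
    """Average AP over every query that has relevance judgments."""
    aps = [
        average_precision(rankings.get(qid, []), rel, k)
        for qid, rel in relevant_by_query.items()
    ]
    return sum(aps) / len(aps) if aps else 0.0

# Toy data with hypothetical ids.
rankings = {"q1": ["d1", "d5", "d2"], "q2": ["d9", "d3"]}
relevant_by_query = {"q1": {"d1", "d2"}, "q2": {"d3", "d4"}}
map_score = mean_average_precision(rankings, relevant_by_query)
print(f"{map_score:.4f}")
```

In the toy data, `q2` retrieves only one of its two relevant documents, so its AP is halved relative to a version that divides by the number of relevant documents found.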

In [16]:
from sklearn.metrics import ndcg_score

def calcular_metricas_completas(queries, qrels, bm25, corpus, cross_encoder, ltr_model, k=10):
    """
    Compute the metrics for BM25, Cross-Encoder and LTR
    """
    metricas = {
        'BM25': {'recall': [], 'ndcg': [], 'map': []},
        'CrossEncoder': {'recall': [], 'ndcg': [], 'map': []},
        'LTR': {'recall': [], 'ndcg': [], 'map': []}
    }

    for query_id, query_text in queries.items():
        if query_id not in qrels:
            continue

        docs_relevantes = set(qrels[query_id].keys())

        # BM25 (top k only)
        ranking_bm25 = obtener_ranking_bm25(query_text, bm25, corpus, top_k=k)
        docs_bm25 = [doc_id for doc_id, _ in ranking_bm25]

        # Cross-encoder over the BM25 top 100
        ranking_bm25_100 = obtener_ranking_bm25(query_text, bm25, corpus, top_k=100)
        ranking_ce = reranking_cross_encoder(query_text, ranking_bm25_100, corpus, cross_encoder, top_k=k)
        docs_ce = [doc_id for doc_id, _ in ranking_ce]

        # LTR over the same BM25 top 100
        ranking_ltr = reranking_ltr(query_text, ranking_bm25_100, corpus, bm25, ltr_model, top_k=k)
        docs_ltr = [doc_id for doc_id, _ in ranking_ltr]

        for metodo, docs_recuperados in [('BM25', docs_bm25), ('CrossEncoder', docs_ce), ('LTR', docs_ltr)]:
            # Recall@k
            docs_rel_recuperados = set(docs_recuperados) & docs_relevantes
            recall = len(docs_rel_recuperados) / len(docs_relevantes) if docs_relevantes else 0
            metricas[metodo]['recall'].append(recall)

            # nDCG@k
            # Caveat: ndcg_score expects (y_true, y_score). Here the sorted labels are
            # passed as y_true and the observed labels as y_score, and queries with no
            # relevant document in the top k are skipped, so this is not a standard nDCG@k.
            relevancia_real = [qrels[query_id].get(doc_id, 0) for doc_id in docs_recuperados]
            relevancia_ideal = sorted(relevancia_real, reverse=True)

            if sum(relevancia_real) > 0:
                ndcg = ndcg_score([relevancia_ideal], [relevancia_real])
                metricas[metodo]['ndcg'].append(ndcg)

            # MAP: average precision over the top k, normalized by the number of
            # relevant documents retrieved (not by the total number of relevant documents)
            precisions = []
            relevantes_encontrados = 0
            for i, doc_id in enumerate(docs_recuperados, 1):
                if doc_id in docs_relevantes:
                    relevantes_encontrados += 1
                    precisions.append(relevantes_encontrados / i)

            ap = np.mean(precisions) if precisions else 0
            metricas[metodo]['map'].append(ap)

    # Average each metric over the queries
    resultados = {}
    for metodo in metricas:
        resultados[metodo] = {
            f'Recall@{k}': np.mean(metricas[metodo]['recall']),
            f'nDCG@{k}': np.mean(metricas[metodo]['ndcg']),
            'MAP': np.mean(metricas[metodo]['map'])
        }

    return resultados

In [17]:
# Compute the metrics
print("Computing metrics for all methods...\n")
resultados = calcular_metricas_completas(queries, qrels, bm25, corpus, cross_encoder, ltr_model, k=10)


print("="*70)
print(f"{'Method':<20} {'Recall@10':<15} {'nDCG@10':<15} {'MAP':<15}")
print("="*70)

for metodo, metricas in resultados.items():
    print(f"{metodo:<20} {metricas['Recall@10']:<15.4f} {metricas['nDCG@10']:<15.4f} {metricas['MAP']:<15.4f}")

print("="*70)

# Relative change of each re-ranker with respect to the BM25 baseline, in percent
print("\nImprovement over BM25:")
baseline_recall = resultados['BM25']['Recall@10']
baseline_ndcg = resultados['BM25']['nDCG@10']
baseline_map = resultados['BM25']['MAP']

for metodo in ['CrossEncoder', 'LTR']:
    mejora_recall = ((resultados[metodo]['Recall@10'] - baseline_recall) / baseline_recall) * 100
    mejora_ndcg = ((resultados[metodo]['nDCG@10'] - baseline_ndcg) / baseline_ndcg) * 100
    mejora_map = ((resultados[metodo]['MAP'] - baseline_map) / baseline_map) * 100

    print(f"\n{metodo}:")
    print(f"  Recall@10: {mejora_recall:+.2f}%")
    print(f"  nDCG@10: {mejora_ndcg:+.2f}%")
    print(f"  MAP: {mejora_map:+.2f}%")

Computing metrics for all methods...

Method               Recall@10       nDCG@10         MAP
BM25                 0.7939          0.7991          0.6321
CrossEncoder         0.7263          0.7787          0.5564
LTR                  0.8859          1.0000          0.8933

Improvement over BM25:

CrossEncoder:
  Recall@10: -8.52%
  nDCG@10: -2.56%
  MAP: -11.98%

LTR:
  Recall@10: +11.58%
  nDCG@10: +25.13%
  MAP: +41.34%
