# Sparse Retrieval: BM25

### CLEF 2025 - CheckThat! Lab  - Task 4 Scientific Web Discourse - Subtask 4b (Scientific Claim Source Retrieval)

This notebook illustrates the performance of a sparse retrieval baseline model for subtask 4b. It includes the following:
- Code to upload data, including:
    - code to upload the collection set (CORD-19 academic papers' metadata)
    - code to upload the query set (tweets with implicit references to CORD-19 papers)
- Code to run a baseline retrieval model (BM25)
- Code to evaluate the baseline model

# 1) Importing data

In [None]:
import numpy as np
import pandas as pd

import ast
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from numpy.random import RandomState
import random
# !pip install rank_bm25
from rank_bm25 import BM25Okapi
import spacy
from sklearn.metrics import ndcg_score

In [2]:
# Set random seeds for reproducibility
random.seed(42)
np.random.seed(42)

## 1.a) Import the collection set
The collection set contains metadata of CORD-19 academic papers.

The preprocessed and filtered CORD-19 dataset is available on the Gitlab repository here: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b

In [None]:
# 1) Download the collection set from the Gitlab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b
# 2) Drag and drop the downloaded file to the "data" section (left vertical menu on Colab)
# 3) Modify the path to your local file path
PATH_COLLECTION_DATA = '../../data/subtask4b_collection_data.pkl'


In [4]:
df_collection = pd.read_pickle(PATH_COLLECTION_DATA)

In [5]:
df_collection.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7718 entries, 162 to 1056448
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   cord_uid          7718 non-null   object        
 1   source_x          7718 non-null   object        
 2   title             7718 non-null   object        
 3   doi               7677 non-null   object        
 4   pmcid             4959 non-null   object        
 5   pubmed_id         6233 non-null   object        
 6   license           7718 non-null   object        
 7   abstract          7718 non-null   object        
 8   publish_time      7715 non-null   object        
 9   authors           7674 non-null   object        
 10  journal           6668 non-null   object        
 11  mag_id            0 non-null      float64       
 12  who_covidence_id  528 non-null    object        
 13  arxiv_id          20 non-null     object        
 14  label             7718 n

In [6]:
df_collection.head()

Unnamed: 0,cord_uid,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,label,time,timet
162,umvrwgaw,PMC,Professional and Home-Made Face Masks Reduce E...,10.1371/journal.pone.0002618,PMC2440799,18612429,cc-by,BACKGROUND: Governments are preparing for a po...,2008-07-09,"van der Sande, Marianne; Teunis, Peter; Sabel,...",PLoS One,,,,umvrwgaw,2008-07-09,1215561600
611,spiud6ok,PMC,The Failure of R (0),10.1155/2011/527610,PMC3157160,21860658,cc-by,"The basic reproductive ratio, R (0), is one of...",2011-08-16,"Li, Jing; Blakeley, Daniel; Smith?, Robert J.",Comput Math Methods Med,,,,spiud6ok,2011-08-16,1313452800
918,aclzp3iy,PMC,Pulmonary sequelae in a patient recovered from...,10.4103/0970-2113.99118,PMC3424870,22919170,cc-by-nc-sa,The pandemic of swine flu (H1N1) influenza spr...,2012,"Singh, Virendra; Sharma, Bharat Bhushan; Patel...",Lung India,,,,aclzp3iy,2012-01-01,1325376000
993,ycxyn2a2,PMC,What was the primary mode of smallpox transmis...,10.3389/fcimb.2012.00150,PMC3509329,23226686,cc-by,The mode of infection transmission has profoun...,2012-11-29,"Milton, Donald K.",Front Cell Infect Microbiol,,,,ycxyn2a2,2012-11-29,1354147200
1053,zxe95qy9,PMC,"Lessons from the History of Quarantine, from P...",10.3201/eid1902.120312,PMC3559034,23343512,no-cc,"In the new millennium, the centuries-old strat...",2013-02-03,"Tognotti, Eugenia",Emerg Infect Dis,,,,zxe95qy9,2013-02-03,1359849600


## 1.b) Import the query set

The query set contains tweets with implicit references to academic papers from the collection set.

The preprocessed query set is available on the Gitlab repository here: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b

In [None]:
# 1) Download the query tweets from the Gitlab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b?ref_type=heads
# 2) Drag and drop the downloaded file to the "Files" section (left vertical menu on Colab)
# 3) Modify the path to your local file path
PATH_QUERY_TRAIN_DATA = '../../data/subtask4b_query_tweets_train.tsv'
PATH_QUERY_DEV_DATA = '../../data/subtask4b_query_tweets_dev.tsv'

In [8]:
df_query_train = pd.read_csv(PATH_QUERY_TRAIN_DATA, sep = '\t')
df_query_dev = pd.read_csv(PATH_QUERY_DEV_DATA, sep = '\t')

In [9]:
df_query_dev.head()

Unnamed: 0,post_id,tweet_text,cord_uid
0,16,covid recovery: this study from the usa reveal...,3qvh482o
1,69,"""Among 139 clients exposed to two symptomatic ...",r58aohnu
2,73,I recall early on reading that researchers who...,sts48u9i
3,93,You know you're credible when NIH website has ...,3sr2exq9
4,96,Resistance to antifungal medications is a grow...,ybwwmyqy


In [10]:
df_query_train.head()

Unnamed: 0,post_id,tweet_text,cord_uid
0,0,Oral care in rehabilitation medicine: oral vul...,htlvpvz5
1,1,this study isn't receiving sufficient attentio...,4kfl29ul
2,2,"thanks, xi jinping. a reminder that this study...",jtwb17u8
3,3,Taiwan - a population of 23 million has had ju...,0w9k8iy1
4,4,Obtaining a diagnosis of autism in lower incom...,tiqksd69


In [11]:
df_query_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12853 entries, 0 to 12852
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   post_id     12853 non-null  int64 
 1   tweet_text  12853 non-null  object
 2   cord_uid    12853 non-null  object
dtypes: int64(1), object(2)
memory usage: 301.4+ KB


In [12]:
df_query_dev.head()

Unnamed: 0,post_id,tweet_text,cord_uid
0,16,covid recovery: this study from the usa reveal...,3qvh482o
1,69,"""Among 139 clients exposed to two symptomatic ...",r58aohnu
2,73,I recall early on reading that researchers who...,sts48u9i
3,93,You know you're credible when NIH website has ...,3sr2exq9
4,96,Resistance to antifungal medications is a grow...,ybwwmyqy


In [13]:
df_query_dev.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1400 entries, 0 to 1399
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   post_id     1400 non-null   int64 
 1   tweet_text  1400 non-null   object
 2   cord_uid    1400 non-null   object
dtypes: int64(1), object(2)
memory usage: 32.9+ KB


# 2) Running the baseline
The following code runs a BM25 and TF-IDF baseline.


### Theoretical Fundamentals of Sparse Retrieval Baseline BM25

**BM25 (Okapi)**  
$$
\mathrm{score}(q,d)
= \sum_{t \in T_d \cap T_q}
\frac{tf_{t,d}}
     {\underbrace{k_1\Bigl((1-b) + b\,\frac{dl_d}{\mathrm{avgdl}}\Bigr) + tf_{t,d}}_{\text{TF saturation + length norm}}}
\;\times\;
\underbrace{\log\!\frac{\lvert D\rvert - df_t + 0.5}{df_t + 0.5}}_{\text{RSJ IDF}}
$$

where:

- $tf_{t,d}$: frequency of term $t$ in document $d$.  
- $dl_d$: length of $d$ in tokens; $\mathrm{avgdl}$: average document length in the collection.  
- $df_t$: number of documents containing term $t$; $\lvert D\rvert$: total number of documents.  
- $k_1 > 0$: term‐frequency saturation parameter (larger $k_1 \!\to\!$ slower saturation).  
- $b \in [0,1]$: length‐normalisation parameter ($b=0$ disables, $b=1$ full normalisation).
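
As a minimal sketch of the formula above, the following pure-Python function scores one document against a query on a toy corpus. The function name `bm25_score` and the toy documents are illustrative assumptions, not part of the notebook. The actual baseline uses `rank_bm25.BM25Okapi`, whose scores may differ in detail (for example in how it scales the term-frequency part and treats very frequent terms).

```python
import math

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a tokenized query (toy sketch)."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    # Sum over terms that occur in both the query and the document
    for t in set(query_tokens) & set(doc_tokens):
        tf = doc_tokens.count(t)                      # tf_{t,d}
        df = sum(1 for d in corpus if t in d)         # df_t
        idf = math.log((n_docs - df + 0.5) / (df + 0.5))
        denom = k1 * ((1 - b) + b * len(doc_tokens) / avgdl) + tf
        score += tf / denom * idf
    return score

# Hypothetical, already-preprocessed documents
toy_corpus = [
    ["face", "mask", "reduce", "exposure"],
    ["quarantine", "history", "plague"],
    ["smallpox", "transmission", "mode"],
    ["influenza", "pandemic", "lung"],
    ["antifungal", "resistance", "growth"],
]
query = ["mask", "exposure"]
scores = [bm25_score(query, d, toy_corpus) for d in toy_corpus]
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores)
```

Note that this IDF becomes negative for terms that occur in more than half of the documents, which is one reason library implementations often adjust it.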


---

### MRR@k (Mean Reciprocal Rank at cutoff *k*)

Let *r* be the rank (1-indexed) of the first relevant document for a given query. Then  
$$
\mathrm{MRR}@k \;=\;
\frac{1}{|Q|}\sum_{q\in Q}
\begin{cases}
\dfrac{1}{r}, & r \le k,\\[6pt]
0,            & r > k.
\end{cases}
$$

- **MRR@1**: only counts if the top result is relevant (score = 1 or 0).  
- **MRR@5**: uses reciprocal ranks (1, 1/2, 1/3, 1/4, 1/5) for positions 1…5; 0 beyond.  
- **MRR@10**: uses reciprocal ranks (1, 1/2, …, 1/10) for positions 1…10.  
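
The definition can be checked on a handful of made-up ranks. This is a small sketch: the helper name `mrr_at_k` and the convention of using `None` for a relevant document that was not retrieved are assumptions for illustration only.

```python
def mrr_at_k(ranks, k):
    """ranks: 1-indexed rank of the relevant document per query, or None if not retrieved."""
    total = 0.0
    for r in ranks:
        if r is not None and r <= k:
            total += 1.0 / r   # reciprocal rank counts only within the cutoff
    return total / len(ranks)

ranks = [1, 3, None, 7]        # four hypothetical queries
mrr_1 = mrr_at_k(ranks, 1)     # (1 + 0 + 0 + 0) / 4
mrr_5 = mrr_at_k(ranks, 5)     # (1 + 1/3 + 0 + 0) / 4
mrr_10 = mrr_at_k(ranks, 10)   # (1 + 1/3 + 0 + 1/7) / 4
print(mrr_1, mrr_5, mrr_10)
```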


**BM25 Baseline Pipeline Overview**



**Preprocessing**  
- Applies spaCy lemmatization and lowercasing  
- Keeps alphabetic (`tok.is_alpha`) and numeric (`tok.like_num`) tokens  
- Removes stopwords (`not tok.is_stop`)  
- Ensures tokens like “studies” → “study” and retains numbers (e.g. “2021”)

**Corpus Creation**  
- Concatenates each document’s `title` + `abstract` into one string → `corpus`  
- Extracts `cord_uids` for mapping results back to documents  

**Tokenization**  
- Transforms every entry in `corpus` via `preprocess()` → `tokenized_corpus` (list of token lists)  

**Index Initialization**  
- Builds the BM25 index:  
  ```python
  bm25 = BM25Okapi(tokenized_corpus)
  ```


In [14]:
# Load the small English model, disabling the parser and named-entity recognizer to save memory
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

# def preprocess(text):
#     """
#     Tokenize input text, lemmatize each token, lowercase it,
#     and filter out non-alphabetic tokens and stopwords.
#     """
#     doc = nlp(text)
#     return [
#         tok.lemma_.lower() 
#         for tok in doc 
#         if tok.is_alpha and not tok.is_stop
#     ]

def preprocess(text):
    """
    Tokenize input text, lemmatize each token, lowercase it,
    and filter out tokens that are neither pure words nor numbers,
    plus remove stopwords.
    """
    doc = nlp(text)
    tokens = []
    for tok in doc:
        # keep tokens that are alphabetic OR numeric, but not stopwords
        if (tok.is_alpha or tok.like_num) and not tok.is_stop:
            tokens.append(tok.lemma_.lower())
    return tokens



# 2) Create the corpus (unchanged)
#    - Combine title and abstract into one string per document
corpus = (
    df_collection[['title', 'abstract']]
    .apply(lambda x: f"{x['title']} {x['abstract']}", axis=1)
    .tolist()
)

#    - Keep track of each document’s unique ID
cord_uids = df_collection['cord_uid'].tolist()


# 3) Tokenize with lemmatization & initialize BM25
#    - Apply our preprocess() to every document
tokenized_corpus = [preprocess(doc) for doc in corpus]

#    - Build the BM25 index over the lemmatized tokens
bm25 = BM25Okapi(tokenized_corpus)


Including authors and journal in the corpus (commented-out variant below) leads to no improvement in MRR@5.

In [15]:
# # Load the small English model, disabling the parser and named-entity recognizer to save memory
# nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

# def preprocess(text):
#     """
#     Tokenize input text, lemmatize each token, lowercase it,
#     and filter out non-alphabetic tokens and stopwords.
#     """
#     doc = nlp(text)
#     return [
#         tok.lemma_.lower() 
#         for tok in doc 
#         if tok.is_alpha and not tok.is_stop
#     ]



# # 2) Corpus creation with title, abstract, authors & journal
# def combine_fields(row):
#     # If authors is given as a list, join it into a single string
#     authors = ", ".join(row['authors']) if isinstance(row['authors'], list) else str(row['authors'])
#     journal = str(row['journal']) if not pd.isna(row['journal']) else ""
#     return f"{row['title']} {row['abstract']} {authors} {journal}"

# corpus = (
#     df_collection[['title','abstract','authors','journal']]
#     .apply(combine_fields, axis=1)
#     .tolist()
# )

# cord_uids = df_collection['cord_uid'].tolist()

# # 3) Tokenization & BM25 initialization remain unchanged
# tokenized_corpus = [preprocess(doc) for doc in corpus]
# bm25 = BM25Okapi(tokenized_corpus)


A global dictionary `_text2bm25top` caches mappings from queries to their top‑k document IDs.  
`get_top_cord_uids(query, k=10)` checks `_text2bm25top`; on a cache miss it calls `preprocess(query)`, obtains BM25 scores via `bm25.get_scores()`, selects the top‑k indices with `np.argsort(-scores)[:k]`, looks up the corresponding `cord_uids`, stores the result in `_text2bm25top`, and returns the list of IDs.  


In [16]:
# Global cache: (query text, k) -> top-k document IDs
_text2bm25top = {}

def get_top_cord_uids(query, k=10):
    # Return the cached result if this (query, k) pair was seen before
    key = (query, k)
    if key in _text2bm25top:
        return _text2bm25top[key]
    # Tokenize with lemmatization
    tokenized_query = preprocess(query)
    # Score calculation
    doc_scores = bm25.get_scores(tokenized_query)
    indices    = np.argsort(-doc_scores)[:k]
    topk_uids  = [cord_uids[i] for i in indices]
    # Fill cache
    _text2bm25top[key] = topk_uids
    return topk_uids


In [None]:
# train and dev queries
df_query_train['bm25_topk'] = df_query_train['tweet_text'] \
                                .apply(lambda txt: get_top_cord_uids(txt, k=10))
df_query_dev  ['bm25_topk'] = df_query_dev  ['tweet_text'] \
                                .apply(lambda txt: get_top_cord_uids(txt, k=10))

# 3) Evaluating the baseline
The following code evaluates the BM25 retrieval baseline on the train, dev and test query sets using Mean Reciprocal Rank (MRR@1, MRR@5, MRR@10) and Recall (Recall@5, Recall@10).

In [18]:
def compute_core_metrics(data, col_gold, col_pred, k=10):
    """
    Compute MRR@{1,5,10} and Recall@{5,10}.
    Any not-found document receives rank k+1, so it is never counted as a hit
    as long as k >= 10 (the largest cutoff used below).
    """
    ranks = data.apply(
        lambda row:
            (row[col_pred].index(row[col_gold]) + 1)
            if row[col_gold] in row[col_pred]
            else k + 1,       # ← always > k, so never a false hit
        axis=1
    ).to_numpy()

    mrr1  = np.mean([1.0/r if r <= 1  else 0.0 for r in ranks])
    mrr5  = np.mean([1.0/r if r <= 5  else 0.0 for r in ranks])
    mrr10 = np.mean([1.0/r if r <= 10 else 0.0 for r in ranks])

    recall5  = float((ranks <= 5 ).mean())
    recall10 = float((ranks <= 10).mean())

    return {
        "MRR@1":  mrr1,
        "MRR@5":  mrr5,
        "MRR@10": mrr10,
        "Recall@5":  recall5,
        "Recall@10": recall10
    }


# Usage
results_train = compute_core_metrics(df_query_train, 'cord_uid', 'bm25_topk')
results_dev   = compute_core_metrics(df_query_dev,   'cord_uid', 'bm25_topk')

print("Train:", results_train)
print("Dev:  ", results_dev)


Train: {'MRR@1': np.float64(0.5584688399595426), 'MRR@5': np.float64(0.610892398661791), 'MRR@10': np.float64(0.6170860919382666), 'Recall@5': 0.6933011748229985, 'Recall@10': 0.7395938691356103}
Dev:   {'MRR@1': np.float64(0.565), 'MRR@5': np.float64(0.6157261904761905), 'MRR@10': np.float64(0.6220745464852608), 'Recall@5': 0.6985714285714286, 'Recall@10': 0.7478571428571429}


Predict on gold label test set and compute the metrics

In [None]:
# Load  test set
df_query_test = pd.read_csv(
    "../../data/subtask4b_query_tweets_test_gold.tsv", 
    sep="\t", 
    dtype={"post_id": str, "tweet_text": str, "cord_uid": str}
)

# Retrieve BM25 top-10 for each test tweet
df_query_test["bm25_topk"] = df_query_test["tweet_text"] \
    .apply(lambda txt: get_top_cord_uids(txt, k=10))

# Compute core metrics on test
results_test = compute_core_metrics(
    df_query_test,
    col_gold="cord_uid",
    col_pred="bm25_topk"
)

print("Test:", results_test)

Test: {'MRR@1': np.float64(0.4384508990318119), 'MRR@5': np.float64(0.502524204702628), 'MRR@10': np.float64(0.509665415267075), 'Recall@5': 0.5982019363762102, 'Recall@10': 0.6507607192254495}


Export Top 10 of BM25

In [20]:
# # TRAIN
# # Take top 10
# df_query_train['preds'] = df_query_train['bm25_topk'].apply(lambda x: x[:10])

# # Export
# df_query_train[['post_id', 'preds']].to_csv('../predictions/predictions_BM25_Pre_Processed_train_TOP10.tsv', index=None, sep='\t')

In [21]:
# # DEV
# # Take top 10
# df_query_dev['preds'] = df_query_dev['bm25_topk'].apply(lambda x: x[:10])

# # Export
# df_query_dev[['post_id', 'preds']].to_csv('../predictions/predictions_BM25_Pre_Processed_dev_TOP10.tsv', index=None, sep='\t')


In [22]:
# # TEST
# # Take top 10
# df_query_test['preds'] = df_query_test['bm25_topk'].apply(lambda x: x[:10])

# # Export
# df_query_test[['post_id', 'preds']].to_csv('../predictions/predictions_BM25_Pre_Processed_test_TOP10.tsv', index=None, sep='\t')


-------------------------------------

### Supplementary baseline models: BM25F and TF-IDF 
_Note: BM25F and TF-IDF were not part of the official CLEF submission; they are provided here for illustrative/experimental purposes._


This section extends the standard BM25 model to BM25F by treating the document's **title** and **abstract** as separate fields with different importance.  
Each field is preprocessed (tokenized, lemmatized, lowercased, stopwords removed) using spaCy.  
The BM25F model combines term frequencies across fields, applying field-specific weights and normalization to better model structured documents.

- **Title weight**: 2.0
- **Abstract weight**: 1.0
- **Length normalization** applied per field
- **IDF** computed across all fields combined

During retrieval, queries are preprocessed similarly, and documents are ranked based on their BM25F scores.  
Caching is used for faster repeated query lookup.

This setup does not improve retrieval scores substantially compared to BM25: in the outputs recorded below, dev MRR@5 is 0.627 versus 0.616 for BM25.
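
To make the field weighting concrete, here is a small sketch of the per-term BM25F contribution for a single document, using the same normalisation and saturation steps as the class defined below. The standalone function `bm25f_term_score` and the example lengths are hypothetical and only serve to show why a title match outweighs an abstract match under the weights listed above.

```python
def bm25f_term_score(tf_title, tf_abstract, len_title, len_abstract,
                     avg_title, avg_abstract, idf,
                     k1=1.5, b_title=0.75, b_abstract=0.75,
                     w_title=2.0, w_abstract=1.0):
    """Contribution of one query term to one document's BM25F score (sketch)."""
    # Length-normalise the term frequency separately in each field
    norm_title = tf_title / (1 - b_title + b_title * len_title / avg_title)
    norm_abstract = tf_abstract / (1 - b_abstract + b_abstract * len_abstract / avg_abstract)
    # Combine the fields with their weights, then apply saturation once
    tf_combined = w_title * norm_title + w_abstract * norm_abstract
    return idf * tf_combined * (k1 + 1) / (tf_combined + k1)

# A document of average length in both fields; the term occurs once
in_title = bm25f_term_score(1, 0, 10, 200, 10, 200, idf=1.0)
in_abstract = bm25f_term_score(0, 1, 10, 200, 10, 200, idf=1.0)
print(in_title, in_abstract)
```

Because saturation is applied after the fields are combined, doubling the title weight raises the score by less than a factor of two.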

In [None]:
# Load the small English model
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def preprocess(text):
    """
    Tokenize input text, lemmatize each token, lowercase it,
    and filter out tokens that are neither pure words nor numbers,
    plus remove stopwords.
    """
    doc = nlp(text)
    tokens = []
    for tok in doc:
        if (tok.is_alpha or tok.like_num) and not tok.is_stop:
            tokens.append(tok.lemma_.lower())
    return tokens

# 1) Create two separate corpora: one for title and one for abstract
title_corpus = df_collection['title'].tolist()
abstract_corpus = df_collection['abstract'].tolist()
cord_uids = df_collection['cord_uid'].tolist()

# 2) Tokenize separately
tokenized_title_corpus = [preprocess(doc) for doc in title_corpus]
tokenized_abstract_corpus = [preprocess(doc) for doc in abstract_corpus]

# 3) Calculate average field lengths
avg_title_len = np.mean([len(doc) for doc in tokenized_title_corpus])
avg_abstract_len = np.mean([len(doc) for doc in tokenized_abstract_corpus])

# 4) Build BM25F class
class BM25F:
    def __init__(self, tokenized_titles, tokenized_abstracts, k1=1.5, b_title=0.75, b_abstract=0.75, w_title=2.0, w_abstract=1.0):
        self.tokenized_titles = tokenized_titles
        self.tokenized_abstracts = tokenized_abstracts
        self.k1 = k1
        self.b_title = b_title
        self.b_abstract = b_abstract
        self.w_title = w_title
        self.w_abstract = w_abstract
        
        # Inverted index and stats
        self.doc_count = len(tokenized_titles)
        self.inverted_index_title = self._build_inverted_index(tokenized_titles)
        self.inverted_index_abstract = self._build_inverted_index(tokenized_abstracts)
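        # Average field lengths, computed from the inputs rather than read from globals
        self.avg_title_len = np.mean([len(doc) for doc in tokenized_titles])
        self.avg_abstract_len = np.mean([len(doc) for doc in tokenized_abstracts])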
    
    def _build_inverted_index(self, corpus):
        inverted = {}
        for doc_id, tokens in enumerate(corpus):
            for token in tokens:
                if token not in inverted:
                    inverted[token] = []
                inverted[token].append(doc_id)
        return inverted

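    def _idf(self, token):
        # Document frequency per field; set() removes repeated doc IDs within a field
        df_title = len(set(self.inverted_index_title.get(token, [])))
        df_abstract = len(set(self.inverted_index_abstract.get(token, [])))
        # Note: summing the two counts a document twice when the token occurs in both
        # its title and its abstract, so df_total can exceed the true document frequency
        df_total = df_title + df_abstract
        if df_total == 0:
            return 0
        return np.log((self.doc_count - df_total + 0.5) / (df_total + 0.5) + 1)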

    def get_scores(self, query_tokens):
        scores = np.zeros(self.doc_count)
        for token in query_tokens:
            idf = self._idf(token)
            for doc_id in range(self.doc_count):
                tf_title = self.tokenized_titles[doc_id].count(token)
                tf_abstract = self.tokenized_abstracts[doc_id].count(token)
                
                len_title = len(self.tokenized_titles[doc_id])
                len_abstract = len(self.tokenized_abstracts[doc_id])
                
                norm_title = tf_title / (1 - self.b_title + self.b_title * (len_title / self.avg_title_len)) if len_title else 0
                norm_abstract = tf_abstract / (1 - self.b_abstract + self.b_abstract * (len_abstract / self.avg_abstract_len)) if len_abstract else 0
                
                tf_combined = self.w_title * norm_title + self.w_abstract * norm_abstract
                
                scores[doc_id] += idf * (tf_combined * (self.k1 + 1)) / (tf_combined + self.k1) if tf_combined > 0 else 0
        return scores

# 5) Initialize BM25F
bm25f = BM25F(tokenized_title_corpus, tokenized_abstract_corpus)

# 6) Update the retrieval function
_text2bm25ftop = {}

def get_top_cord_uids(query, k=5):
    # Cache on (query, k) so that a call with a different k is not served a stale list
    key = (query, k)
    if key in _text2bm25ftop:
        return _text2bm25ftop[key]
    tokenized_query = preprocess(query)
    doc_scores = bm25f.get_scores(tokenized_query)
    indices = np.argsort(-doc_scores)[:k]
    topk_uids = [cord_uids[i] for i in indices]
    _text2bm25ftop[key] = topk_uids
    return topk_uids

In [None]:
# Assign BM25F top-k predictions
df_query_train['bm25f_topk'] = df_query_train['tweet_text'].apply(lambda txt: get_top_cord_uids(txt, k=10))
df_query_dev['bm25f_topk']   = df_query_dev['tweet_text'].apply(lambda txt: get_top_cord_uids(txt, k=10))

In [None]:
# Evaluate retrieved candidates using MRR@k
def get_performance_mrr(data, col_gold, col_pred, list_k=(1, 5, 10)):
    """Return {k: MRR@k}. Note: adds a helper column "in_topx" to `data`."""
    d_performance = {}
    for k in list_k:
        # Reciprocal rank of the gold ID within the top-k predictions, 0 if absent
        data["in_topx"] = data.apply(
            lambda x: 1 / (list(x[col_pred][:k]).index(x[col_gold]) + 1)
            if x[col_gold] in x[col_pred][:k] else 0,
            axis=1,
        )
        d_performance[k] = data["in_topx"].mean()
    return d_performance

In [None]:
# Calculate and print MRR for BM25F retrieval
results_train = get_performance_mrr(df_query_train, 'cord_uid', 'bm25f_topk')
results_dev   = get_performance_mrr(df_query_dev,   'cord_uid', 'bm25f_topk')

print(f"Results on train (BM25F): {results_train}")
print(f"Results on dev (BM25F): {results_dev}")


Results on train (BM25F): {1: np.float64(0.5724733525247024), 5: np.float64(0.6222879224046266), 10: np.float64(0.6282211613865702), 20: np.float64(0.6282211613865702), 50: np.float64(0.6282211613865702), 100: np.float64(0.6282211613865702)}
Results on dev (BM25F): {1: np.float64(0.5778571428571428), 5: np.float64(0.6268095238095238), 10: np.float64(0.6322145691609976), 20: np.float64(0.6322145691609976), 50: np.float64(0.6322145691609976), 100: np.float64(0.6322145691609976)}


Export Top 10 of BM25F

In [None]:
# # DEV
# # Take top 10
# df_query_dev['preds'] = df_query_dev['bm25f_topk'].apply(lambda x: x[:10])

# # Export
# df_query_dev[['post_id', 'preds']].to_csv('predictions_dev_BM25F_TOP10.tsv', index=None, sep='\t')


In [None]:
# # TRAIN
# # Take top 10
# df_query_train['preds'] = df_query_train['bm25f_topk'].apply(lambda x: x[:10])

# # Export
# df_query_train[['post_id', 'preds']].to_csv('predictions_train_BM25F_TOP10.tsv', index=None, sep='\t')


#### TF–IDF  
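Represents each document as a TF–IDF vector. With the scikit-learn `TfidfVectorizer` defaults used below (`sublinear_tf=False`, `smooth_idf=True`, `norm='l2'`), each term’s weight is  
$$
\mathrm{tf}_{t,d}\;\times\;\Bigl(\ln\frac{1 + N}{1 + \mathrm{df}_{t}} + 1\Bigr)
$$  
and each vector is then scaled to unit L2 length.
- $\mathrm{tf}_{t,d}$: raw count of term *t* in document *d*  
- $\mathrm{df}_{t}$: number of documents containing *t*  
- $N$: total number of documents  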

A tweet is vectorized in the same way, and we rank all documents by cosine similarity against the tweet vector.
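
The ranking step can be sketched without any library. The example below assumes the smoothed weighting that scikit-learn's `TfidfVectorizer` applies by default, namely tf × (ln((1 + N) / (1 + df)) + 1) followed by L2 normalisation. The helper names and the toy documents are hypothetical. On unit-length vectors the dot product equals the cosine similarity, which is why `linear_kernel` is sufficient in the cell below.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return unit-length TF-IDF vectors for tokenized docs plus a vectorizer for new text."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}

    def vectorize(tokens):
        # Terms unseen in the collection are dropped, as a fitted vocabulary would do
        vec = {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    return [vectorize(d) for d in docs], vectorize

def cosine(u, v):
    # Dot product of two unit-length sparse vectors
    return sum(w * v.get(t, 0.0) for t, w in u.items())

docs = [["mask", "reduce", "exposure"], ["quarantine", "history"], ["mask", "history"]]
doc_vecs, vectorize = tfidf_vectors(docs)
query_vec = vectorize("mask exposure study".split())
sims = [cosine(query_vec, dv) for dv in doc_vecs]
ranking = sorted(range(len(docs)), key=lambda i: -sims[i])
print(ranking, sims)
```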


In [None]:
# 1. Load collection
collection = pd.read_pickle('../../data/subtask4b_collection_data.pkl')
collection['text'] = collection['title'].fillna('') + ' ' + collection['abstract'].fillna('')
docs = collection['text'].tolist()
doc_ids = collection['cord_uid'].tolist()

vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,2), max_df=0.8, min_df=2)
doc_tfidf = vectorizer.fit_transform(docs)

# 2. Helper to generate top-5 run (only five candidates are kept, so MRR@10 equals MRR@5 below)
def get_tfidf_run(df):
    rows=[]
    for pid, text in zip(df['post_id'], df['tweet_text']):
        sims = linear_kernel(vectorizer.transform([text]), doc_tfidf).flatten()
        top5 = np.argsort(sims)[::-1][:5]
        rows.append({'post_id': pid, 'preds': [doc_ids[i] for i in top5]})
    return pd.DataFrame(rows)

# 3. Generate train+dev runs
df_train_queries = pd.read_csv('../../data/subtask4b_query_tweets_train.tsv', sep='\t')
df_dev_queries   = pd.read_csv('../../data/subtask4b_query_tweets_dev.tsv',   sep='\t')

train_run = get_tfidf_run(df_train_queries)
dev_run   = get_tfidf_run(df_dev_queries)

train_run.to_csv('tfidf_train_run.tsv', sep='\t', index=False)
dev_run  .to_csv('tfidf_dev_run.tsv',   sep='\t', index=False)

# 4. Merge and parse preds
df_train = df_train_queries.merge(train_run, on='post_id')
df_train['tfidf_topk'] = df_train['preds']

df_dev   = df_dev_queries  .merge(dev_run,   on='post_id')
df_dev['tfidf_topk']  = df_dev  ['preds']

# 5. Evaluate
results_train = get_performance_mrr(df_train, 'cord_uid', 'tfidf_topk')
results_dev   = get_performance_mrr(df_dev,   'cord_uid', 'tfidf_topk')

print(f"TF–IDF Results on train set: {results_train}")
print(f"TF–IDF Results on dev   set: {results_dev}")


TF–IDF Results on train set: {1: np.float64(0.4935034622267175), 5: np.float64(0.5555434528903758), 10: np.float64(0.5555434528903758)}
TF–IDF Results on dev   set: {1: np.float64(0.49357142857142855), 5: np.float64(0.5549761904761905), 10: np.float64(0.5549761904761905)}


### Additional Hyperparameter Tuning for BM25

In [None]:
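# Tokenize (lowercase + whitespace split; no lemmatization or stopword removal)
def tokenize(text):
    return text.lower().split()

# Tokenize the corpus (abstracts only, unlike the title + abstract corpus used above)
corpus = df_collection['abstract'].fillna("").apply(tokenize).tolist()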

# Set BM25 parameters
k1_values = [0.8, 1.2, 1.5, 2.0]
b_values = [0.3, 0.5, 0.75, 0.9]

# Prepare queries and relevant documents
queries = df_query_dev['tweet_text'].fillna("").tolist()
relevant_docs = df_query_dev['cord_uid'].tolist()

# Initialize results list
results = []

for k1 in k1_values:
    for b in b_values:
        bm25 = BM25Okapi(corpus, k1=k1, b=b)

        mrr_total = 0
        for idx, query in enumerate(queries):
            tokenized_query = tokenize(query)
            scores = bm25.get_scores(tokenized_query)
            ranked_indices = np.argsort(scores)[::-1]
            ranked_cord_uids = df_collection.iloc[ranked_indices]['cord_uid'].tolist()

            # Calculate MRR@5
            try:
                rank = ranked_cord_uids[:5].index(relevant_docs[idx]) + 1
                mrr_total += 1 / rank
            except ValueError:
                continue  # Relevant document not in the top 5: contributes 0 to the sum

        mrr_5 = mrr_total / len(queries)
        results.append(((k1, b), mrr_5))
        print(f"k1={k1}, b={b} → MRR@5: {mrr_5:.4f}")


# default BM25Okapi(corpus, k1=1.5, b=0.75)

k1=0.8, b=0.3 → MRR@5: 0.5220
k1=0.8, b=0.5 → MRR@5: 0.5324
k1=0.8, b=0.75 → MRR@5: 0.5406
k1=0.8, b=0.9 → MRR@5: 0.5401
k1=1.2, b=0.3 → MRR@5: 0.5191
k1=1.2, b=0.5 → MRR@5: 0.5325
k1=1.2, b=0.75 → MRR@5: 0.5399
k1=1.2, b=0.9 → MRR@5: 0.5393
k1=1.5, b=0.3 → MRR@5: 0.5139
k1=1.5, b=0.5 → MRR@5: 0.5286
k1=1.5, b=0.75 → MRR@5: 0.5353
k1=1.5, b=0.9 → MRR@5: 0.5354
k1=2.0, b=0.3 → MRR@5: 0.5069
k1=2.0, b=0.5 → MRR@5: 0.5214
k1=2.0, b=0.75 → MRR@5: 0.5309
k1=2.0, b=0.9 → MRR@5: 0.5305
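
As a small follow-up sketch, the best configuration can be read off programmatically. The dictionary below is copied by hand from the printed output above, and the variable names are illustrative. Because this grid search uses abstracts only and plain whitespace tokenization, its MRR@5 values are not directly comparable to the lemmatized title + abstract baseline earlier in the notebook.

```python
# (k1, b) -> dev MRR@5, transcribed from the output above
grid = {
    (0.8, 0.3): 0.5220, (0.8, 0.5): 0.5324, (0.8, 0.75): 0.5406, (0.8, 0.9): 0.5401,
    (1.2, 0.3): 0.5191, (1.2, 0.5): 0.5325, (1.2, 0.75): 0.5399, (1.2, 0.9): 0.5393,
    (1.5, 0.3): 0.5139, (1.5, 0.5): 0.5286, (1.5, 0.75): 0.5353, (1.5, 0.9): 0.5354,
    (2.0, 0.3): 0.5069, (2.0, 0.5): 0.5214, (2.0, 0.75): 0.5309, (2.0, 0.9): 0.5305,
}

# Pick the parameter pair with the highest MRR@5
best_params, best_mrr = max(grid.items(), key=lambda kv: kv[1])
# Spread between the best and worst setting in the grid
spread = best_mrr - min(grid.values())
print(best_params, best_mrr, round(spread, 4))
```

In the notebook itself the same selection could be applied to the `results` list with `max(results, key=lambda r: r[1])`.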
