# Getting started

### CLEF 2025 - CheckThat! Lab  - Task 4 Scientific Web Discourse - Subtask 4b (Scientific Claim Source Retrieval)

This notebook enables participants of subtask 4b to quickly get started. It includes the following:
- Code to upload data, including:
    - code to upload the collection set (CORD-19 academic papers' metadata)
    - code to upload the query set (tweets with implicit references to CORD-19 papers)
- Code to run a baseline retrieval model (BM25)
- Code to evaluate the baseline model

Participants are free to use this notebook and add their own models for the competition.

In [1]:
import numpy as np
import pandas as pd
from tqdm import tqdm

## 1.a) Import the collection set
The collection set contains metadata of CORD-19 academic papers.

The preprocessed and filtered CORD-19 dataset is available on the Gitlab repository here: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b

Participants should first download the file then upload it on the Google Colab session with the following steps.


In [2]:
# 1) Download the collection set from the Gitlab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b
# 2) Drag and drop the downloaded file to the "Files" section (left vertical menu on Colab)
# 3) Modify the path to your local file path
PATH_COLLECTION_DATA = '/subtask4b_collection_data.pkl' #MODIFY PATH

In [3]:
df_collection = pd.read_pickle(PATH_COLLECTION_DATA)

In [4]:
df_collection.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7718 entries, 162 to 1056448
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   cord_uid          7718 non-null   object        
 1   source_x          7718 non-null   object        
 2   title             7718 non-null   object        
 3   doi               7677 non-null   object        
 4   pmcid             4959 non-null   object        
 5   pubmed_id         6233 non-null   object        
 6   license           7718 non-null   object        
 7   abstract          7718 non-null   object        
 8   publish_time      7715 non-null   object        
 9   authors           7674 non-null   object        
 10  journal           6668 non-null   object        
 11  mag_id            0 non-null      float64       
 12  who_covidence_id  528 non-null    object        
 13  arxiv_id          20 non-null     object        
 14  label             7718 n

In [5]:
df_collection.head()

Unnamed: 0,cord_uid,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,label,time,timet
162,umvrwgaw,PMC,Professional and Home-Made Face Masks Reduce E...,10.1371/journal.pone.0002618,PMC2440799,18612429,cc-by,BACKGROUND: Governments are preparing for a po...,2008-07-09,"van der Sande, Marianne; Teunis, Peter; Sabel,...",PLoS One,,,,umvrwgaw,2008-07-09,1215561600
611,spiud6ok,PMC,The Failure of R (0),10.1155/2011/527610,PMC3157160,21860658,cc-by,"The basic reproductive ratio, R (0), is one of...",2011-08-16,"Li, Jing; Blakeley, Daniel; Smith?, Robert J.",Comput Math Methods Med,,,,spiud6ok,2011-08-16,1313452800
918,aclzp3iy,PMC,Pulmonary sequelae in a patient recovered from...,10.4103/0970-2113.99118,PMC3424870,22919170,cc-by-nc-sa,The pandemic of swine flu (H1N1) influenza spr...,2012,"Singh, Virendra; Sharma, Bharat Bhushan; Patel...",Lung India,,,,aclzp3iy,2012-01-01,1325376000
993,ycxyn2a2,PMC,What was the primary mode of smallpox transmis...,10.3389/fcimb.2012.00150,PMC3509329,23226686,cc-by,The mode of infection transmission has profoun...,2012-11-29,"Milton, Donald K.",Front Cell Infect Microbiol,,,,ycxyn2a2,2012-11-29,1354147200
1053,zxe95qy9,PMC,"Lessons from the History of Quarantine, from P...",10.3201/eid1902.120312,PMC3559034,23343512,no-cc,"In the new millennium, the centuries-old strat...",2013-02-03,"Tognotti, Eugenia",Emerg Infect Dis,,,,zxe95qy9,2013-02-03,1359849600


## 1.b) Import the query set

The query set contains tweets with implicit references to academic papers from the collection set.

The preprocessed query set is available on the Gitlab repository here: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b

Participants should first download the file then upload it on the Google Colab session with the following steps.

In [6]:
# 1) Download the query tweets from the Gitlab repository: https://gitlab.com/checkthat_lab/clef2025-checkthat-lab/-/tree/main/task4/subtask_4b?ref_type=heads
# 2) Drag and drop the downloaded file to the "Files" section (left vertical menu on Colab)
# 3) Modify the path to your local file path
PATH_QUERY_TRAIN_DATA = '/subtask4b_query_tweets_train.tsv'
PATH_QUERY_DEV_DATA = '/subtask4b_query_tweets_dev.tsv'

In [7]:
df_query_train = pd.read_csv(PATH_QUERY_TRAIN_DATA, sep = '\t')
df_query_dev = pd.read_csv(PATH_QUERY_DEV_DATA, sep = '\t')

In [8]:
df_query_train.head()


Unnamed: 0,post_id,tweet_text,cord_uid
0,0,Oral care in rehabilitation medicine: oral vul...,htlvpvz5
1,1,this study isn't receiving sufficient attentio...,4kfl29ul
2,2,"thanks, xi jinping. a reminder that this study...",jtwb17u8
3,3,Taiwan - a population of 23 million has had ju...,0w9k8iy1
4,4,Obtaining a diagnosis of autism in lower incom...,tiqksd69


In [9]:
df_query_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12853 entries, 0 to 12852
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   post_id     12853 non-null  int64 
 1   tweet_text  12853 non-null  object
 2   cord_uid    12853 non-null  object
dtypes: int64(1), object(2)
memory usage: 301.4+ KB


In [10]:
df_query_dev.head()


Unnamed: 0,post_id,tweet_text,cord_uid
0,16,covid recovery: this study from the usa reveal...,3qvh482o
1,69,"""Among 139 clients exposed to two symptomatic ...",r58aohnu
2,73,I recall early on reading that researchers who...,sts48u9i
3,93,You know you're credible when NIH website has ...,3sr2exq9
4,96,Resistance to antifungal medications is a grow...,ybwwmyqy


In [11]:
df_query_dev.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1400 entries, 0 to 1399
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   post_id     1400 non-null   int64 
 1   tweet_text  1400 non-null   object
 2   cord_uid    1400 non-null   object
dtypes: int64(1), object(2)
memory usage: 32.9+ KB


# 2) Running the baseline
The following code runs a BM25 baseline.


In [13]:

from rank_bm25 import BM25Okapi


In [14]:
# Create the BM25 corpus
corpus = df_collection[:][['title', 'abstract']].apply(lambda x: f"{x['title']} {x['abstract']}", axis=1).tolist()
cord_uids = df_collection[:]['cord_uid'].tolist()
tokenized_corpus = [doc.split(' ') for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

In [15]:
def get_top_cord_uids(query):
  text2bm25top = {}
  if query in text2bm25top.keys():
      return text2bm25top[query]
  else:
      tokenized_query = query.split(' ')
      doc_scores = bm25.get_scores(tokenized_query)
      indices = np.argsort(-doc_scores)[:5]
      bm25_topk = [cord_uids[x] for x in indices]

      text2bm25top[query] = bm25_topk
      return bm25_topk


In [16]:
# Retrieve topk candidates using the BM25 model
df_query_train['bm25_topk'] = df_query_train['tweet_text'].apply(lambda x: get_top_cord_uids(x))
df_query_dev['bm25_topk'] = df_query_dev['tweet_text'].apply(lambda x: get_top_cord_uids(x))

## Using Pretrained bert

### Neural Ranker Class: Handles neural network-based candidate ranking

### Initializes the neural ranking model components by
1. Loads pre-trained tokenizer and model from Hugging Face Hub

2. Automatically detects and uses GPU if available

3. Moves model to appropriate device (CPU/GPU)

### Scores query-document pairs using neural model
1. Batch processes candidate documents

2. Handles variable-length texts with smart truncation

3. Optimized for GPU inference

In [None]:
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from tqdm import tqdm

In [None]:
class NeuralRanker:
    def __init__(self, model_name='cross-encoder/ms-marco-MiniLM-L-6-v2'):
        
         # Initialize tokenizer and model from Hugging Face Hub
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)

        # Set device (GPU if available) for faster computation
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)

    def rank_candidates(self, query, candidates):
        """Rerank candidates using neural model"""
        # Create query-document pairs
        pairs = [(query, cand_text) for cand_text in candidates]

        # Tokenize in batch mode
        inputs = self.tokenizer(
            pairs,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors="pt"
        ).to(self.device)

        # Get scores
        with torch.no_grad():
            outputs = self.model(**inputs)

        # Extract scores - CRITICAL FIX HERE
        scores = outputs.logits.squeeze(-1)  # Remove extra dimension
        return scores.cpu().numpy()


## Two-Stage Retriever Class: Combines BM25 and Neural Ranking

Implements hybrid retrieval (BM25 + neural)

Uses caching to avoid redundant document text lookups

Handles missing documents gracefully

In [None]:
class TwoStageRetriever:
    def __init__(self, bm25, df_collection, neural_ranker):
         # Store components and initialize cache
        self.bm25 = bm25
        self.df_collection = df_collection
        self.neural_ranker = neural_ranker
        self.cord_uids = df_collection['cord_uid'].tolist()
        self.text_cache = {}  # Add caching for document texts

    def get_document_text(self, cord_uid):
        """Efficient document text retrieval with caching"""
        """Get document text with caching"""
        if cord_uid not in self.text_cache:
            # Find document in collection
            doc_row = self.df_collection[self.df_collection['cord_uid'] == cord_uid]
            if not doc_row.empty:
                # Combine title and abstract
                self.text_cache[cord_uid] = f"{doc_row.iloc[0]['title']} {doc_row.iloc[0]['abstract']}"
            else:
                self.text_cache[cord_uid] = None
        return self.text_cache[cord_uid]

    def retrieve(self, query, top_k_bm25=50, top_k_final=5):
        """Two-stage retrieval process"""
        # First stage: BM25 retrieval
        tokenized_query = query.split(' ')
        doc_scores = self.bm25.get_scores(tokenized_query)
        bm25_indices = np.argsort(-doc_scores)[:top_k_bm25]
        candidate_uids = [self.cord_uids[i] for i in bm25_indices]

        # Get valid candidate texts
        candidate_data = []
        for uid in candidate_uids:
            text = self.get_document_text(uid)
            if text:
                candidate_data.append((uid, text))

        # Second stage: Neural reranking
        # Return if no candidates found
        if not candidate_data:
            return []

        # Unzip into separate lists
        uids, texts = zip(*candidate_data)
        scores = self.neural_ranker.rank_candidates(query, texts)

        # Combine and sort results
        # Sort by descending scores and return top-k
        sorted_indices = np.argsort(-scores)
        return [(uids[i], scores[i]) for i in sorted_indices[:top_k_final]]

In [20]:
# Initialize components
neural_ranker = NeuralRanker()
retriever = TwoStageRetriever(bm25, df_collection, neural_ranker)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/1.33k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/794 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

### Combined Evaluation and Submission Generation 
Real-time metric tracking

Memory-efficient streaming

Competition-compliant output format

In [None]:
def evaluate_and_generate_submission(df_query, retriever, output_path="/submission.tsv"):
    """Combined evaluation and submission generation"""
    results = []
    submission_data = []

    # Create progress bar with total iterations
    progress_bar = tqdm(total=len(df_query), desc="Processing queries")

    for idx, row in df_query.iterrows():
        # extract query data    
        query = row['tweet_text']
        true_uid = row['cord_uid']
        post_id = row['post_id']

        # Retrieve ranked results
        ranked_results = retriever.retrieve(query)
        predicted_uids = [uid for uid, _ in ranked_results]

        # Store submission data
        top5_uids = predicted_uids[:5]
        submission_data.append({
            'post_id': post_id,
            'preds': str(top5_uids).replace("'", '"')  # JSON-style array
        })

        # Calculate metrics
        match_position = next(
            (i+1 for i, (uid, _) in enumerate(ranked_results) if uid == true_uid),
            None
        )
        reciprocal_rank = 1/match_position if match_position else 0
        # Store the metrics
        results.append({
            'post_id': post_id,
            'true_uid': true_uid,
            'match_position': match_position,
            'reciprocal_rank': reciprocal_rank
        })

        # Update progress bar
        progress_bar.update(1)
        progress_bar.set_postfix({
            'Current MRR': f"{np.mean([r['reciprocal_rank'] for r in results]):.3f}",
            'Top1 Acc': f"{np.mean([1 if r['match_position'] == 1 else 0 for r in results]):.3f}"
        })

    progress_bar.close()

    # Calculate final metrics
    valid_ranks = [res['reciprocal_rank'] for res in results if res['match_position']]
    mrr = np.mean(valid_ranks) if valid_ranks else 0
    top1_acc = np.mean([1 if res.get('match_position') == 1 else 0 for res in results])

    # Save submission file
    submission_df = pd.DataFrame(submission_data)
    submission_df.to_csv(output_path, sep='\t', index=False)

    print(f"\nFinal Metrics:")
    print(f"MRR@5: {mrr:.3f}")
    print(f"Top-1 Accuracy: {top1_acc:.3f}")
    print(f"Submission file saved to {output_path}")

    return results, submission_df

# Run combined evaluation and generation
print("Running combined evaluation and submission generation...")
results, submission_df = evaluate_and_generate_submission(
    df_query_dev,
    retriever,
    "/submission.tsv"
)

Running combined evaluation and submission generation...


Processing queries: 100%|██████████| 1400/1400 [6:31:29<00:00, 16.78s/it, Current MRR=0.601, Top1 Acc=0.556]


Final Metrics:
MRR@5: 0.890
Top-1 Accuracy: 0.556
Submission file saved to submission.tsv



