# Reproducibility package

This is a reproducibility package for the IR project. It is structured as follows.

**Part 1** contains the setup: we index each corpus and split the datasets into training and test data.
- Here we also run the *baseline* experiments.
    
**Part 2** contains the LLM-based query expansion.
- The *query expansion* part contains the QE methods.
- The *experiment* part contains the results of the experiments.

**Part 3** contains the synonym-based query expansion.
- The *query expansion* part contains the QE methods.
- The *experiment* part contains the results of the experiments.
  
**Part 4** contains the ontology-based query expansion.
- The *query expansion* part contains the QE methods.
- The *experiment* part contains the results of the experiments.

**Part 5** contains the Word2Vec query expansion.

**Part 6** contains the pseudo-relevance feedback query expansion.

Both of these are present in `ir_project2.ipynb`.

# Part 1: Setup

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/religious-and-philosophical-texts/35895-0.txt
/kaggle/input/religious-and-philosophical-texts/pg2800.txt
/kaggle/input/religious-and-philosophical-texts/pg2680.txt
/kaggle/input/religious-and-philosophical-texts/pg10.txt
/kaggle/input/religious-and-philosophical-texts/pg17.txt
/kaggle/input/cisi-a-dataset-for-information-retrieval/CISI.REL
/kaggle/input/cisi-a-dataset-for-information-retrieval/CISI.ALL
/kaggle/input/cisi-a-dataset-for-information-retrieval/CISI.QRY


#### Necessary !pip installs

In [3]:
!pip install python-terrier==0.12.1
!pip install owlready2
!pip install numpy==1.24.3
!pip install gensim==4.3.1
!pip install scipy==1.10.1
!pip install -q -U google-genai

Collecting python-terrier==0.12.1
  Downloading python_terrier-0.12.1-py3-none-any.whl.metadata (11 kB)
Collecting ir-datasets>=0.3.2 (from python-terrier==0.12.1)
  Downloading ir_datasets-0.5.10-py3-none-any.whl.metadata (12 kB)
Collecting wget (from python-terrier==0.12.1)
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Collecting pyjnius>=1.4.2 (from python-terrier==0.12.1)
  Downloading pyjnius-1.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting ir-measures>=0.3.1 (from python-terrier==0.12.1)
  Downloading ir_measures-0.3.7-py3-none-any.whl.metadata (7.0 kB)
Collecting pytrec-eval-terrier>=0.5.3 (from python-terrier==0.12.1)
  Downloading pytrec_eval_terrier-0.5.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (984 bytes)
Collecting chest (from python-terrier==0.12.1)
  Downloading chest-0.2.3.tar.gz (9.6 kB)
  Preparing metadata (setup.py) ... done
Collecting inscrip
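
Because the install cell pins several versions, it can help to confirm that the running kernel actually picked them up (already-imported packages may require a kernel restart after `pip install`). The following is a minimal sketch using only the standard library; the helper name `check_pins` is hypothetical, and the pin table simply mirrors the pinned packages of the install cell.

```python
from importlib import metadata

# Pins copied from the install cell; packages installed without a pin are omitted.
PINS = {
    "python-terrier": "0.12.1",
    "numpy": "1.24.3",
    "gensim": "4.3.1",
    "scipy": "1.10.1",
}

def check_pins(pins):
    """Return {package: (expected, installed)} for every mismatch.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches

# Report every package whose installed version differs from its pin.
for name, (expected, installed) in check_pins(PINS).items():
    print(f"{name}: expected {expected}, found {installed}")
```

An empty report means every pinned package is present at the expected version.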

In [7]:
import numpy as np
import pandas as pd
import pyterrier as pt

In [8]:
import os
#%%capture
!curl -O https://download.java.net/java/GA/jdk11/9/GPL/openjdk-11.0.2_linux-x64_bin.tar.gz
!mv openjdk-11.0.2_linux-x64_bin.tar.gz /usr/lib/jvm/; cd /usr/lib/jvm/; tar -zxvf openjdk-11.0.2_linux-x64_bin.tar.gz
!update-alternatives --install /usr/bin/java java /usr/lib/jvm/jdk-11.0.2/bin/java 1
!update-alternatives --set java /usr/lib/jvm/jdk-11.0.2/bin/java
os.environ["JAVA_HOME"] = "/usr/lib/jvm/jdk-11.0.2"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  178M  100  178M    0     0   229M      0 --:--:-- --:--:-- --:--:--  228M
jdk-11.0.2/bin/jaotc
jdk-11.0.2/bin/jar
jdk-11.0.2/bin/jarsigner
jdk-11.0.2/bin/java
jdk-11.0.2/bin/javac
jdk-11.0.2/bin/javadoc
jdk-11.0.2/bin/javap
jdk-11.0.2/bin/jcmd
jdk-11.0.2/bin/jconsole
jdk-11.0.2/bin/jdb
jdk-11.0.2/bin/jdeprscan
jdk-11.0.2/bin/jdeps
jdk-11.0.2/bin/jhsdb
jdk-11.0.2/bin/jimage
jdk-11.0.2/bin/jinfo
jdk-11.0.2/bin/jjs
jdk-11.0.2/bin/jlink
jdk-11.0.2/bin/jmap
jdk-11.0.2/bin/jmod
jdk-11.0.2/bin/jps
jdk-11.0.2/bin/jrunscript
jdk-11.0.2/bin/jshell
jdk-11.0.2/bin/jstack
jdk-11.0.2/bin/jstat
jdk-11.0.2/bin/jstatd
jdk-11.0.2/bin/keytool
jdk-11.0.2/bin/pack200
jdk-11.0.2/bin/rmic
jdk-11.0.2/bin/rmid
jdk-11.0.2/bin/rmiregistry
jdk-11.0.2/bin/serialver
jdk-11.0.2/bin/unpack200
jdk-11.0.2/conf/logging.properties
jdk-11.0.2/conf/management/jmx

## Importing and Indexing Datasets

Datasets used (the indexed fields are configurable; we currently index `title` and `abstract`):
- TREC-COVID: the corpus is available through `irds:cord19/trec-covid`, but the index still has to be built with `IterDictIndexer`.
- NFCorpus: we need to build our own index to run the PyTerrier experiments.

### TREC COVID

In [9]:
import random

trec_covid_dataset = pt.datasets.get_dataset('irds:cord19/trec-covid')
trec_covid_corpus = pd.DataFrame(trec_covid_dataset.get_corpus_iter())

trec_covid_queries = pd.DataFrame(trec_covid_dataset.get_topics('title'))
trec_covid_qrels = pd.DataFrame(trec_covid_dataset.get_qrels())

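# Randomly sample 5 queries for training (used as few-shot examples); the rest form the test set.
# NOTE: no random seed is set, so the split differs between runs.
train_indices = random.sample(range(len(trec_covid_queries)), 5)
trec_covid_queries_train = trec_covid_queries.iloc[train_indices]
trec_covid_queries_test = trec_covid_queries.drop(trec_covid_queries.index[train_indices])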

[INFO] [starting] building docstore
[INFO] If you have a local copy of https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-07-16/metadata.csv, you can symlink it here to avoid downloading it again: /root/.ir_datasets/downloads/80d664e496b8b7e50a39c6f6bb92e0ef
[INFO] [starting] https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-07-16/metadata.csv
docs_iter:   0%|                                    | 0/192509 [00:00<?, ?doc/s]
https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-07-16/metadata.csv: 0.0%| 0.00/269M [00:00<?, ?B/s][A
https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-07-16/metadata.csv: 1.5%| 3.96M/269M [00:00<00:07, 37.6MB/s][A
https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-07-16/metadata.csv: 3.6%| 9.64M/269M [00:00<00:05, 47.0MB/s][A
https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-07-16/metadata.csv: 6.2%| 16.8M/269M [00:00<00:04, 52.1MB/s][A
https://ai2-semanti

terrier-assemblies 5.11 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.8 jar not found, downloading to /root/.pyterrier...
Done


Java started (triggered by _pt_tokeniser) and loaded: pyterrier.java, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]
[INFO] [starting] https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt
[INFO] [finished] https://ir.nist.gov/covidSubmit/data/qrels-covid_d5_j0.5-5.txt: [00:00] [1.14MB] [3.84MB/s]
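
The split above calls `random.sample` without a seed, so the five training queries change from run to run. Below is a minimal sketch of a reproducible variant on plain lists. The seed value `42` and the helper name `split_train_test` are assumptions for illustration, not what was used for the reported results.

```python
import random

def split_train_test(items, n_train, seed=42):
    """Split `items` into (train, test) with a fixed seed (the seed is an assumed value)."""
    rng = random.Random(seed)  # local generator; leaves the global random state untouched
    train_idx = set(rng.sample(range(len(items)), n_train))
    train = [x for i, x in enumerate(items) if i in train_idx]
    test = [x for i, x in enumerate(items) if i not in train_idx]
    return train, test

qids = [str(i) for i in range(1, 51)]  # stand-in query ids
train, test = split_train_test(qids, n_train=5)
print(len(train), len(test))
print(split_train_test(qids, n_train=5) == (train, test))  # same seed, same split
```

Using a local `random.Random` instance keeps the split independent of any other code that draws from the global generator.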
                                                                                           

In [10]:
!rm -rf ./cord19-index
trec_covid_indexer = pt.index.IterDictIndexer('./cord19-index')
trec_covid_indexref = trec_covid_indexer.index(trec_covid_dataset.get_corpus_iter(), fields=('title', 'abstract'),  meta={"docno": 20, "title": 256, "abstract": 2048})
trec_covid_index = pt.IndexFactory.of(trec_covid_indexref)

cord19/trec-covid documents:   1%|          | 2182/192509 [00:03<02:39, 1193.68it/s]



cord19/trec-covid documents: 100%|██████████| 192509/192509 [01:16<00:00, 2506.14it/s]


14:35:42.191 [ForkJoinPool-1-worker-3] ERROR org.terrier.structures.indexing.Indexer -- Could not finish MetaIndexBuilder: 
java.io.IOException: Key 8lqzfj2e is not unique: 37597,11755
For MetaIndex, to suppress, set metaindex.compressed.reverse.allow.duplicates=true
	at org.terrier.structures.collections.FSOrderedMapFile$MultiFSOMapWriter.mergeTwo(FSOrderedMapFile.java:1374)
	at org.terrier.structures.collections.FSOrderedMapFile$MultiFSOMapWriter.close(FSOrderedMapFile.java:1308)
	at org.terrier.structures.indexing.BaseMetaIndexBuilder.close(BaseMetaIndexBuilder.java:321)
	at org.terrier.structures.indexing.classical.BasicIndexer.indexDocuments(BasicIndexer.java:270)
	at org.terrier.structures.indexing.classical.BasicIndexer.createDirectIndex(BasicIndexer.java:388)
	at org.terrier.structures.indexing.Indexer.index(Indexer.java:377)
	at org.terrier.python.ParallelIndexer$3.apply(ParallelIndexer.java:131)
	at org.terrier.python.ParallelIndexer$3.apply(ParallelIndexer.java:120)
	at java

### NF Corpus

In [11]:
### Train dataset
nf_dataset_train = pt.get_dataset("irds:nfcorpus/train")
nf_corpus_train = pd.DataFrame(nf_dataset_train.get_corpus_iter())

nf_queries_train = pd.DataFrame(nf_dataset_train.get_topics('title'))
nf_qrels_train = pd.DataFrame(nf_dataset_train.get_qrels())

### Test dataset
nf_dataset_test = pt.get_dataset("irds:nfcorpus/test")
nf_corpus_test = pd.DataFrame(nf_dataset_test.get_corpus_iter())

nf_queries_test = pd.DataFrame(nf_dataset_test.get_topics('title'))
nf_qrels_test = pd.DataFrame(nf_dataset_test.get_qrels())

[INFO] If you have a local copy of https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/nfcorpus.tar.gz, you can symlink it here to avoid downloading it again: /root/.ir_datasets/downloads/49c061fbadc52ba4d35d0e42e2d742fd
[INFO] [starting] https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/nfcorpus.tar.gz
nfcorpus/train documents:   0%|          | 0/5371 [00:00<?, ?it/s]
https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/nfcorpus.tar.gz: 0.0%| 0.00/31.0M [00:00<?, ?B/s][A
https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/nfcorpus.tar.gz: 0.0%| 8.19k/31.0M [00:00<08:39, 59.7kB/s][A
https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/nfcorpus.tar.gz: 0.1%| 41.0k/31.0M [00:00<03:33, 145kB/s] [A
https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/nfcorpus.tar.gz: 0.3%| 98.3k/31.0M [00:00<02:14, 229kB/s][A
https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/nfcorpus.tar.gz: 0.7%| 213k/31.0M [00:00<01:23, 371kB/s] [A
https://www.cl.uni-heidelberg.de/statnlpgroup/

In [12]:
!rm -rf ./nf_index
!rm -rf ./nf_index2

nf_indexer = pt.IterDictIndexer("./nf_index")
nf_indexref = nf_indexer.index(
    nf_dataset_test.get_corpus_iter(),
    fields=('title', 'abstract'),
    meta={"docno": 20, "title": 256, "abstract": 2048}
)
nf_index = pt.IndexFactory.of(nf_indexref)

nf_indexer2 = pt.IterDictIndexer("./nf_index2")
nf_indexref2 = nf_indexer2.index(nf_dataset_train.get_corpus_iter(), fields=('title', 'abstract'))
nf_index2 = pt.IndexFactory.of(nf_indexref2)

nfcorpus/test documents: 100%|██████████| 5371/5371 [00:02<00:00, 1850.54it/s]
nfcorpus/train documents: 100%|██████████| 5371/5371 [00:02<00:00, 2048.34it/s]


## Running Baseline Experiments

As specified in our report, we use the following evaluation metrics in all experiments.

In [13]:
eval_metrics = ["P.5", "P.10", "ndcg_cut.10", "map", "recip_rank", "recall_5", "recall_10"]
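
For readers unfamiliar with the rank-based measures in this list, the sketch below computes precision@k, recall@k and reciprocal rank for a single query in plain Python. It only illustrates the definitions under binary relevance; the experiments themselves rely on PyTerrier's evaluation, and the function names here are hypothetical.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def recall_at_k(ranking, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranking[:k] if d in relevant) / len(relevant)

def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

ranking = ["d3", "d1", "d7", "d2", "d9"]  # toy ranked list for one query
relevant = {"d1", "d2", "d4"}             # toy relevance judgements

print(precision_at_k(ranking, relevant, 5))  # 2 relevant documents in the top 5 -> 0.4
print(recall_at_k(ranking, relevant, 5))     # 2 of the 3 relevant documents are found
print(reciprocal_rank(ranking, relevant))    # first relevant document at rank 2 -> 0.5
```

The reported figures are these per-query values averaged over all queries in the test set.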

In [12]:
DPH_br = pt.BatchRetrieve(
    trec_covid_index,
    wmodel="DPH",
    metadata=["docno", "title", "abstract"],
    properties={"termpipelines": ""}
)
BM25_br = pt.BatchRetrieve(
    trec_covid_index,
    wmodel="BM25",
    metadata=["docno", "title", "abstract"],
    properties={"termpipelines": ""}
)
# this runs an experiment to obtain results on the TREC COVID queries and qrels
pt.Experiment(
    [DPH_br, BM25_br],
    trec_covid_queries_test,
    trec_covid_dataset.get_qrels(),
    eval_metrics=eval_metrics)





| name | P.5 | P.10 | ndcg_cut.10 | map | recip_rank | recall_5 | recall_10 | recall_100 | IPrec@0.0 | IPrec@0.1 | IPrec@0.2 | IPrec@0.3 | IPrec@0.4 | IPrec@0.5 | IPrec@0.6 | IPrec@0.7 | IPrec@0.8 | IPrec@0.9 | IPrec@1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TerrierRetr(DPH) | 0.675556 | 0.673333 | 0.595867 | 0.210167 | 0.795596 | 0.008111 | 0.015767 | 0.112434 | 0.834433 | 0.524663 | 0.432141 | 0.31861 | 0.233151 | 0.158181 | 0.084041 | 0.013007 | 0.0 | 0.0 | 0.0 |
| TerrierRetr(BM25) | 0.662222 | 0.666667 | 0.590693 | 0.213211 | 0.802327 | 0.008012 | 0.015767 | 0.108127 | 0.840001 | 0.518992 | 0.418416 | 0.323877 | 0.238464 | 0.189812 | 0.113398 | 0.030893 | 0.009043 | 0.0 | 0.0 |


In [14]:
DPH_br = pt.BatchRetrieve(
    nf_index,
    wmodel="DPH",
    metadata=["docno", "title", "abstract"],
    properties={"termpipelines": ""}
)
BM25_br = pt.BatchRetrieve(
    nf_index,
    wmodel="BM25",
    metadata=["docno", "title", "abstract"],
    properties={"termpipelines": ""}
)
# this runs an experiment to obtain results on the NFCorpus queries and qrels
pt.Experiment(
    [DPH_br, BM25_br],
    nf_dataset_test.get_topics('title'),
    nf_dataset_test.get_qrels(),
    eval_metrics=eval_metrics)



| name | P.5 | P.10 | ndcg_cut.10 | map | recip_rank | recall_5 | recall_10 | recall_100 | IPrec@0.0 | IPrec@0.1 | IPrec@0.2 | IPrec@0.3 | IPrec@0.4 | IPrec@0.5 | IPrec@0.6 | IPrec@0.7 | IPrec@0.8 | IPrec@0.9 | IPrec@1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TerrierRetr(DPH) | 0.252308 | 0.192615 | 0.262754 | 0.110582 | 0.478754 | 0.095377 | 0.117495 | 0.200592 | 0.501839 | 0.271048 | 0.188443 | 0.126374 | 0.091643 | 0.07358 | 0.04988 | 0.046783 | 0.035979 | 0.03239 | 0.032214 |
| TerrierRetr(BM25) | 0.257846 | 0.196308 | 0.267611 | 0.113329 | 0.488544 | 0.098998 | 0.119202 | 0.201033 | 0.510546 | 0.275142 | 0.191756 | 0.131521 | 0.095264 | 0.078237 | 0.052419 | 0.04736 | 0.036586 | 0.0333 | 0.033124 |


# Part 2: LLM-based Query Expansion

## Expansion Code

### Few-shot expansion

In [None]:
def get_few_shot_queries_tuples(queries, qrels, corpus):
    """Pair each query with the title of its highest-labelled relevant document."""
    query_title_tuples = []

    for _, query_it in queries.iterrows():
        qid = query_it["qid"]
        query = query_it["query"]
        docs = qrels[qrels["qid"] == qid]

        # Take the document with the highest relevance label for this query.
        most_relevant_docno = docs.sort_values(by="label", ascending=False).iloc[0]["docno"]
        most_relevant_doc = corpus[corpus["docno"] == most_relevant_docno]

        query_title_tuples.append((query, most_relevant_doc["title"].tolist()[0]))

    return query_title_tuples

few_shot_samples_covid = get_few_shot_queries_tuples(trec_covid_queries_train, trec_covid_qrels, trec_covid_corpus)
few_shot_samples_nf = get_few_shot_queries_tuples(nf_queries_train, nf_qrels_train, nf_corpus_train)

#print(query_title_tuples_covid)
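
The `(query, title)` pairs collected here are later turned into a prompt of alternating `query:` / `title:` lines that ends with an open `title:` for the query to expand. The sketch below isolates that string-building step so it can be inspected without calling the API. `build_few_shot_prompt` is a hypothetical helper and the sample pairs are made up.

```python
def build_few_shot_prompt(few_shot_samples, query_to_expand):
    """Concatenate example (query, title) pairs and leave the final title open."""
    header = ("You are a scientist. Come up with the title of an article "
              "that answers the following query.")
    lines = [header]
    for query, title in few_shot_samples:
        lines.append(f"query: {query}")
        lines.append(f"title: {title}")
    lines.append(f"query: {query_to_expand}")
    lines.append("title:")
    return "\n".join(lines)

# Made-up pairs, for illustration only.
samples = [
    ("vitamin d and immunity", "Vitamin D status and immune response"),
    ("mask effectiveness", "Efficacy of face masks against respiratory infection"),
]
print(build_few_shot_prompt(samples, "coronavirus origin"))
```

Printing the prompt for a handful of queries is a cheap way to check the few-shot examples before spending API quota.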

In [None]:
import time
import pandas as pd
from google import genai
from google.genai import types

def expand_query_k_shot(few_shot_samples, query_to_expand, max_retries=6):
    samples_prompt = "You are a scientist. Come up with the title of an article that answers the following query."

    client = genai.Client(api_key="your-key-here")
    
    for query, title in few_shot_samples:
        samples_prompt += f"\nquery: {query}\ntitle: {title}"

    prompt = samples_prompt + f"\nquery: {query_to_expand}\ntitle:"
    
    attempt = 0
    while attempt < max_retries:
        try:
            response = client.models.generate_content(
                model="gemini-2.0-flash", 
                config=types.GenerateContentConfig(
                    system_instruction="You are a scientist. Come up with only one title of an article that answers the following query. Do not use any special characters."
                ),
                contents=prompt
            )
            result = f"{query_to_expand} {response.text}"
            return result
        
        except Exception as e:
            # Any failure (rate limit, network error, ...) triggers a retry with exponential backoff.
            wait_time = 2 ** attempt
            print(f"Request failed ({e!r}). Retrying in {wait_time} seconds...")
            time.sleep(wait_time)
            attempt += 1
    
    print("Max retries reached. Failed to generate expanded query.")
    return None

def expand_queries_k_shot(queries_to_expand, few_shot_samples):
    expanded_queries = []
    
    for _, row in queries_to_expand.iterrows():
        qid = row['qid']
        query = row['query']
        expanded_query = expand_query_k_shot(few_shot_samples, query)
        if expanded_query:
            expanded_queries.append({'qid': qid, 'query': expanded_query})
        else:
            expanded_queries.append({'qid': qid, 'query': query})  # Fallback to original query if expansion fails
    
    return expanded_queries
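
The expansion function above retries failed API calls with exponentially growing waits (1, 2, 4, ... seconds). The same pattern can be factored into a small helper, sketched below. `call_with_backoff` is a hypothetical name, and the `sleep` parameter exists only so the behaviour can be checked without actually waiting.

```python
import time

def call_with_backoff(fn, max_retries=6, sleep=time.sleep):
    """Call `fn()` until it succeeds, waiting 2**attempt seconds after each failure.

    Returns the result of `fn`, or None once `max_retries` attempts have failed.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:  # broad on purpose, mirroring the cell above
            wait_time = 2 ** attempt
            print(f"Call failed ({exc!r}); retrying in {wait_time} seconds...")
            sleep(wait_time)
    return None

# Usage with a stub that fails twice before succeeding.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

waits = []
result = call_with_backoff(flaky, sleep=waits.append)
print(result, waits)  # succeeds on the third call, after waits of 1 and 2 seconds
```

Returning `None` on exhaustion matches the fallback in `expand_queries_k_shot`, which keeps the original query when expansion fails.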

In [None]:
print("Expanding covid queries...")
expanded_queries_covid_few_shot = expand_queries_k_shot(pd.DataFrame(trec_covid_queries_test), few_shot_samples_covid)

print("Expanding nf queries...")
expanded_queries_nf_few_shot = expand_queries_k_shot(pd.DataFrame(nf_dataset_test.get_topics('title')), few_shot_samples_nf)

### Zero-shot expansion

In [None]:
def expand_query_zero_shot(query_to_expand, max_retries=5):
    # Zero-shot: no example pairs are sent; the instruction is passed as system_instruction below.

    client = genai.Client(api_key="your-key-here")

    prompt = f"query: {query_to_expand}\ntitle:"
    
    attempt = 0
    while attempt < max_retries:
        try:
            response = client.models.generate_content(
                model="gemini-2.0-flash", 
                config=types.GenerateContentConfig(
                    system_instruction="You are a scientist. Come up with only one title of an article that answers the following query. Do not use any special characters."
                ),
                contents=prompt
            )
            result = f"{query_to_expand} {response.text}"
            return result
        
        except Exception as e:
            # Any failure (rate limit, network error, ...) triggers a retry with exponential backoff.
            wait_time = 2 ** attempt
            print(f"Request failed ({e!r}). Retrying in {wait_time} seconds...")
            time.sleep(wait_time)
            attempt += 1
    
    print("Max retries reached. Failed to generate expanded query.")
    return None

def expand_queries_zero_shot(queries_to_expand):
    expanded_queries = []
    
    for _, row in queries_to_expand.iterrows():
        qid = row['qid']
        query = row['query']
        expanded_query = expand_query_zero_shot(query)
        if expanded_query:
            expanded_queries.append({'qid': qid, 'query': expanded_query})
        else:
            expanded_queries.append({'qid': qid, 'query': query})  # Fallback to original query if expansion fails
    
    return expanded_queries

In [None]:
print("Expanding covid queries...")
expanded_queries_covid_zero_shot = expand_queries_zero_shot(pd.DataFrame(trec_covid_queries_test))

print("Expanding nf queries...")
expanded_queries_nf_zero_shot = expand_queries_zero_shot(pd.DataFrame(nf_dataset_test.get_topics('title')))

In [None]:
import re 

# DPH_br = pt.terrier.Retriever(nf_index, wmodel="DPH") % 100
# BM25_br = pt.terrier.Retriever(nf_index, wmodel="BM25") % 100

def escape_pyterrier_query(query: str) -> str:
    """
    Escapes and sanitizes a query string to prevent PyTerrier parsing errors.
    """
    # Remove markdown-like formatting (e.g., **bold**, *italic*)
    query = re.sub(r'\*+', '', query)  # Remove asterisks
    
    # Remove other non-alphanumeric characters that could cause issues
    query = re.sub(r'["\'{}<>|:\[\]]', ' ', query)  # Remove problematic symbols including quotes and colons
    
    
    return query


def escape_pyterrier_queries(queries):
    """
    Apply the escaping function to a list of queries.
    """
    return [{'qid': query['qid'], 'query': escape_pyterrier_query(query['query'])} for query in queries]
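
The pattern above removes a fixed set of symbols. Other characters that may carry meaning in Terrier's query syntax, such as parentheses, `^`, `+` or `~`, pass through unchanged. A stricter alternative, sketched here as an assumption rather than the method used for the reported runs, keeps only letters, digits and whitespace. `sanitize_query_strict` is a hypothetical helper.

```python
import re

def sanitize_query_strict(query: str) -> str:
    """Replace everything except letters, digits and whitespace, then squeeze spaces."""
    query = re.sub(r"[^\w\s]|_", " ", query)  # drop punctuation and underscores
    return re.sub(r"\s+", " ", query).strip()

print(sanitize_query_strict('**COVID-19** (SARS-CoV-2): "origin" + spread?\n'))
# COVID 19 SARS CoV 2 origin spread
```

Note that this also splits hyphenated terms such as `COVID-19` into separate tokens, which may or may not be desirable depending on how the index was tokenised.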

## Experiments

In [None]:
DPH_br = pt.BatchRetrieve(
    trec_covid_index,
    wmodel="DPH",
    metadata=["docno", "title", "abstract"],
    properties={"termpipelines": ""}
)
BM25_br = pt.BatchRetrieve(
    trec_covid_index,
    wmodel="BM25",
    metadata=["docno", "title", "abstract"],
    properties={"termpipelines": ""}
)

### Few-Shot

#### TREC-COVID

In [None]:
trec_covid_results_few_shot = pt.Experiment(
    [DPH_br, BM25_br],
    pd.DataFrame(escape_pyterrier_queries(expanded_queries_covid_few_shot)),
    trec_covid_qrels,
    eval_metrics=eval_metrics)

trec_covid_results_few_shot

#### NFCorpus

In [None]:
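# DPH_br and BM25_br above are built on the TREC-COVID index; NFCorpus needs retrievers on nf_index.
nf_DPH_br = pt.BatchRetrieve(nf_index, wmodel="DPH", metadata=["docno", "title", "abstract"], properties={"termpipelines": ""})
nf_BM25_br = pt.BatchRetrieve(nf_index, wmodel="BM25", metadata=["docno", "title", "abstract"], properties={"termpipelines": ""})
nf_results_few_shot = pt.Experiment(
    [nf_DPH_br, nf_BM25_br],
    pd.DataFrame(escape_pyterrier_queries(expanded_queries_nf_few_shot)),
    nf_dataset_test.get_qrels(),
    eval_metrics=eval_metrics)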

nf_results_few_shot

In [None]:
trec_covid_results_few_shot.to_csv('run-trec-output.csv', index=False)
nf_results_few_shot.to_csv('run-nf-output.csv', index=False)

### Zero-Shot

#### TREC-COVID

In [None]:
trec_covid_results_zero_shot = pt.Experiment(
    [DPH_br, BM25_br],
    pd.DataFrame(escape_pyterrier_queries(expanded_queries_covid_zero_shot)),
    trec_covid_qrels,
    eval_metrics=eval_metrics)

trec_covid_results_zero_shot

#### NFCorpus

In [None]:
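# Use retrievers built on nf_index (not the TREC-COVID ones) and the zero-shot expanded queries.
nf_DPH_br = pt.BatchRetrieve(nf_index, wmodel="DPH", metadata=["docno", "title", "abstract"], properties={"termpipelines": ""})
nf_BM25_br = pt.BatchRetrieve(nf_index, wmodel="BM25", metadata=["docno", "title", "abstract"], properties={"termpipelines": ""})
nf_results_zero_shot = pt.Experiment(
    [nf_DPH_br, nf_BM25_br],
    pd.DataFrame(escape_pyterrier_queries(expanded_queries_nf_zero_shot)),
    nf_dataset_test.get_qrels(),
    eval_metrics=eval_metrics)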

nf_results_zero_shot

## Expansion Code (Word2Vec)

In [20]:
import pyterrier as pt
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import gensim.downloader as api
from gensim.models import Word2Vec
from gensim.models import KeyedVectors


#from gensim.models import Word2Vec

# Download required NLTK data
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')


# Load pre-trained word embedding model
#word2vec_model = api.load("glove-wiki-gigaword-100")
#word2vec_model = KeyedVectors.load_word2vec_format("BioWordVec_PubMed_MIMICIII_d200.vec.bin", binary=True)

dataset_dev = pt.datasets.get_dataset("irds:nfcorpus/dev")

corpus_texts = [
    f"{doc['title']} {doc['abstract']}"
    for doc in dataset_dev.get_corpus_iter()
    if 'title' in doc and 'abstract' in doc
]

from nltk.tokenize import word_tokenize

tokenized_corpus = [
    [w.lower() for w in word_tokenize(text) if w.isalpha()]
    for text in corpus_texts
]
word2vec_model = Word2Vec(
    sentences=tokenized_corpus,
    vector_size=100,
    window=5,
    min_count=2,
    workers=4
)


stop_words = set(stopwords.words('english'))
topics_df = pd.DataFrame(nf_dataset_test.get_topics('title'))

def expand_query_word2vec(query: str, topn=2):
    tokens = [w.lower() for w in word_tokenize(query) if w.isalpha() and w.lower() not in stop_words]
    expanded = []
    for term in tokens:
        expanded.append(term)  # keep the original term (added once; no extra weighting is applied)
        try:
            for w, sim in word2vec_model.wv.most_similar(term, topn=topn):
                if sim > 0.65:
                    expanded.append(w)
        except KeyError:
            continue
    return " ".join(expanded)


expanded_queries_w2v = []
for _, row in topics_df.iterrows():
    qid = row['qid']
    query = row['query']
    exp_query = expand_query_word2vec(query)
    expanded_queries_w2v.append({'qid': qid, 'query': exp_query})
expanded_queries_df = pd.DataFrame(expanded_queries_w2v)

expanded_queries_w2v_df = pd.DataFrame(expanded_queries_w2v)

ImportError: cannot import name 'triu' from 'scipy.linalg' (/usr/local/lib/python3.10/dist-packages/scipy/linalg/__init__.py)
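
The expansion rule in `expand_query_word2vec` is: keep each non-stopword query term, then append up to `topn` nearest neighbours whose similarity exceeds 0.65. The sketch below replays that rule against a hand-written neighbour table so it runs without gensim. The table, its scores and the tiny stopword set are invented for illustration, and whitespace splitting stands in for `word_tokenize`.

```python
# Hypothetical nearest-neighbour table: term -> [(neighbour, cosine similarity), ...]
NEIGHBOURS = {
    "cancer": [("tumor", 0.81), ("carcinoma", 0.74), ("risk", 0.40)],
    "diet": [("nutrition", 0.70), ("food", 0.60)],
}
STOP_WORDS = {"and", "the", "of"}  # tiny stand-in for NLTK's stopword list

def expand_query_toy(query, topn=2, threshold=0.65):
    """Append neighbours above `threshold`, considering only the top `topn` per term."""
    expanded = []
    for term in query.lower().split():
        if not term.isalpha() or term in STOP_WORDS:
            continue
        expanded.append(term)
        # Unknown terms get no expansion (the KeyError branch in the cell above).
        for word, sim in NEIGHBOURS.get(term, [])[:topn]:
            if sim > threshold:
                expanded.append(word)
    return " ".join(expanded)

print(expand_query_toy("diet and cancer"))
# diet nutrition cancer tumor carcinoma
```

Here `food` is dropped because its score is below the threshold, and `risk` is never considered because it falls outside the top two neighbours.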

In [None]:
import re

# Retrievers over the NFCorpus index, limited to the top 100 results per query.
# DPH_br is redefined here because the earlier DPH_br points at the TREC-COVID index.
DPH_br = pt.terrier.Retriever(nf_index, wmodel="DPH") % 100
BM25_br = pt.terrier.Retriever(nf_index, wmodel="BM25") % 100


def escape_pyterrier_query(query: str) -> str:
    """
    Escapes and sanitizes a query string to prevent PyTerrier parsing errors.
    """
    # Remove markdown-like formatting (e.g., **bold**, *italic*)
    query = re.sub(r'\*+', '', query)  # Remove asterisks

    # Remove other non-alphanumeric characters that could cause issues
    query = re.sub(r'[\':"{}<>|\[\]]', ' ', query)  # Remove problematic symbols including brackets

    # Replace multiple consecutive spaces with a single space
    query = re.sub(r'\s+', ' ', query).strip()

    # Optional: remove the literal tokens "TITLE" / "SEP" (case insensitive); currently disabled.
    #query = re.sub(r'\bTITLE\b', '', query, flags=re.IGNORECASE)
    #query = re.sub(r'\bSEP\b', '', query, flags=re.IGNORECASE)

    return query


def escape_pyterrier_queries(queries):
    """
    Apply the escaping function to a list of queries.
    """
    return [{'qid': query['qid'], 'query': escape_pyterrier_query(query['query'])} for query in queries]

## Experiment (Word2Vec)

In [None]:
pt.Experiment(
    [DPH_br, BM25_br],
    pd.DataFrame(escape_pyterrier_queries(expanded_queries_w2v)),
    nf_dataset_test.get_qrels(),
    eval_metrics=eval_metrics)

## Expansion Code (pseudo-relevance feedback)

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import numpy as np

nltk.download('wordnet')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

stop_words = set(stopwords.words("english"))

def clean_tokens(text):
    return set(
        w.lower() for w in word_tokenize(text)
        if w.isalpha() and w.lower() not in stop_words
    )


# === PRF-BASED QUERY EXPANSION FUNCTION ===
def expand_query_prf(query: str, index, bm25, k=5):
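    """
    Performs PRF expansion: retrieves the top-k BM25 documents, picks the one most similar
    to the query under TF-IDF cosine similarity, and appends up to 5 of its terms that are
    not already in the query.
    """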

    res = bm25.search(query).head(k)
    medical_docs = (res["title"] + " " + res["abstract"]).tolist()

    if not medical_docs:
        return query


    vectorizer = TfidfVectorizer(stop_words='english')
    doc_vectors = vectorizer.fit_transform(medical_docs)
    query_vector = vectorizer.transform([query])

    cosine_similarities = cosine_similarity(query_vector, doc_vectors).flatten()
    top_doc_index = cosine_similarities.argmax()
    top_doc = medical_docs[top_doc_index]


    top_doc_tokens = clean_tokens(top_doc)
    query_tokens = clean_tokens(query)
    expansion_terms = top_doc_tokens.difference(query_tokens)



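    # NOTE: expansion_terms is a set, so which 5 terms are kept is arbitrary and may vary between runs.
    expansion_terms = list(expansion_terms)[:5]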

    expanded_query = query + " " + " ".join(expansion_terms)


    return expanded_query

[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /usr/share/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
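
`expand_query_prf` takes the first five elements of a Python `set`, whose iteration order for strings can differ between interpreter runs, so the chosen expansion terms are not guaranteed to be stable. One possible way to make the choice deterministic, sketched below as an assumption and not the procedure behind the reported numbers, is to rank candidate terms by how often they occur in the feedback document and break ties alphabetically. `pick_expansion_terms` is a hypothetical helper that operates on already-tokenised text.

```python
from collections import Counter

def pick_expansion_terms(doc_tokens, query_tokens, n_terms=5):
    """Return the `n_terms` most frequent document terms that are not in the query."""
    query_set = set(query_tokens)
    counts = Counter(t for t in doc_tokens if t not in query_set)
    # Sort by descending frequency, then alphabetically, so the result is reproducible.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n_terms]]

doc = "statin therapy lowers cholesterol statin use reduces cholesterol risk".split()
query = ["cholesterol"]
print(pick_expansion_terms(doc, query, n_terms=3))
# ['statin', 'lowers', 'reduces']
```

Any fixed ordering would make the run repeatable; frequency is used here only because it is a common heuristic for feedback terms.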


In [22]:
BM25_br = pt.BatchRetrieve(
    nf_index,
    wmodel="BM25",
    metadata=["docno", "title", "abstract"],
    properties={"termpipelines": ""}
)

def prf_expand_df(df):
    return pd.DataFrame([
        {
            "qid": row["qid"],
            "query": expand_query_prf(row["query"], index=nf_index, bm25=BM25_br)
        }
        for _, row in df.iterrows()
    ])


def prf_wrapper(query_obj):
    query = query_obj["query"]
    qid = query_obj["qid"]
    new_query = expand_query_prf(query, index=nf_index, bm25=BM25_br)
    return pd.DataFrame([{"qid": qid, "query": new_query}])

prf_pipe = pt.apply.generic(prf_expand_df) >> BM25_br




## Experiment (pseudo-relevance feedback)

In [23]:
topics_df = pd.DataFrame(nf_dataset_test.get_topics('title'))

pt.Experiment(
    [BM25_br, prf_pipe],
    topics_df,
    nf_dataset_test.get_qrels(),
    eval_metrics=eval_metrics)


| name | P.5 | P.10 | ndcg_cut.10 | map | recip_rank |
|---|---|---|---|---|---|
| TerrierRetr(BM25) | 0.257846 | 0.196308 | 0.267611 | 0.113329 | 0.488544 |
| `(pt.apply.generic() >> TerrierRetr(BM25))` | 0.230769 | 0.171692 | 0.244468 | 0.105451 | 0.491109 |
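
To put these results in relative terms, the snippet below computes the percentage change of the PRF pipeline against the BM25 baseline for each reported metric. The numbers are copied from the results above; only the helper name `relative_change` is hypothetical.

```python
baseline = {"P.5": 0.257846, "P.10": 0.196308, "ndcg_cut.10": 0.267611,
            "map": 0.113329, "recip_rank": 0.488544}
prf = {"P.5": 0.230769, "P.10": 0.171692, "ndcg_cut.10": 0.244468,
       "map": 0.105451, "recip_rank": 0.491109}

def relative_change(new, old):
    """Percentage change of `new` with respect to `old`."""
    return 100.0 * (new - old) / old

for metric in baseline:
    print(f"{metric:12s} {relative_change(prf[metric], baseline[metric]):+.2f}%")
```

On these figures PRF lowers every metric except reciprocal rank, which rises slightly.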


# Part 3: Synonym-based Query Expansion

## Expansion Code

In [None]:
import nltk
from nltk.corpus import wordnet
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
nltk.download('stopwords')  # Download if not already available

STOPWORDS = set(stopwords.words('english'))
# Download required NLTK data
nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('wordnet2022')

# nlp = load('en_core_web_sm')
! cp -rf /usr/share/nltk_data/corpora/wordnet2022 /usr/share/nltk_data/corpora/wordnet # temp fix for lookup error.

tokenizer = pt.java.autoclass("org.terrier.indexing.tokenisation.Tokeniser").getTokeniser()
def strip_markup(text):
    return " ".join(tokenizer.getTokens(text))

# 1. Synonym-Based Expansion (WordNet)
def get_synonyms(word):
    synonyms = set()
    for syn in wordnet.synsets(word.lower()):  # lowercase for consistency
        for lemma in syn.lemmas():
            synonym = lemma.name().replace('_', ' ')
            if synonym != word.lower():  # avoid adding the original word
                synonyms.add(synonym)
    return synonyms

def get_filtered_synonyms(term, min_synset_length=3):
    """Get single-word noun synonyms from WordNet (min_synset_length is currently unused)."""
    term = term.lower()
    synonyms = set()
    
    for syn in wordnet.synsets(term, pos=['n']):
        # Only keep synonyms from "large enough" synsets
        # if len(syn.lemmas()) >= 2:
        for lemma in syn.lemmas():
            synonym = lemma.name().replace('_', ' ').lower()
            if synonym != term and len(synonym.split()) == 1:  # Exclude multi-word synonyms
                synonyms.add(synonym)
    return synonyms

def update_queries(queries_df, function_to_update):    
    expanded_queries = []
    
    for _, row in queries_df.iterrows():
        qid = row['qid']
        query = row['query']
        expanded_terms = set(query.lower().split())  # start with original terms
        
        for term in query.split():
            term = term.lower()  
            if term.isalpha() and term not in STOPWORDS:  # expand only alphabetic, non-stopword terms
                expanded_terms.update(function_to_update(term))
        expanded_queries.append({'qid': qid, 'query': strip_markup(' '.join(expanded_terms))})
    
    return pd.DataFrame(expanded_queries)

In [None]:
trec_queries_similarity = update_queries(trec_covid_dataset.get_topics('title'), get_filtered_synonyms)
nf_queries_similarity = update_queries(nf_dataset_test.get_topics('title'), get_filtered_synonyms)
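
`update_queries` accepts any function that maps a term to a set of synonyms, which makes the expansion logic easy to test in isolation. The sketch below swaps WordNet for a small hand-written thesaurus (invented for illustration) and sorts the final terms so the output string is stable. `toy_synonyms` and `expand_terms` are hypothetical names.

```python
# Hypothetical thesaurus standing in for WordNet lookups.
THESAURUS = {
    "virus": {"pathogen"},
    "origin": {"source", "beginning"},
}

def toy_synonyms(term):
    """Return the synonyms of `term`, or an empty set if it is unknown."""
    return THESAURUS.get(term.lower(), set())

def expand_terms(query, synonym_fn):
    """Union of the original query terms and the synonyms of each alphabetic term."""
    terms = query.lower().split()
    expanded = set(terms)
    for term in terms:
        if term.isalpha():
            expanded.update(synonym_fn(term))
    return " ".join(sorted(expanded))  # sorted for a reproducible string

print(expand_terms("virus origin", toy_synonyms))
# beginning origin pathogen source virus
```

Because the retrieval models used here score queries as bags of words, sorting the terms should not affect ranking; it only makes the expanded queries easier to compare between runs.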

## Experiment

In [None]:
DPH_br = pt.BatchRetrieve(
    trec_covid_index,
    wmodel="DPH",
    metadata=["docno", "title", "abstract"],
    properties={"termpipelines": ""}
)
BM25_br = pt.BatchRetrieve(
    trec_covid_index,
    wmodel="BM25",
    metadata=["docno", "title", "abstract"],
    properties={"termpipelines": ""}
)
# this runs an experiment to obtain results on the TREC COVID queries and qrels
trec_covid_results = pt.Experiment(
    [DPH_br, BM25_br],
    pd.DataFrame(update_queries(trec_covid_dataset.get_topics('title'), get_filtered_synonyms)),
    trec_covid_qrels,
    eval_metrics=eval_metrics)

In [None]:
trec_covid_results

DPH_br = pt.BatchRetrieve(
    nf_index,
    wmodel="DPH",
    metadata=["docno", "title", "abstract"],
    properties={"termpipelines": ""}
)
BM25_br = pt.BatchRetrieve(
    nf_index,
    wmodel="BM25",
    metadata=["docno", "title", "abstract"],
    properties={"termpipelines": ""}
)
# this runs an experiment to obtain results on the NFCorpus queries and qrels
nf_results = pt.Experiment(
    [DPH_br, BM25_br],
    pd.DataFrame(update_queries(nf_dataset_test.get_topics('title'), get_filtered_synonyms)),
    nf_dataset_test.get_qrels(),
    eval_metrics=eval_metrics)

nf_results

# Part 4: Ontology-based Query Expansion

## Expansion Code

In [None]:
import re
import time

import nltk
import pandas as pd
import requests
from nltk.stem import WordNetLemmatizer
from owlready2 import get_ontology

nltk.download('wordnet')
nltk.download('omw-1.4')
# -o overwrites existing files without prompting, so the cell cannot block on unzip's interactive "replace?" question
!unzip -o -q /usr/share/nltk_data/corpora/wordnet.zip -d /usr/share/nltk_data/corpora/

# Ontology: local SNOMED OWL file
ontology = get_ontology("/kaggle/input/snomed/other/default/1/SCTO.owl").load()
lemmatizer = WordNetLemmatizer()

BLACKLIST_TERMS = {"disorder", "condition", "disease", "has", "type", "system", "value", "within", "structure"}
MAX_EXPANSIONS_PER_TERM = 3


# --- Utility functions ---
def normalize(text):
    return lemmatizer.lemmatize(text.lower())


def clean_query(text):
    text = re.sub(r"[^\w\s\-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def get_related_terms_from_ontology(term, ontology):
    related_terms = set()
    term = normalize(term)

    for cls in ontology.classes():
        labels = [normalize(str(label)) for label in cls.label]

        # Allow partial match
        if any(term in label for label in labels):
            related_terms.update(labels)

            # Add 1-hop parent labels
            for parent in cls.is_a:
                if hasattr(parent, 'label'):
                    related_terms.update(normalize(str(label)) for label in parent.label)

            break  # Stop after first matching class for precision

    # Filter out generic terms
    return {
        t for t in related_terms
        if len(t) > 3 and t not in BLACKLIST_TERMS
    }


def expand_queries_with_ontology(queries_df, ontology, max_expansions=MAX_EXPANSIONS_PER_TERM):
    expanded_queries = []

    for _, row in queries_df.iterrows():
        qid = row['qid']
        query = row['query']
        original_terms = set(normalize(term) for term in query.split())

        expanded_terms = set(original_terms)
        for term in original_terms:
            # sort before truncating so the kept expansions are stable across runs
            related = sorted(get_related_terms_from_ontology(term, ontology))
            expanded_terms.update(related[:max_expansions])

        expanded_query = clean_query(" ".join(sorted(expanded_terms)))
        expanded_queries.append({'qid': qid, 'query': expanded_query})

    return pd.DataFrame(expanded_queries)


BIOPORTAL_API_KEY = "73a8f3f3-a0c5-4b17-98aa-2f8898715364"
cache = {}
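# NOTE: this rebinds BLACKLIST_TERMS, so get_related_terms_from_ontology above also filters with this set once the cell has run
BLACKLIST_TERMS = {"has", "location", "attribute", "structure", "within", "of", "in", "disorder", "diseases", "value", "and", "or"}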

def is_valid_term(term):
    return len(term) > 3 and not term.isnumeric() and term.lower() not in BLACKLIST_TERMS

def clean_terrier_query(text):
    text = re.sub(r"[^\w\s\-]", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def filter_expansion_terms(terms):
    return {
        t for t in terms 
        if is_valid_term(t) and not any(bl in t.lower() for bl in BLACKLIST_TERMS)
    }

def get_bioportal_expansion(term, ontology="SNOMEDCT,MESH,DOID", retries=3):
    term = term.lower()
    if term in cache:
        return cache[term]

    # Pass the query as params so requests URL-encodes the term
    url = "http://data.bioontology.org/search"
    params = {'q': term, 'ontologies': ontology, 'require_exact_match': 'true'}
    headers = {'Authorization': f'apikey token={BIOPORTAL_API_KEY}'}
    related_terms = set()

    for attempt in range(retries):
        try:
            response = requests.get(url, params=params, headers=headers, timeout=5)
            if response.status_code == 200:
                data = response.json()
                for result in data.get('collection', []):
                    if 'synonym' in result:
                        related_terms.update(s.lower() for s in result['synonym'])
                    if 'prefLabel' in result:
                        related_terms.add(result['prefLabel'].lower())
                break
            else:
                print(f"⚠️ API error ({response.status_code}) for term: {term}")
        except requests.exceptions.Timeout:
            print(f"⏱️ Timeout on term: {term}, retrying ({attempt+1})...")
            time.sleep(1)

    # Filter & limit expansions (sorted first so the kept terms are stable across runs)
    filtered = filter_expansion_terms(related_terms)
    limited_terms = set(sorted(filtered)[:MAX_EXPANSIONS_PER_TERM])
    cache[term] = limited_terms
    time.sleep(0.1)
    return limited_terms

def expand_queries_with_bioportal(queries_df, ontology="SNOMEDCT,MESH,DOID"):
    all_terms = set()
    for query in queries_df["query"]:
        all_terms.update(query.lower().split())

    print(f"🔍 Expanding {len(all_terms)} unique terms...")

    for term in all_terms:
        if is_valid_term(term):
            get_bioportal_expansion(term, ontology)

    expanded_queries = []
    for _, row in queries_df.iterrows():
        qid = row['qid']
        query = row['query']
        expanded_terms = set(query.lower().split())
        for term in query.lower().split():
            if is_valid_term(term):
                expanded_terms.update(cache.get(term, []))
        expanded_query = clean_terrier_query(" ".join(sorted(expanded_terms)))
        expanded_queries.append({'qid': qid, 'query': expanded_query})

    return pd.DataFrame(expanded_queries)

UMLS_APIKEY = "1c29b5c5-1f97-4395-9aa5-e0a8afd525b0"
AUTH_ENDPOINT = "https://utslogin.nlm.nih.gov"
API_ENDPOINT = "https://uts-ws.nlm.nih.gov"
# 🔐 Step 1: Get TGT (Ticket-Granting Ticket)
def get_tgt(api_key):
    params = {'apikey': api_key}
    headers = {'Content-Type': 'application/x-www-form-urlencoded'}
    response = requests.post(f"{AUTH_ENDPOINT}/cas/v1/api-key", data=params, headers=headers)
    if response.status_code == 201:
        tgt_url = re.search(r'action="(.+?)"', response.text).group(1)
        return tgt_url
    else:
        raise Exception("Failed to get TGT")

# 🎫 Step 2: Get ST (Service Ticket)
def get_st(tgt):
    params = {'service': 'http://umlsks.nlm.nih.gov'}
    response = requests.post(tgt, data=params)
    return response.text

# 🔍 Step 3: Get synonyms from UMLS
def umls_expand_term(term, tgt, max_expansions=1):
    # Each REST call needs its own service ticket
    st = get_st(tgt)
    url = f"{API_ENDPOINT}/rest/search/current"
    response = requests.get(url, params={'string': term, 'ticket': st, 'pageSize': 1})
    data = response.json()
    try:
        cui = data['result']['results'][0]['ui']
        if cui == "NONE":
            return set()
    except (KeyError, IndexError, TypeError):
        # no usable search result for this term
        return set()

    st = get_st(tgt)
    url = f"{API_ENDPOINT}/rest/content/current/CUI/{cui}/atoms"
    response = requests.get(url, params={'ticket': st, 'language': 'ENG'})
    atoms = response.json().get("result", [])

    synonyms = set()
    for atom in atoms:
        name = atom.get("name", "").lower()
        # skip very short names and names that merely contain the original term
        if len(name) > 3 and term.lower() not in name:
            synonyms.add(name)
    # sort before truncating so the selected synonyms are stable across runs
    return set(sorted(synonyms)[:max_expansions])

# 🧼 Sanitize query text
def clean_query(text):
    text = re.sub(r"[^\w\s\-]", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

# 🧠 Expand all queries in a DataFrame
def expand_queries_with_UMLS(queries_df, max_expansions=3):
    tgt = get_tgt(UMLS_APIKEY)
    all_terms = set()
    for query in queries_df["query"]:
        all_terms.update(query.lower().split())

    print(f"🔍 Expanding {len(all_terms)} unique terms via UMLS...")

    # Expand once per term
    term_expansions = {}
    for term in all_terms:
        if len(term) > 2 and not term.isnumeric():
            try:
                term_expansions[term] = umls_expand_term(term, tgt, max_expansions)
            except Exception as e:
                print(f"⚠️ UMLS expansion failed for '{term}': {e}")
                term_expansions[term] = set()
            time.sleep(0.1)

    # Build expanded query DataFrame
    expanded_queries = []
    for _, row in queries_df.iterrows():
        qid = row['qid']
        query = row['query']
        terms = query.lower().split()
        expanded_terms = set(terms)
        for term in terms:
            expanded_terms.update(term_expansions.get(term, []))
        expanded_query = clean_query(" ".join(sorted(expanded_terms)))
        expanded_queries.append({'qid': qid, 'query': expanded_query})

    return pd.DataFrame(expanded_queries)

[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
Archive:  /usr/share/nltk_data/corpora/wordnet.zip
replace /usr/share/nltk_data/corpora/wordnet/lexnames? [y]es, [n]o, [A]ll, [N]one, [r]ename: 
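
All three expanders above follow the same per-term shape: look up candidate terms, drop generic ones, keep at most a few, and merge them into the query. The sketch below illustrates that shape offline, using a hypothetical in-memory lookup table (`TOY_LOOKUP`) instead of the SNOMED file, BioPortal or UMLS; `TOY_BLACKLIST`, `cap_expansions` and `expand_query` are likewise illustrative names. It sorts candidates before truncating, which is one way to keep the selection stable, since slicing a Python set of strings can return different elements from one interpreter session to the next.

```python
import re

# Hypothetical lookup table standing in for an ontology or web service.
TOY_LOOKUP = {
    "fever": {"pyrexia", "febrile", "feverish", "has"},
    "cough": {"coughing", "tussis"},
}
TOY_BLACKLIST = {"has", "disorder", "structure"}

def cap_expansions(candidates, limit):
    """Drop short or generic terms, then keep the first `limit` in sorted order."""
    kept = [t for t in candidates if len(t) > 3 and t not in TOY_BLACKLIST]
    return sorted(kept)[:limit]

def expand_query(query, limit=2):
    """Return the query terms plus at most `limit` expansions per term."""
    terms = query.lower().split()
    expanded = set(terms)  # always keep the original terms
    for term in terms:
        expanded.update(cap_expansions(TOY_LOOKUP.get(term, set()), limit))
    # Sanitise as the notebook does: strip punctuation, squeeze whitespace.
    text = re.sub(r"[^\w\s\-]", " ", " ".join(sorted(expanded)))
    return re.sub(r"\s+", " ", text).strip()

print(expand_query("fever cough"))
```

Because the cap is applied after sorting, `expand_query` returns the same string for the same input on every run, which matters when the expanded queries feed an evaluation that is meant to be repeatable.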

## Experiment

In [None]:
def run_expanded_experiments(index, queries, qrels, ontology=None, eval_metrics=eval_metrics):
    def evaluate(expanded_queries):
        # Evaluate DPH and BM25 on the expanded queries; an empty
        # "termpipelines" property disables Terrier's term pipeline
        retrievers = [
            pt.BatchRetrieve(index, wmodel=wmodel, metadata=["docno", "title", "abstract"], properties={"termpipelines": ""})
            for wmodel in ("DPH", "BM25")
        ]
        return pt.Experiment(retrievers, expanded_queries, qrels, eval_metrics)

    results = {}

    # UMLS expansion
    results["UMLS"] = evaluate(expand_queries_with_UMLS(queries))

    # BioPortal expansion
    results["BioPortal"] = evaluate(expand_queries_with_bioportal(queries))

    # Custom ontology expansion (only if provided)
    if ontology is not None:
        results["Ontology"] = evaluate(expand_queries_with_ontology(queries, ontology))

    return results

In [None]:
import random

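# Hold out 5 randomly chosen queries; the remaining ones form the test set
random.seed(42)  # arbitrary fixed seed (an assumption) so the same queries are held out on every run
trec_covid_queries = pd.DataFrame(trec_covid_dataset.get_topics('title'))
train_indices = random.sample(range(len(trec_covid_queries)), 5)
# sampled values are positions, so map them to index labels before dropping
trec_covid_queries_test = trec_covid_queries.drop(trec_covid_queries.index[train_indices])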

# Run all 3 expansion strategies
expanded_results_trec = run_expanded_experiments(
    index=trec_covid_index,
    queries=trec_covid_queries_test,
    qrels=trec_covid_dataset.get_qrels(),
    ontology=ontology  # pass this only if you want ontology-based expansion
)

expanded_results_nf = run_expanded_experiments(
    index=nf_index,
    queries=pd.DataFrame(nf_dataset_test.get_topics('title')),
    qrels=nf_dataset_test.get_qrels(),
    ontology=ontology
)

In [None]:
# DPH and BM25 results on TREC COVID with UMLS expansion
display(expanded_results_trec["UMLS"])

# DPH and BM25 results on the NF test queries with BioPortal expansion
display(expanded_results_nf["BioPortal"])

In [None]:
expanded_results_trec
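
`run_expanded_experiments` returns a dict that maps each expansion strategy to the DataFrame produced by `pt.Experiment`, so displaying the dict prints the tables one after another. One possible way to compare strategies side by side is to stack them with `pd.concat`. The sketch below uses invented placeholder numbers and hypothetical column names (`name`, `map`) purely to show the reshaping; they are not results from this project.

```python
import pandas as pd

# Placeholder frames shaped like an experiment table; the values are invented.
toy_results = {
    "UMLS": pd.DataFrame({"name": ["DPH", "BM25"], "map": [0.10, 0.20]}),
    "BioPortal": pd.DataFrame({"name": ["DPH", "BM25"], "map": [0.30, 0.40]}),
}

# The dict keys become an outer index level, named "expansion" here,
# which is then turned into an ordinary column.
combined = (
    pd.concat(toy_results, names=["expansion"])
    .reset_index(level="expansion")
    .reset_index(drop=True)
)
print(combined)
```

Under the same assumptions, `pd.concat(expanded_results_trec, names=["expansion"])` would give one table with a row per strategy and retrieval model.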