### Start Elasticsearch manually before running the notebook:
On Windows:
- Make sure you have at least JDK 17
- Open a terminal and execute this (or run it as a Windows service):
```bash
C:\path\to\elasticsearch-8.17.2\bin\elasticsearch.bat
```
- No Greek characters should be present in the path.
- Leave that terminal window open.

- If no password was autogenerated, run this from the Elasticsearch installation folder to generate one:
```bash
.\bin\elasticsearch-reset-password.bat -u elastic
```

In [29]:
%pip install -r "..\\requirements.txt"

Note: you may need to restart the kernel to use updated packages.


3210122 + 3210191 = 6420313
- So we get the `trec_covid` IR2025 collection.

In [30]:
%pip list

Package                   Version
------------------------- --------------
anyio                     4.9.0
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
asttokens                 3.0.0
async-lru                 2.0.5
attrs                     25.3.0
babel                     2.17.0
beautifulsoup4            4.13.3
bleach                    6.2.0
certifi                   2025.1.31
cffi                      1.17.1
chardet                   5.2.0
charset-normalizer        3.4.1
click                     8.1.8
colorama                  0.4.6
comm                      0.2.2
contourpy                 1.3.1
cycler                    0.12.1
debugpy                   1.8.13
decorator                 5.2.1
defusedxml                0.7.1
elastic-transport         8.17.1
elasticsearch             8.10.0
et_xmlfile                2.0.0
executing                 2.2.0
faiss-cpu                 1.10.0
fastjsonschema            2.21.1
fonttools    

> Load and Preprocess the Data

In [31]:
# import json
# import re
# import nltk
# from nltk.corpus import stopwords
# from nltk.stem import SnowballStemmer

# nltk.download('stopwords')

# stop_words = set(stopwords.words('english'))
# stemmer = SnowballStemmer("english")

# def preprocess(text):
#     # Lowercase
#     text = text.lower()
#     # Remove punctuation
#     text = re.sub(r"[^\w\s]", "", text)
#     # Tokenize
#     tokens = text.split()
#     # Remove stopwords and apply stemming
#     tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]
#     # Join back into string
#     return " ".join(tokens)

# def process_jsonl(input_path="..\\data\\trec-covid\\corpus.jsonl", output_path="..\\data\\corpus_processed.jsonl"):
#     with open(input_path, 'r', encoding='utf-8') as infile, open(output_path, 'w', encoding='utf-8') as outfile:
#         for line in infile:
#             obj = json.loads(line)
#             if "text" in obj:
#                 obj["text"] = preprocess(obj["text"])
#             json.dump(obj, outfile)
#             outfile.write("\n")

In [32]:
# # Verify preprocessing works
# example = "The quick brown foxes were jumping over the lazy dogs."
# print(preprocess(example))

In [33]:
# process_jsonl()

### Step 1: Load, Preprocess Data & Create Index

In [34]:
from dotenv import load_dotenv
import os

# Load the .env file from the secrets folder one level above the notebook
load_dotenv("..\\secrets\\secrets.env")

# Access environment variables
es_host = os.getenv("ES_HOST")
es_user = os.getenv("ES_USERNAME")
es_pass = os.getenv("ES_PASSWORD")
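
If one of the three variables is missing from the env file, `os.getenv` returns `None` without complaint and the problem only shows up later, when the connection is attempted. A minimal sketch of an early check follows; the helper name `missing_settings` is hypothetical and not part of the notebook, and the variable names are the ones read in the cell above.

```python
import os

def missing_settings(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Hypothetical check: report which connection settings are not available
required = ["ES_HOST", "ES_USERNAME", "ES_PASSWORD"]
missing = missing_settings(required)
if missing:
    print(f"Missing settings: {', '.join(missing)}")
```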

- Connect to Elasticsearch

In [35]:
from elasticsearch import Elasticsearch

es = Elasticsearch(es_host, basic_auth=(es_user, es_pass))

if es.ping():
    print("✅ Connected to ElasticSearch")
else:
    print("❌ Connection failed")

✅ Connected to ElasticSearch


- Create Index

In [36]:
INDEX_NAME = "ir2025-index"

# Delete the index if it already exists
if es.indices.exists(index=INDEX_NAME):
    es.indices.delete(index=INDEX_NAME)
    print(f"✅ Index '{INDEX_NAME}' deleted.")

# Define the settings and mappings for the index
settings = {
    "analysis": {
        "filter": {
            "english_stop": {
                "type": "stop",
                "stopwords": "_english_"
            }
        },
        "analyzer": {
            "custom_english": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase",
                    "english_stop"
                ]
            }
        }
    }
}

mappings = {
    "properties": {
        "doc_id": {"type": "keyword"},
        "text": {
            "type": "text",
            "analyzer": "custom_english",
            "similarity": "BM25"
        }
    }
}

# Create the index with the specified settings and mappings
es.indices.create(
    index=INDEX_NAME,
    settings=settings,
    mappings=mappings
)
print(f"✅ Index '{INDEX_NAME}' created")

✅ Index 'ir2025-index' deleted.
✅ Index 'ir2025-index' created
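
The mapping above asks Elasticsearch to score the `text` field with BM25. As a rough illustration of what that similarity rewards, here is a minimal sketch of a textbook BM25 variant, assuming the commonly used defaults `k1 = 1.2` and `b = 0.75`. The function `bm25_scores` and the toy documents are made up for this example. Real Elasticsearch scores also depend on the analyzer, on per-shard statistics, and on implementation details such as constant factors and length encoding, so the numbers are not expected to match.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with a textbook BM25 variant."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length
            length_norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Toy, already-tokenized documents (not taken from the collection)
toy_docs = [
    "coronavirus origin bats".split(),
    "weather affects coronavirus transmission".split(),
    "influenza vaccine trial".split(),
]
toy_scores = bm25_scores(["coronavirus", "origin"], toy_docs)
print(toy_scores)
```

The first document matches both query terms and is the shortest match, so it scores highest; the third matches nothing and scores zero.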


### Step 2: Populate Index

In [37]:
import json
from elasticsearch.helpers import streaming_bulk
from tqdm import tqdm

# Generator function to yield documents
def generate_documents(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            doc = json.loads(line)
            yield {
                "_index": INDEX_NAME,
                "_id": doc["_id"],
                "_source": {
                    "doc_id": doc["_id"],
                    "text": doc["text"]
                }
            }

# Path to your JSONL file
file_path = "../data/trec-covid/corpus.jsonl"

# Count the total number of documents for the progress bar
with open(file_path, 'r', encoding='utf-8') as f:
    total_docs = sum(1 for _ in f)

# Initialize the progress bar
progress = tqdm(unit="docs", total=total_docs)

successes = 0
for ok, action in streaming_bulk(client=es, actions=generate_documents(file_path), chunk_size=500):
    progress.update(1)
    successes += int(ok)

progress.close()
print(f"✅ Indexed {successes}/{total_docs} documents into '{INDEX_NAME}'")

100%|██████████| 171332/171332 [00:26<00:00, 6375.04docs/s]

✅ Indexed 171332/171332 documents into 'ir2025-index'





### Step 3: Execute Queries

In [None]:
import json
import os  # needed for os.makedirs and os.path.join below
from tqdm import tqdm

def process_queries(queries_path, folder):
    # Load queries
    with open(queries_path, 'r', encoding='utf-8') as f:
        queries = [json.loads(line) for line in f]
        
    INDEX_NAME = "ir2025-index"
    k_values = [20, 30, 50] # Number of top documents to retrieve
        
    runs = {f"run_{k}": {} for k in k_values}
    for k in k_values:
        # Prepare output directory
        output_dir = f"../results/{folder}"
        os.makedirs(output_dir, exist_ok=True)
        for query in tqdm(queries, desc=f"Processing Queries for run with k = {k}"):
            qid = query["_id"]
            query_text = query["text"]
            response = es.search(
                index=INDEX_NAME,
                query={"match": {"text": query_text}},
                size=k
            )
            runs[f"run_{k}"][qid] = {hit["_id"]: hit["_score"] for hit in response["hits"]["hits"]}
                
        with open(os.path.join(output_dir, f'retrieval_top_{k}.json'), 'w', encoding='utf-8') as f:
            json.dump(runs[f"run_{k}"], f, ensure_ascii=False, indent=4)
            print(f"✅ Results saved to: ../results/{folder}/retrieval_top_{k}.json")
    
    return runs
    
runs = process_queries("../data/trec-covid/queries.jsonl", folder='phase_1')

Processing Queries for run with k = 20:   0%|          | 0/50 [00:00<?, ?it/s]

Processing Queries for run with k = 20: 100%|██████████| 50/50 [00:00<00:00, 112.16it/s]


✅ Results saved to: ../results/phase_1/retrieval_top_20.json


Processing Queries for run with k = 30: 100%|██████████| 50/50 [00:00<00:00, 132.90it/s]


✅ Results saved to: ../results/phase_1/retrieval_top_30.json


Processing Queries for run with k = 50: 100%|██████████| 50/50 [00:00<00:00, 112.45it/s]

✅ Results saved to: ../results/phase_1/retrieval_top_50.json
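
Each run maps a query id to a `{doc_id: score}` dictionary. Since the three runs differ only in `size`, the top 20 hits would normally be a prefix of the top 50, assuming the ranking is stable between requests. Under that assumption, a smaller run could be derived from a larger one without querying again. The helper `truncate_run` below is a hypothetical illustration of that idea, not what the notebook does:

```python
def truncate_run(run, k):
    """Keep only the k highest-scoring documents for every query in a run."""
    truncated = {}
    for qid, doc_scores in run.items():
        top = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)[:k]
        truncated[qid] = dict(top)
    return truncated

# Toy run with made-up ids and scores
toy_run = {
    "1": {"d1": 9.1, "d2": 7.4, "d3": 3.3},
    "2": {"d4": 5.0, "d5": 6.2},
}
top_2 = truncate_run(toy_run, 2)
print(top_2)
```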





### Step 4: Query Evaluation

In [52]:
import csv

def load_qrels(qrels_path="../data/trec-covid/qrels/test.tsv"):
    qrels = {}
    with open(qrels_path, 'r', encoding='utf-8') as f:
        reader = csv.DictReader(f, delimiter='\t')
        for row in reader:
            qid = row['query-id']
            docid = row['corpus-id']
            relevance = int(row['score'])
            qrels.setdefault(qid, {})[docid] = relevance
    return qrels

qrels = load_qrels()
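
`load_qrels` relies on the TSV having a header row with `query-id`, `corpus-id`, and `score` columns. The sketch below runs the same parsing logic on an in-memory sample to show the nested dictionary it produces. The rows are made up and the helper name `parse_qrels` is hypothetical.

```python
import csv
import io

def parse_qrels(handle):
    """Parse tab-separated relevance judgments into {query_id: {doc_id: relevance}}."""
    qrels = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels

# Made-up judgments in the same column layout as test.tsv
sample = "query-id\tcorpus-id\tscore\n1\tdocA\t2\n1\tdocB\t0\n2\tdocC\t1\n"
sample_qrels = parse_qrels(io.StringIO(sample))
print(sample_qrels)
# → {'1': {'docA': 2, 'docB': 0}, '2': {'docC': 1}}
```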

In [40]:
import pytrec_eval
import json
import os

def compute_metrics(qrels, runs, folder):    
    evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'map', 'P.5', 'P.10', 'P.15', 'P.20'})
    for run_name, run in runs.items():
        k = run_name.split("_")[1]
        print(f"Computing metrics for run with k = {k}")
        results = evaluator.evaluate(run)
    
        # Compute average metrics
        metrics = ['map', 'P_5', 'P_10', 'P_15', 'P_20']
        avg_scores = {metric: 0.0 for metric in metrics}
        num_queries = len(results)
        
        for res in results.values():
            for metric in metrics:
                avg_scores[metric] += res.get(metric, 0.0)
        
        for metric in metrics:
            avg_scores[metric] /= num_queries
        
        # Prepare output directory
        output_dir = os.path.join("../results", folder)
        os.makedirs(output_dir, exist_ok=True)
        
        # Save per-query metrics
        per_query_path = os.path.join(output_dir, f"per_query_metrics_top_{k}.json")
        with open(per_query_path, "w", encoding="utf-8") as f:
            json.dump(results, f, indent=4)
        
        # Save average metrics
        avg_metrics_path = os.path.join(output_dir, f"average_metrics_top_{k}.json")
        with open(avg_metrics_path, "w", encoding="utf-8") as f:
            json.dump(avg_scores, f, indent=4)
        
        print(f"✅ Per-query metrics saved to: {per_query_path}")
        print(f"✅ Average metrics saved to: {avg_metrics_path}\n")
    
compute_metrics(qrels, runs, 'phase_1')

Computing metrics for run with k = 20
✅ Per-query metrics saved to: ../results\phase_1\per_query_metrics_top_20.json
✅ Average metrics saved to: ../results\phase_1\average_metrics_top_20.json

Computing metrics for run with k = 30
✅ Per-query metrics saved to: ../results\phase_1\per_query_metrics_top_30.json
✅ Average metrics saved to: ../results\phase_1\average_metrics_top_30.json

Computing metrics for run with k = 50
✅ Per-query metrics saved to: ../results\phase_1\per_query_metrics_top_50.json
✅ Average metrics saved to: ../results\phase_1\average_metrics_top_50.json
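
For reference, the two kinds of measures requested above can be written out by hand. The sketch below computes precision at k and average precision for a single query, assuming that any judgment greater than zero counts as relevant. The function names and toy data are illustrative only; `pytrec_eval` remains the source of the saved numbers.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k positions holding relevant documents (always divided by k)."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k

def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at each rank where a relevant document appears."""
    if not relevant_ids:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero
    return total / len(relevant_ids)

# Toy run and judgments for one query (made-up ids and scores)
toy_run = {"d1": 3.2, "d2": 2.9, "d3": 1.5, "d4": 0.7}
toy_qrels = {"d1": 2, "d3": 1, "d9": 1, "d2": 0}

ranked = sorted(toy_run, key=toy_run.get, reverse=True)
relevant = {doc_id for doc_id, rel in toy_qrels.items() if rel > 0}

p_at_2 = precision_at_k(ranked, relevant, 2)
ap = average_precision(ranked, relevant)
print(p_at_2, round(ap, 4))
```

MAP is the mean of the per-query average precision, which is what the averaging loop in `compute_metrics` produces for the `map` key.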



In [41]:
import json
import nltk
from nltk.corpus import wordnet as wn
from tqdm import tqdm
import pandas as pd

nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mitsa\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\mitsa\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [42]:
import jsonlines
import pandas as pd

input_dir = '../data/trec-covid/'

with jsonlines.open(input_dir + 'corpus.jsonl') as reader:
    corpus = [obj for obj in reader]

with jsonlines.open(input_dir + 'queries.jsonl') as reader:
    queries = [obj for obj in reader]

test_df = pd.read_csv(input_dir + 'qrels/' + 'test.tsv', sep='\t')

In [43]:
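def get_wordnet_synonyms(word):
    """Return WordNet synonyms of `word`, drawn from its noun and adjective synsets."""
    synonyms = set()
    for syn in wn.synsets(word):
        # Keep nouns ('n') and head adjectives ('a'); other parts of speech,
        # including satellite adjectives ('s'), are skipped
        if syn.pos() in ('n', 'a'):
            for lemma in syn.lemmas():
                # Multi-word lemmas are joined with underscores in WordNet
                synonym = lemma.name().replace("_", " ").lower()
                if synonym != word:
                    synonyms.add(synonym)
    return list(synonyms)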

In [44]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

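def expand_query_with_synonyms(query_text):
    """Append the WordNet synonyms of every non-stopword token to the original query."""
    expanded_terms = []
    for word in word_tokenize(query_text.lower()):
        # Skip punctuation, numbers and stopwords
        if word.isalpha() and word not in stop_words:
            synonyms = get_wordnet_synonyms(word)
            expanded_terms.extend(synonyms)
    # The original query text is kept unchanged at the front
    return query_text + " " + " ".join(expanded_terms)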

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mitsa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mitsa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
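
To make the expansion step concrete without downloading WordNet, the sketch below applies the same logic to a hand-written synonym table. The table, the stopword set, and the function name `expand_with_table` are stand-ins invented for this example, and whitespace splitting replaces `word_tokenize`.

```python
# Made-up synonym table standing in for WordNet lookups
TOY_SYNONYMS = {
    "origin": ["source", "beginning"],
    "coronavirus": [],
}
TOY_STOPWORDS = {"what", "is", "the", "of"}

def expand_with_table(query_text, synonyms=TOY_SYNONYMS, stop_words=TOY_STOPWORDS):
    """Append table synonyms of each non-stopword token to the original query."""
    expanded_terms = []
    for word in query_text.lower().split():
        if word.isalpha() and word not in stop_words:
            expanded_terms.extend(synonyms.get(word, []))
    # The original query is kept intact; strip() avoids a trailing space
    # when no synonyms were found
    return (query_text + " " + " ".join(expanded_terms)).strip()

expanded = expand_with_table("what is the origin of coronavirus")
print(expanded)
# → what is the origin of coronavirus source beginning
```

Because synonyms from every matching sense of a word are appended, an expanded query can drift toward unrelated meanings, which is worth keeping in mind when comparing the expanded runs with the baseline.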


In [45]:
expanded_queries = []
for query in tqdm(queries):
    new_query = query.copy()
    new_query["expanded_text"] = expand_query_with_synonyms(query["text"])
    expanded_queries.append(new_query)

100%|██████████| 50/50 [00:03<00:00, 13.83it/s]


In [49]:
import jsonlines

with jsonlines.open("../data/trec-covid/queries_expanded_wordnet.jsonl", mode='w') as writer:
    for q in expanded_queries:
        writer.write({
            "_id": q["_id"],
            "text": q["expanded_text"]
        })
    print("✅ Expanded queries saved to queries_expanded_wordnet.jsonl")

✅ Expanded queries saved to queries_expanded_wordnet.jsonl


In [50]:
runs = process_queries("../data/trec-covid/queries_expanded_wordnet.jsonl", folder='phase_2')

Processing Queries for run with k = 20:   0%|          | 0/50 [00:00<?, ?it/s]

Processing Queries for run with k = 20: 100%|██████████| 50/50 [00:01<00:00, 43.22it/s]


✅ Results saved to: ../results/phase_2/retrieval_top_20.json


Processing Queries for run with k = 30: 100%|██████████| 50/50 [00:00<00:00, 57.91it/s]


✅ Results saved to: ../results/phase_2/retrieval_top_30.json


Processing Queries for run with k = 50: 100%|██████████| 50/50 [00:01<00:00, 48.26it/s]

✅ Results saved to: ../results/phase_2/retrieval_top_50.json





In [53]:
qrels = load_qrels()

In [54]:
compute_metrics(qrels, runs, 'phase_2')

Computing metrics for run with k = 20
✅ Per-query metrics saved to: ../results\phase_2\per_query_metrics_top_20.json
✅ Average metrics saved to: ../results\phase_2\average_metrics_top_20.json

Computing metrics for run with k = 30
✅ Per-query metrics saved to: ../results\phase_2\per_query_metrics_top_30.json
✅ Average metrics saved to: ../results\phase_2\average_metrics_top_30.json

Computing metrics for run with k = 50
✅ Per-query metrics saved to: ../results\phase_2\per_query_metrics_top_50.json
✅ Average metrics saved to: ../results\phase_2\average_metrics_top_50.json
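
Both phases write their averages to `average_metrics_top_{k}.json`, so the baseline and the expanded queries can be compared side by side. A minimal sketch, assuming the files exist under `../results/phase_1` and `../results/phase_2` as printed above; the helper names are hypothetical and no results are reproduced here.

```python
import json
import os

def load_average_metrics(results_dir, folder, k):
    """Read the averaged metrics that compute_metrics saved for one phase and one k."""
    path = os.path.join(results_dir, folder, f"average_metrics_top_{k}.json")
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def metric_deltas(baseline, candidate):
    """Per-metric difference (candidate - baseline) for metrics present in both."""
    return {m: candidate[m] - baseline[m] for m in baseline if m in candidate}

def compare_phases(results_dir="../results", k_values=(20, 30, 50)):
    """Print baseline, expanded value and difference for every metric and k."""
    for k in k_values:
        base = load_average_metrics(results_dir, "phase_1", k)
        cand = load_average_metrics(results_dir, "phase_2", k)
        for metric, delta in metric_deltas(base, cand).items():
            print(f"k={k} {metric}: {base[metric]:.4f} -> {cand[metric]:.4f} ({delta:+.4f})")
```

Calling `compare_phases()` after both phases have run prints one line per metric and per value of k.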

