# Information Retrieval - Phase 3: Query Expansion with Word2Vec Synonyms on the IR2025 Collection


---
> Maria Schoinaki, BSc Student <br />
> Department of Informatics, Athens University of Economics and Business <br />
> p3210191@aueb.gr <br/><br/>

> Nikos Mitsakis, BSc Student <br />
> Department of Informatics, Athens University of Economics and Business <br />
> p3210122@aueb.gr <br/><br/>

### Start ElasticSearch manually before running the notebook:
On Windows:
- Make sure you have at least JDK 17
- Open a terminal and execute this (or run it as a Windows service):
```bash
C:\path\to\elasticsearch-8.17.2\bin\elasticsearch.bat
```
- No Greek characters should be present in the path.
- Leave that terminal window open.

- If no password was autogenerated, run this to generate one:
```bash
.\bin\elasticsearch-reset-password.bat -u elastic
```

In [14]:
%pip install -qq -r "..\\requirements.txt" 
# fix path accordingly

Note: you may need to restart the kernel to use updated packages.


In [15]:
from collections import Counter
import jsonlines
import json
import csv
import pandas as pd
from tqdm import tqdm
import pytrec_eval
from IPython.display import display

In [16]:
from dotenv import load_dotenv
import os

# Load the .env file from the secrets directory one level up
load_dotenv("..\\secrets\\secrets.env")

# Access environment variables
es_host = os.getenv("ES_HOST")
es_user = os.getenv("ES_USERNAME")
es_pass = os.getenv("ES_PASSWORD")

- Connect to ElasticSearch

In [17]:
from elasticsearch import Elasticsearch

es = Elasticsearch(es_host, basic_auth=(es_user, es_pass), request_timeout=30, retry_on_timeout=True, max_retries=10)

if es.ping():
    print("✅ Connected to ElasticSearch")
else:
    print("❌ Connection failed")

✅ Connected to ElasticSearch


- Create the index (an existing index with the same name is deleted first)

In [18]:
INDEX_NAME = "ir2025-index"

# Delete the index if it already exists
if es.indices.exists(index=INDEX_NAME):
    es.indices.delete(index=INDEX_NAME)
    print(f"✅ Index '{INDEX_NAME}' deleted")

# Define the settings and mappings for the index
settings = {
    "analysis": {
        "filter": {
            "english_stop": {
                "type": "stop",
                "stopwords": "_english_"
            },
            "english_stemmer": {
                "type": "kstem"
            }
        },
        "analyzer": {
            "custom_english": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase", # Converts all terms to lowercase
                    "english_stop", # Removes English stop words
                    "english_stemmer" # Reduces words to their root form using kstem
                ]
            }
        }
    }
}

mappings = {
    "properties": {
        "doc_id": {"type": "keyword"},
        "text": {
            "type": "text",
            "analyzer": "custom_english",
            "similarity": "BM25"
        }
    }
}

# Create the index with the specified settings and mappings
es.indices.create(
    index=INDEX_NAME,
    settings=settings,
    mappings=mappings
)
print(f"✅ Index '{INDEX_NAME}' created")

✅ Index 'ir2025-index' deleted
✅ Index 'ir2025-index' created


## 3. Document Ingestion  
Using the `streaming_bulk` helper, we ingest all IR2025 documents in chunks of 500.  
A progress bar (tqdm) provides real‐time feedback on indexing throughput.

In [19]:
from elasticsearch.helpers import streaming_bulk

# Generator function to yield documents
def generate_documents(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            doc = json.loads(line)
            yield {
                "_index": INDEX_NAME,
                "_id": doc["_id"],
                "_source": {
                    "doc_id": doc["_id"],
                    "text": doc["text"]
                }
            }

# Count the total number of documents for the progress bar
with open("../data/trec-covid/corpus.jsonl", 'r', encoding='utf-8') as f:
    total_docs = sum(1 for _ in f)

# Initialize the progress bar
progress = tqdm(unit="docs", total=total_docs)

successes = 0
for ok, action in streaming_bulk(client=es, actions=generate_documents("../data/trec-covid/corpus.jsonl"), chunk_size=500):
    progress.update(1)
    successes += int(ok)

progress.close()
print(f"✅ Indexed {successes}/{total_docs} documents into '{INDEX_NAME}'")

100%|██████████| 171332/171332 [00:44<00:00, 3877.28docs/s]

✅ Indexed 171332/171332 documents into 'ir2025-index'
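
To make the chunking concrete, here is a minimal standard-library sketch of how a stream of actions splits into batches of 500. The `chunked` helper is hypothetical and only mimics the batching idea; it is not a description of how `streaming_bulk` works internally.

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

# 171,332 documents, as reported by the indexing run above
batch_sizes = [len(batch) for batch in chunked(range(171_332), 500)]
print(len(batch_sizes), batch_sizes[-1])
```

Under this assumption the corpus corresponds to 343 batches, the last one holding the remaining 332 documents.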





## 4. NLTK Setup and Corpus Preprocessing

Download required NLTK data, load the IR2025 corpus into memory, and define a Python function to simulate the `custom_english` analyzer for downstream TF–IDF modeling.

We download the necessary NLTK corpora and models.

- **Tokenization & POS Tagging**: `punkt`, `averaged_perceptron_tagger`  
- **Stopword List**: `stopwords`  
- **Lexical Database**: `wordnet` and the multilingual WordNet (`omw-1.4`)

In [20]:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\mitsa\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mitsa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mitsa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mitsa\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\mitsa\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

We load the IR2025 corpus into memory as a list of JSON objects using `jsonlines.open`, preparing it for TF–IDF vectorization and analysis.

In [21]:
with jsonlines.open('../data/trec-covid/corpus.jsonl') as reader:
    corpus = [obj for obj in reader]

### Simulate `custom_english` Analyzer in Python

Here we replicate our Elasticsearch `custom_english` analyzer pipeline in pure Python using NLTK:

1. **Lowercase & Trim**  
2. **Punctuation Removal**  
3. **Tokenization** (`word_tokenize`)  
4. **Stopword Filtering** (`stopwords.words('english')`)  
5. **Stemming** (PorterStemmer as a proxy for Krovetz)  

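This function lets us preprocess text in approximately the same way as the index analyzer before building the TF–IDF model. The match is not exact: Porter stemming is more aggressive than kstem, and NLTK's stopword list differs from Elasticsearch's `_english_` set.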

In [22]:
# Simulate custom_english Analyzer 
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer # KrovetzStemmer supports up to python 3.10 at best 
import string

# Initialize NLTK components
stop_words = set(stopwords.words('english'))

stemmer = PorterStemmer() # It's "closer" to Krovetz than Snowball is

def es_like_preprocess(text):
    # Lowercase the text
    text = text.lower().strip()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords and apply stemming (Porter)
    processed_tokens = [stemmer.stem(token) for token in tokens if token not in stop_words or not token.isalpha()]
    # Join tokens back into a single string
    return ' '.join(processed_tokens)
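
One consequence of deleting punctuation before tokenizing is that hyphenated identifiers collapse into single tokens, which is why `covid19` and `sarscov2` appear among the most frequent terms in the preprocessing output. Elasticsearch's `standard` tokenizer likely splits such terms at the hyphen instead, so this is one place where the simulation and the index can diverge. A minimal standard-library sketch of just this step (the `strip_punctuation` name is ours, for illustration):

```python
import string

def strip_punctuation(text):
    """Lowercase, trim, and delete every ASCII punctuation character."""
    table = str.maketrans('', '', string.punctuation)
    return text.lower().strip().translate(table)

print(strip_punctuation("COVID-19 and SARS-CoV-2 (2020)."))
```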

## 5. TF–IDF Model Construction and Persistence

In this section, we build a TF–IDF model over the preprocessed IR2025 corpus, compute key statistics, and save all artifacts for later use:

- **Preprocessing Statistics**: total tokens, unique tokens, average tokens per document  
- **TF–IDF Statistics**: vocabulary size, average/min/max IDF  
- **Model Artifacts**:  
  - `tfidf_vectorizer.joblib` (fitted `TfidfVectorizer`)  
  - `idf_scores.json` (term → IDF)  
  - `tfidf_statistics.json` (all collected statistics)

We implement three functions:

1. `build_and_save_tfidf_model(corpus, output_dir)`  
   - Preprocesses each document, updates statistics  
   - Fits `TfidfVectorizer`, extracts IDF scores  
   - Saves model and statistics to disk  

2. `load_tfidf_model(output_dir)`  
   - Loads the saved vectorizer, IDF scores, and statistics  
   - Prints summary for validation  

3. `transform_text(text, vectorizer)`  
   - Applies the loaded vectorizer to new text, returning its TF–IDF representation  



In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
from tqdm import tqdm
import numpy as np
import joblib
import json
import os

def build_and_save_tfidf_model(corpus, output_dir="../models"):
    """
    Build TF-IDF model from corpus, compute statistics, and save the model.
    
    Args:
        corpus: List of documents with 'text' field
        output_dir: Directory to save models and statistics
    
    Returns:
        tuple: (vectorizer, idf_scores); the statistics are only written to disk
    """
    # Create output directory
    os.makedirs(output_dir, exist_ok=True)
    
    # Initialize counters for statistics
    all_tokens = []
    total_tokens = 0
    unique_tokens = set()
    statistics = {}

    # Preprocess with detailed statistics
    print("Preprocessing corpus...")
    preprocessed_corpus = []
    for doc in tqdm(corpus, desc="Preprocessing documents", unit="doc"):
        processed_text = es_like_preprocess(doc["text"])
        tokens = processed_text.split()
        all_tokens.extend(tokens)

        # Update statistics
        total_tokens += len(tokens)
        unique_tokens.update(tokens)
        
        preprocessed_corpus.append(processed_text)
    
    
    word_counts = Counter(all_tokens)
    global top_50_words
    top_50_words = [w for w, _ in word_counts.most_common(50)]
    print("Top 50 words:", top_50_words)

    # Save preprocessing statistics
    statistics['preprocessing'] = {
        'total_tokens': total_tokens,
        'unique_tokens': len(unique_tokens),
        'average_tokens_per_doc': total_tokens/len(corpus)
    }

    print(f"\nPreprocessing statistics:")
    print(f"- Total tokens: {total_tokens:,}")
    print(f"- Unique tokens: {len(unique_tokens):,}")
    print(f"- Average tokens per document: {total_tokens/len(corpus):,.1f}")

    # Build TF-IDF model with detailed progress
    print("\nBuilding TF-IDF model...")
    tfidf_vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')

    with tqdm(total=3, desc="TF-IDF computation") as pbar:
        # Fit the vectorizer
        tfidf_vectorizer.fit(preprocessed_corpus)
        pbar.update(1)
        
        # Get feature names
        feature_names = tfidf_vectorizer.get_feature_names_out()
        pbar.update(1)
        
        # Calculate IDF scores
        idf_scores = dict(zip(feature_names, tfidf_vectorizer.idf_))
        pbar.update(1)

    # Calculate and save TF-IDF statistics
    idf_values = list(idf_scores.values())
    statistics['tfidf'] = {
        'vocabulary_size': len(idf_scores),
        'average_idf': float(np.mean(idf_values)),
        'max_idf': float(max(idf_values)),
        'min_idf': float(min(idf_values))
    }

    print("\nTF-IDF statistics:")
    print(f"- Vocabulary size: {len(idf_scores):,} terms")
    print(f"- Average IDF: {statistics['tfidf']['average_idf']:.2f}")
    print(f"- Max IDF: {statistics['tfidf']['max_idf']:.2f}")
    print(f"- Min IDF: {statistics['tfidf']['min_idf']:.2f}")

    # Save everything
    print("\nSaving models and statistics...")
    try:
        # Save vectorizer
        joblib.dump(tfidf_vectorizer, os.path.join(output_dir, 'tfidf_vectorizer.joblib'))
        
        # Save IDF scores
        with open(os.path.join(output_dir, 'idf_scores.json'), 'w', encoding='utf-8') as f:
            json.dump(idf_scores, f, ensure_ascii=False, indent=2)
            
        # Save statistics
        with open(os.path.join(output_dir, 'tfidf_statistics.json'), 'w', encoding='utf-8') as f:
            json.dump(statistics, f, ensure_ascii=False, indent=2)
            
        print("\n✅ Saved successfully:")
        print(f"- Vectorizer: {os.path.join(output_dir, 'tfidf_vectorizer.joblib')}")
        print(f"- IDF scores: {os.path.join(output_dir, 'idf_scores.json')}")
        print(f"- Statistics: {os.path.join(output_dir, 'tfidf_statistics.json')}")
        
    except Exception as e:
        print(f"\n❌ Error saving files: {e}")
        
    return tfidf_vectorizer, idf_scores

# Function to load the saved model
def load_tfidf_model(output_dir="../models"):
    """
    Load the saved TF-IDF model, IDF scores, and statistics.
    
    Args:
        output_dir: Directory where models and statistics are saved
        
    Returns:
        tuple: (vectorizer, idf_scores), or (None, None) if loading fails
    """
    try:
        # Load vectorizer
        vectorizer = joblib.load(os.path.join(output_dir, 'tfidf_vectorizer.joblib'))
        
        # Load IDF scores
        with open(os.path.join(output_dir, 'idf_scores.json'), 'r', encoding='utf-8') as f:
            idf_scores = json.load(f)
            
        # Load statistics
        with open(os.path.join(output_dir, 'tfidf_statistics.json'), 'r', encoding='utf-8') as f:
            statistics = json.load(f)
        
        print("\nModel validation:")
        print(f"- Vocabulary size: {len(idf_scores):,}")
        print(f"- Average IDF: {statistics['tfidf']['average_idf']:.2f}")
        print(f"- Max IDF: {statistics['tfidf']['max_idf']:.2f}")
        print(f"- Min IDF: {statistics['tfidf']['min_idf']:.2f}")
 
        print("✅ Model loaded successfully")
        return vectorizer, idf_scores
        
    except Exception as e:
        print(f"❌ Error loading model: {e}")
        return None, None
        
# Transform new text using the loaded vectorizer
def transform_text(text, vectorizer):
    """Transform new text using the loaded vectorizer"""
    try:
        transformed = vectorizer.transform([text])
        return transformed
    except Exception as e:
        print(f"❌ Error transforming text: {e}")
        return None

## 6. Load or Build TF–IDF Model  

We first build the TF–IDF model from the corpus and save the artifacts, then reload them from disk as a check. On later runs the build cell can be skipped: the loading cell rebuilds the model only if the saved artifacts cannot be loaded.

In [44]:
vectorizer, idf_scores = build_and_save_tfidf_model(corpus)

Preprocessing corpus...


Preprocessing documents: 100%|██████████| 171332/171332 [06:29<00:00, 439.48doc/s]


Top 50 words: ['patient', 'use', 'infect', 'covid19', 'studi', 'diseas', 'result', 'viru', 'cell', 'case', 'sever', 'clinic', 'method', 'health', 'respiratori', 'effect', 'group', 'includ', 'data', 'differ', 'develop', 'treatment', 'protein', 'viral', 'may', 'coronaviru', 'increas', 'model', 'associ', 'test', 'report', 'risk', 'time', 'compar', 'pandem', 'conclus', 'also', 'human', 'control', 'respons', 'system', 'activ', 'p', 'detect', 'show', 'rate', 'sarscov2', 'provid', 'present', 'identifi']

Preprocessing statistics:
- Total tokens: 16,535,640
- Unique tokens: 328,788
- Average tokens per document: 96.5

Building TF-IDF model...


TF-IDF computation: 100%|██████████| 3/3 [00:14<00:00,  4.98s/it]



TF-IDF statistics:
- Vocabulary size: 298,422 terms
- Average IDF: 11.84
- Max IDF: 12.36
- Min IDF: 2.03

Saving models and statistics...

✅ Saved successfully:
- Vectorizer: ../models\tfidf_vectorizer.joblib
- IDF scores: ../models\idf_scores.json
- Statistics: ../models\tfidf_statistics.json


In [45]:
# Load the saved artifacts; rebuild only if loading failed
vectorizer, idf_scores = load_tfidf_model()
if vectorizer is None or idf_scores is None:
    # Build and save the model
    vectorizer, idf_scores = build_and_save_tfidf_model(corpus)


Model validation:
- Vocabulary size: 298,422
- Average IDF: 11.84
- Max IDF: 12.36
- Min IDF: 2.03
✅ Model loaded successfully
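
The maximum IDF above can be checked by hand. Assuming `TfidfVectorizer` was left at its default `smooth_idf=True`, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, so a term that occurs in exactly one of the 171,332 documents receives the largest possible value. A small sketch (the `smoothed_idf` helper is ours):

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Inverse document frequency with scikit-learn's default smoothing."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

N_DOCS = 171_332  # corpus size reported by the indexing step
max_idf = smoothed_idf(N_DOCS, 1)
print(f"{max_idf:.2f}")
```

This agrees with the reported maximum of 12.36. An average IDF of 11.84, so close to that ceiling, suggests that most of the 298,422 vocabulary terms occur in only a handful of documents.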


In [30]:
import pickle
from nltk.tokenize import sent_tokenize  # needed by the preprocessing branch below
processed_sentences_path = "../data/processed_sentences.pkl"

# Try to load processed_sentences from pickle
if os.path.exists(processed_sentences_path):
    with open(processed_sentences_path, "rb") as f:
        processed_sentences = pickle.load(f)
    print(f"✅ Loaded {len(processed_sentences)} tokenized sentences from pickle.")
else:
    # Process the corpus
    processed_sentences = []
    print("⏳ Preprocessing corpus into sentences...")
    for doc in tqdm(corpus, unit="doc"):
        doc_text = doc["text"]
        sentences = sent_tokenize(doc_text)
        for sentence in sentences:
            tokens = es_like_preprocess(sentence).split()
            if tokens:
                processed_sentences.append(tokens)

    # Save to pickle
    os.makedirs(os.path.dirname(processed_sentences_path), exist_ok=True)
    with open(processed_sentences_path, "wb") as f:
        pickle.dump(processed_sentences, f)
    print(f"✅ Saved {len(processed_sentences)} tokenized sentences to pickle.")

✅ Loaded 1126604 tokenized sentences from pickle.


## 7. Train a Word2Vec Model

### 🔧 Word2Vec Hyperparameter Summary

| Parameter     | Chosen Value  | Purpose                             | Pros                                        | Cons                                               |
|---------------|---------------|-------------------------------------|---------------------------------------------|----------------------------------------------------|
| `vector_size` | 200           | Dimensionality of word embeddings   | Captures semantic nuances                   | Higher values increase computational cost          |
| `window`      | 5             | Context window size                 | Balances syntactic and semantic information | Too large may introduce noise                      |
| `min_count`   | 5             | Minimum frequency threshold         | Removes rare noise words                    | May exclude rare but important terms               |
| `sg`          | 1 (Skip-Gram) | Training algorithm                  | Better for rare words                       | Slower training                                    |
| `epochs`      | 20            | Number of training iterations       | Improves convergence                        | Risk of overfitting with too many epochs           |
| `negative`    | 5             | Number of negative samples          | Enhances embedding quality                  | Too many can slow training                         |
| `sample`      | 1e-4          | Subsampling of frequent words       | Reduces dominance of frequent/common words  | May remove useful frequent words if too aggressive |
| `workers`     | 6             | Number of parallel training threads | Speeds up training                          | Overhead with too many threads                     |
| `seed`        | 42            | Random seed for reproducibility     | Ensures consistent results                  | None                                               |


In [31]:
def load_qrels(qrels_path="../data/trec-covid/qrels/test.tsv"):
    qrels = {}
    with open(qrels_path, 'r', encoding='utf-8') as f:
        reader = csv.DictReader(f, delimiter='\t')
        for row in reader:
            qid = row['query-id']
            docid = row['corpus-id']
            relevance = int(row['score'])
            qrels.setdefault(qid, {})[docid] = relevance

    relevant_counts = Counter()
    for qid, docs in qrels.items():
        relevant_counts[qid] = sum(1 for rel in docs.values() if rel > 0)
    print("Average number of relevant documents per query:", int(sum(relevant_counts.values()) / len(relevant_counts)))

    return qrels

qrels = load_qrels()

Average number of relevant documents per query: 493


In [54]:
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def is_expandable(pos):
    return pos.startswith('NN') or pos.startswith('JJ')  # nouns or adjectives

def expand_query_with_word2vec(query_text, kv_model, topn=1, n_expand=1, similarity_threshold=0.6):
    """
    Expand a query with the Word2Vec neighbours of its highest-IDF nouns/adjectives.

    Relies on the module-level `vectorizer` (the fitted TF-IDF model) for IDF scores.

    Args:
        query_text (str): Original query
        kv_model: Trained Word2Vec vectors (gensim KeyedVectors)
        topn (int): Number of similar terms per word to consider
        n_expand (int): Number of query terms to expand (highest IDF)
        similarity_threshold (float): Minimum similarity score to include a term
            (currently unused: the check is commented out below)

    Returns:
        str: Expanded query
    """
    tokens = word_tokenize(query_text.lower())
    tagged = pos_tag(tokens)
    original_words = set(tokens)

    # Select candidate tokens (noun/adjective, alphabetic, non-stopword)
    candidates = [
        (word, pos) for word, pos in tagged
        if word.isalpha() and word not in stop_words and is_expandable(pos)
    ]
    
    # Get IDF scores from the vectorizer
    idf_scores = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

    # Score candidate terms by IDF (after preprocessing for vectorizer lookup)
    scored = [
        (word, idf_scores.get(es_like_preprocess(word), 0.0))
        for word, _ in candidates
    ]

    # Pick top-n_expand tokens to expand
    top_words = [word for word, _ in sorted(scored, key=lambda x: -x[1])[:n_expand]]

    expanded_terms = []
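    for word in top_words:
        # `word` is the raw query token, while the model was trained on stemmed
        # tokens, so words whose stem differs from their surface form are skipped
        if word in kv_model.key_to_index: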
            try:
                similar_terms = kv_model.most_similar(word, topn=topn)
                for sim_word, score in similar_terms:
                    if (
                        #score >= similarity_threshold and
                        sim_word.isalpha() and
                        sim_word not in original_words
                    ):
                        expanded_terms.append(sim_word)
                        if len(expanded_terms) >= topn:
                            break  # only keep at most topn expansions total
            except KeyError:
                continue  # word not in vocab

    return query_text + " " + " ".join(expanded_terms)
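
The selection logic is easier to follow on toy data. The simplified sketch below mirrors the two stages (rank query terms by IDF, then append filtered neighbours) using plain dictionaries. The IDF values and the `NEIGHBOURS` table are made up for illustration and stand in for the fitted vectorizer and `kv_model.most_similar`; unlike the function above, it skips POS tagging and applies `topn` per expanded term.

```python
# Hypothetical IDF scores and nearest neighbours (illustrative values only)
IDF = {"coronavirus": 3.1, "origin": 5.4, "animal": 4.2}
NEIGHBOURS = {
    "origin": [("origins", 0.81), ("ancestry", 0.74), ("sars-like", 0.70)],
    "animal": [("wildlife", 0.77)],
}

def expand(query, n_expand=1, topn=1):
    tokens = query.lower().split()
    original = set(tokens)
    # Stage 1: keep the n_expand query terms with the highest IDF
    ranked = sorted(tokens, key=lambda t: -IDF.get(t, 0.0))[:n_expand]
    # Stage 2: add up to topn alphabetic neighbours not already in the query
    added = []
    for term in ranked:
        for neighbour, _score in NEIGHBOURS.get(term, [])[:topn]:
            if neighbour.isalpha() and neighbour not in original:
                added.append(neighbour)
    return " ".join(tokens + added)

print(expand("coronavirus origin"))
print(expand("coronavirus animal origin", n_expand=2, topn=2))
```

Note how the `isalpha()` filter drops a neighbour such as `sars-like`, which may or may not be desirable for a biomedical collection.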

In [50]:
def evaluate_query_expansion(w2v_model, validation_queries, k_values=(20, 30, 50), topn=3, n_expand=1, similarity_threshold=0.8):
    """
    Expands validation queries using Word2Vec, performs retrieval at multiple cutoff values (k),
    and returns the average MAP across all k_values.
    """
    expanded_queries = []
    for query in validation_queries:
        expanded_text = expand_query_with_word2vec(
            query["text"],
            w2v_model.wv,
            topn=topn,
            n_expand=n_expand,
            similarity_threshold=similarity_threshold
        )
        expanded_queries.append({
            "_id": query["_id"],
            "expanded_text": expanded_text
        })

    all_run_results = {}  # {k: {qid: {docid: score}}}
    for k in k_values:
        run = {}
        for query in expanded_queries:
            response = es.search(
                index=INDEX_NAME,
                query={"match": { "text": query["expanded_text"] }},
                size=k
            )
            run[query["_id"]] = {hit["_id"]: hit["_score"] for hit in response["hits"]["hits"]}
        all_run_results[k] = run

    total_map = 0.0
    valid_ks = 0

    for k, run in all_run_results.items():
        # Filter qrels
        qrels_subset = {qid: qrels[qid] for qid in run if qid in qrels}
        if not qrels_subset:
            continue

        evaluator = pytrec_eval.RelevanceEvaluator(qrels_subset, {"map"})
        results = evaluator.evaluate(run)

        if results:
            avg_map_k = sum(res["map"] for res in results.values()) / len(results)
            total_map += avg_map_k
            valid_ks += 1

    return total_map / valid_ks if valid_ks > 0 else 0.0
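
It helps to keep in mind how MAP behaves at these small cutoffs. In trec_eval-style average precision, the sum of precisions at the relevant ranks is divided by the total number of relevant documents for the query, not by the number retrieved. With roughly 493 relevant documents per query and at most 50 hits, even a perfect ranking scores only about 0.10. A minimal sketch of that computation (the `average_precision` helper is ours and is not part of `pytrec_eval`):

```python
def average_precision(ranked_ids, relevant_ids):
    """AP normalised by the total number of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

relevant = [f"d{i}" for i in range(493)]   # 493 relevant documents
perfect_top50 = relevant[:50]              # every retrieved hit is relevant
ceiling = average_precision(perfect_top50, relevant)
print(round(ceiling, 4))
```

Scores from this function are therefore best read relative to that ceiling, and compared only between runs that use the same cutoffs.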

In [51]:
from itertools import combinations

def average_similarity_top_words(model, top_words):
    """
    Calculate the average pairwise similarity for a list of words
    using the Word2Vec model.
    
    Args:
        model: Trained gensim Word2Vec model
        top_words: List of words to measure similarity between
    
    Returns:
        float: average pairwise similarity
    """
    valid_words = [w for w in top_words if w in model.wv.key_to_index]
    pairs = list(combinations(valid_words, 2))
    if not pairs:
        return 0.0

    sims = [model.wv.similarity(w1, w2) for w1, w2 in pairs]
    avg_sim = sum(sims) / len(sims)
    return avg_sim 
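
As a dependency-free illustration of the same measure, the sketch below averages cosine similarity over all unordered pairs of a tiny vocabulary. The 2-d vectors are toy values, not taken from the trained model. For 50 in-vocabulary words the same loop would cover 50 · 49 / 2 = 1,225 pairs.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-d "embeddings" (illustrative values only)
vectors = {"virus": (1.0, 0.0), "cell": (0.0, 1.0), "viral": (1.0, 1.0)}

pair_sims = [cosine(vectors[a], vectors[b]) for a, b in combinations(vectors, 2)]
avg_sim = sum(pair_sims) / len(pair_sims)
print(len(pair_sims), round(avg_sim, 4))
```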

In [52]:
import random
random.seed(42)

import jsonlines

with jsonlines.open('../data/trec-covid/queries.jsonl') as reader:
    queries = [obj for obj in reader]

# Sample 20 unique random indices (without replacement)
idxs = random.sample(range(len(queries)), 20)

# Get the corresponding query objects
validation_queries = [queries[i] for i in idxs]

print(f"Loaded {len(queries)} queries.")
print(f"Sampled {len(validation_queries)} validation queries.")

Loaded 50 queries.
Sampled 20 validation queries.
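
Seeding the global `random` module works as long as the cells run in the same order, because any other call to `random` in between changes the state. One possible alternative, sketched here under the assumption that only the validation split needs to be stable, is a dedicated generator (the `sample_indices` helper is hypothetical):

```python
import random

def sample_indices(n_items, k, seed=42):
    """Return k distinct indices drawn with a dedicated, seeded RNG."""
    rng = random.Random(seed)
    return rng.sample(range(n_items), k)

first = sample_indices(50, 20)
second = sample_indices(50, 20)
print(first == second, len(set(first)))
```

The indices produced this way will generally differ from those of the cell above, so switching would change the validation set.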


In [None]:
from sklearn.model_selection import ParameterGrid
from gensim.models import Word2Vec
from tqdm import tqdm
import random

# --- Define parameter space ---
# 3 × 3 × 3 × 2 × 2 × 2 = 216 configurations
param_grid = {
    'vector_size': [100, 200, 300],
    'window': [5, 8, 12],
    'negative': [5, 10, 15],
    'epochs': [10, 20],
    'min_count': [2, 5],
    'sg': [0, 1]
}

# --- Get a random subset of combinations ---
all_combinations = list(ParameterGrid(param_grid))
subset_size = 30  # Change this to control how many configs to try
selected_combinations = random.sample(all_combinations, subset_size)

# --- Run evaluation ---
best = {'score': -1}
log = []
for params in tqdm(selected_combinations, desc="Tuning Word2Vec", unit="config"):
    model = Word2Vec(processed_sentences, **params, sample=1e-4, workers=6, seed=42)
    score = evaluate_query_expansion(model, validation_queries)
    sim = average_similarity_top_words(model, top_50_words)
    log.append({'params': params, 'score': score, 'sim': sim, 'model': model})  # keep every run for later inspection
    if score > best['score']:
        best = {'params': params, 'score': score, 'sim': sim, 'model': model}
best_model = best['model']
print(f"\n🏆 Best parameters (for MAP): {best['params']}")
print(f"📈 Best Average MAP Score: {best['score']:.4f}")
print(f"💯 Average Similarity Score: {best['sim']:.4f}")

Tuning Word2Vec: 100%|██████████| 30/30 [3:39:18<00:00, 438.63s/config]  


🏆 Best parameters: {'epochs': 20, 'min_count': 5, 'negative': 5, 'sg': 1, 'vector_size': 200, 'window': 5}
📈 Best MAP Score: 0.0259
💯 Best Similarity Score: 0.3135
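
The run above evaluates 30 of the 216 grid points. The following is a minimal sketch of that random-search pattern using only the standard library. It draws the subset from a private `random.Random(42)` instance, so the chosen configurations do not depend on earlier calls to the global generator. `toy_score` is a hypothetical placeholder for the expensive "train Word2Vec, then measure MAP" step and does not reflect real retrieval quality.

```python
import itertools
import random

# Same search space as the notebook cell above
toy_grid = {
    "vector_size": [100, 200, 300],
    "window": [5, 8, 12],
    "negative": [5, 10, 15],
    "epochs": [10, 20],
    "min_count": [2, 5],
    "sg": [0, 1],
}

# Enumerate the full Cartesian product of parameter values
keys = sorted(toy_grid)
all_configs = [
    dict(zip(keys, values))
    for values in itertools.product(*(toy_grid[k] for k in keys))
]

# A private generator keeps the draw reproducible regardless of global state
rng = random.Random(42)
subset = rng.sample(all_configs, 30)

def toy_score(cfg):
    """Hypothetical stand-in for training a model and computing MAP."""
    return cfg["sg"] + 1.0 / cfg["window"]

# Keep the best-scoring configuration among the sampled ones
best_cfg = max(subset, key=toy_score)
print(f"Tried {len(subset)} of {len(all_configs)} configurations")
print(f"Best (toy) configuration: {best_cfg}")
```

Sampling without replacement guarantees that no configuration is trained twice, which matters when each one takes several minutes.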





In [59]:
import os
# --- Save model ---
os.makedirs("../models", exist_ok=True)

# Save only the KeyedVectors part
best_model.wv.save("../models/best_w2v_ir2025.kv")
print("✅ Word2Vec model saved (only KeyedVectors) to: ../models/best_w2v_ir2025.kv")

✅ Word2Vec model saved (only KeyedVectors) to: ../models/best_w2v_ir2025.kv


In [60]:
from gensim.models import KeyedVectors

# --- Load model ---
kv_model = KeyedVectors.load("../models/best_w2v_ir2025.kv", mmap='r')
print("✅ Word2Vec model (only KeyedVectors) successfully loaded.")

✅ Word2Vec model (only KeyedVectors) successfully loaded.


In [61]:
print(f"Vocabulary size: {len(kv_model.key_to_index)}")

Vocabulary size: 66705


In [62]:
import numpy as np
vector_norms = np.linalg.norm(kv_model.vectors, axis=1)
print(f"Mean vector norm: {np.mean(vector_norms):.4f}")
print(f"Max vector norm: {np.max(vector_norms):.4f}")
print(f"Min vector norm: {np.min(vector_norms):.4f}")

Mean vector norm: 3.8426
Max vector norm: 8.3411
Min vector norm: 0.0397
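
The norms above span roughly two orders of magnitude, but cosine similarity depends only on direction, not on vector length. The sketch below illustrates this on made-up vectors: after scaling each row to unit length, a plain dot product equals the cosine similarity. `toy_vectors` is a hypothetical array standing in for `kv_model.vectors`.

```python
import numpy as np

# Hypothetical embeddings with very different lengths (norms 5.0, 0.5, 5.0)
toy_vectors = np.array([
    [3.0, 4.0],
    [0.3, 0.4],   # same direction as row 0, ten times shorter
    [-4.0, 3.0],  # orthogonal to row 0
])

# L2 norm of every row, kept as a column so broadcasting divides row-wise
norms = np.linalg.norm(toy_vectors, axis=1, keepdims=True)
unit_vectors = toy_vectors / norms

# For unit vectors, the dot product is the cosine similarity
cosine_matrix = unit_vectors @ unit_vectors.T
print(np.round(cosine_matrix, 4))
```

Rows 0 and 1 come out as perfectly similar despite the tenfold difference in norm, which is why a small minimum norm by itself does not distort similarity-based expansion.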


In [63]:
expanded_queries = []
print("Expanding Queries..")
for query in tqdm(queries, unit="query"):
    new_query = query.copy()
    new_query["expanded_text"] = expand_query_with_word2vec(query["text"], kv_model)
    expanded_queries.append(new_query)

Expanding Queries..


100%|██████████| 50/50 [00:26<00:00,  1.88query/s]
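
`expand_query_with_word2vec` is defined earlier in the notebook and is not reproduced here. As a rough illustration of the general technique only, the sketch below appends the nearest neighbours of each query term, subject to a similarity threshold. Everything in it is an assumption: the function name `expand_query`, the `topn` and `min_sim` parameters, and the `toy_neighbours` table with its invented scores, which stands in for lookups such as `kv_model.most_similar(term, topn=...)`. The notebook's actual function may work differently.

```python
# Hypothetical neighbour lists with invented similarity scores
toy_neighbours = {
    "coronavirus": [("sars-cov-2", 0.82), ("covid-19", 0.79), ("virus", 0.55)],
    "origin": [("source", 0.61), ("emergence", 0.58)],
}

def expand_query(text, neighbours, topn=2, min_sim=0.6):
    """Append up to `topn` sufficiently similar neighbours of each query term."""
    tokens = text.lower().split()
    expanded = list(tokens)  # the original terms always come first
    for token in tokens:
        for candidate, sim in neighbours.get(token, [])[:topn]:
            # Skip weak neighbours and terms that are already in the query
            if sim >= min_sim and candidate not in expanded:
                expanded.append(candidate)
    return " ".join(expanded)

expanded = expand_query("coronavirus origin", toy_neighbours)
print(expanded)  # coronavirus origin sars-cov-2 covid-19 source
```

Keeping the original terms and adding neighbours after them means the expanded text can be sent to the same `match` query used for the unexpanded baseline.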


In [64]:
with jsonlines.open("../data/trec-covid/queries_expanded_word2vec.jsonl", mode='w') as writer:
    for q in expanded_queries:
        writer.write(q)
    print("✅ Expanded queries saved to ../data/trec-covid/queries_expanded_word2vec.jsonl")

✅ Expanded queries saved to ../data/trec-covid/queries_expanded_word2vec.jsonl


In [65]:
def process_queries_phase_3(expanded_queries_path):
    # Load queries
    with open(expanded_queries_path, 'r', encoding='utf-8') as f:
        queries = [json.loads(line) for line in f]

    INDEX_NAME = "ir2025-index"
    k_values = [20, 30, 50]

    # All runs share one output directory, so create it once up front
    output_dir = "../results/phase_3"
    os.makedirs(output_dir, exist_ok=True)

    runs = {f"run_{k}": {} for k in k_values}
    for k in k_values:
        for query in tqdm(queries, desc=f"Processing Expanded Queries with Word2Vec for run with k = {k}"):
            qid = query["_id"]
            query_text = query["expanded_text"]  # expanded earlier by expand_query_with_word2vec()

            # Retrieve the top-k documents for the expanded query
            response = es.search(
                index=INDEX_NAME,
                query={"match": { "text": query_text }},
                size=k
            )

            # Map each retrieved document ID to its retrieval score
            runs[f"run_{k}"][qid] = {hit["_id"]: hit["_score"] for hit in response["hits"]["hits"]}

        # Save each run
        with open(os.path.join(output_dir, f'retrieval_top_{k}.json'), 'w', encoding='utf-8') as f:
            json.dump(runs[f"run_{k}"], f, ensure_ascii=False, indent=4)
        print(f"✅ Results saved to: {output_dir}/retrieval_top_{k}.json")

    return runs
    
runs = process_queries_phase_3("../data/trec-covid/queries_expanded_word2vec.jsonl")

Processing Expanded Queries with Word2Vec for run with k = 20: 100%|██████████| 50/50 [00:00<00:00, 52.90it/s]


✅ Results saved to: ../results/phase_3/retrieval_top_20.json


Processing Expanded Queries with Word2Vec for run with k = 30: 100%|██████████| 50/50 [00:00<00:00, 73.91it/s]


✅ Results saved to: ../results/phase_3/retrieval_top_30.json


Processing Expanded Queries with Word2Vec for run with k = 50: 100%|██████████| 50/50 [00:00<00:00, 63.42it/s]

✅ Results saved to: ../results/phase_3/retrieval_top_50.json





In [66]:
def compute_metrics(qrels, runs, folder, metrics=('map', 'P_5', 'P_10', 'P_15', 'P_20')):
    # Ask pytrec_eval for MAP and the precision-at-cutoff family (P_5, P_10, ...)
    evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'map', 'P'})
    
    for run_name, run in runs.items():
        k = run_name.split("_")[1]
        print(f"Computing metrics for run with k = {k}")
        
        # Verify how many documents were retrieved per query
        # for query_id, docs in run.items():
            # num_docs = len(docs)
            # print(f"Query ID: {query_id} - Retrieved Documents: {num_docs}")
            
        results = evaluator.evaluate(run)
        
        #Print available metrics for debugging
        # first_query = list(results.keys())[0]
        # print(f"Available metrics for {first_query}: {list(results[first_query].keys())}")
        
        # Compute average metrics
        avg_scores = {metric: 0.0 for metric in metrics}
        num_queries = len(results)
        
        for res in results.values():
            for metric in metrics:
                avg_scores[metric] += res.get(metric, 0.0)
        
        for metric in metrics:
            avg_scores[metric] /= num_queries

        # Prepare output directory
        output_dir = os.path.join("../results", folder)
        os.makedirs(output_dir, exist_ok=True)
        
        # Save per-query metrics
        per_query_path = os.path.join(output_dir, f"per_query_metrics_top_{k}.json")
        with open(per_query_path, "w", encoding="utf-8") as f:
            json.dump(results, f, indent=4)
        
        # Save average metrics
        avg_metrics_path = os.path.join(output_dir, f"average_metrics_top_{k}.json")
        with open(avg_metrics_path, "w", encoding="utf-8") as f:
            json.dump(avg_scores, f, indent=4)
        
        print(f"✅ Per-query metrics saved to: {per_query_path}")
        print(f"✅ Average metrics saved to: {avg_metrics_path}\n")
        
compute_metrics(qrels, runs, "phase_3")

Computing metrics for run with k = 20
✅ Per-query metrics saved to: ../results\phase_3\per_query_metrics_top_20.json
✅ Average metrics saved to: ../results\phase_3\average_metrics_top_20.json

Computing metrics for run with k = 30
✅ Per-query metrics saved to: ../results\phase_3\per_query_metrics_top_30.json
✅ Average metrics saved to: ../results\phase_3\average_metrics_top_30.json

Computing metrics for run with k = 50
✅ Per-query metrics saved to: ../results\phase_3\per_query_metrics_top_50.json
✅ Average metrics saved to: ../results\phase_3\average_metrics_top_50.json
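
The two metric families averaged above normalise differently, which helps explain why MAP can be small while P@k is high. The sketch below follows the standard definitions as I understand them from trec_eval: P@k divides the number of relevant hits by k, while average precision divides by the total number of relevant documents in the qrels, including those never retrieved. The function names and the toy ranking are hypothetical and independent of pytrec_eval.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k

def average_precision(ranked_ids, relevant_ids):
    """Sum of precision at each relevant hit, divided by ALL relevant documents."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)

# Toy query: 100 relevant documents exist, but only 5 documents are retrieved
relevant = {f"rel{i}" for i in range(100)}
ranked = ["rel0", "rel1", "x", "rel2", "rel3"]

p5 = precision_at_k(ranked, relevant, 5)   # 4 relevant in the top 5 -> 0.8
ap = average_precision(ranked, relevant)   # (1/1 + 2/2 + 3/4 + 4/5) / 100 -> 0.0355
print(f"P@5 = {p5:.4f}, AP = {ap:.4f}")
```

Because the denominator of average precision is fixed by the qrels, retrieving more documents (a larger k) can only add terms to the sum, whereas P@5 to P@20 depend only on the first 20 results.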



In [67]:
def compare_phases(phases, k_values=(20, 30, 50), metrics=('map', 'P_5', 'P_10', 'P_15', 'P_20')):
    """
    Display and optionally compare retrieval metrics for 1 to 4 phases.
    Parameters:
    - phases: dict mapping phase names to base file paths, e.g.
        {
            "Phase 1": "../results/phase_1/average_metrics_top_{}.json",
            "Phase 2": "../results/phase_2/average_metrics_top_{}.json",
            ...
        }
    - k_values: list of cutoff values to compare (e.g. [20, 30, 50])
    - metrics: list of TREC metric keys (e.g. ['map', 'P_5', 'P_10'])

    Returns:
    - pandas DataFrame with metrics for all phases at each k
    """
    comparison = []

    for k in k_values:
        row = {"k": k}
        for phase_name, base_path in phases.items():
            try:
                with open(base_path.format(k), "r") as f:
                    phase_metrics = json.load(f)
                row[f"{phase_name} MAP"] = phase_metrics["map"]
                for m in metrics:
                    if m == "map":  # MAP already has its own column above
                        continue
                    row[f"{phase_name} avgPre@{m[2:]}"] = phase_metrics[m]  # "P_5" -> "avgPre@5"
            except FileNotFoundError:
                print(f"⚠️ File not found: {base_path.format(k)}")
        comparison.append(row)

    df = pd.DataFrame(comparison)
    df.sort_values("k", inplace=True)
    df.set_index("k", inplace=True) # Set 'k' column as the index for visualization purposes
    display(df)
    return df

In [68]:
phases = {
    "Phase 1": "../results/phase_1/average_metrics_top_{}.json",
    "Phase 2": "../results/phase_2/average_metrics_top_{}.json",
    "Phase 3": "../results/phase_3/average_metrics_top_{}.json",
    # "Phase 4": "../results/phase_4/average_metrics_top_{}.json"
}
_ = compare_phases(phases)

| k | Phase 1 MAP | Phase 1 avgPre@5 | Phase 1 avgPre@10 | Phase 1 avgPre@15 | Phase 1 avgPre@20 | Phase 2 MAP | Phase 2 avgPre@5 | Phase 2 avgPre@10 | Phase 2 avgPre@15 | Phase 2 avgPre@20 | Phase 3 MAP | Phase 3 avgPre@5 | Phase 3 avgPre@10 | Phase 3 avgPre@15 | Phase 3 avgPre@20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | 0.020569 | 0.64 | 0.582 | 0.564 | 0.548 | 0.020552 | 0.608 | 0.586 | 0.556 | 0.538 | 0.021517 | 0.652 | 0.6 | 0.574667 | 0.555 |
| 30 | 0.027753 | 0.64 | 0.582 | 0.564 | 0.549 | 0.028369 | 0.608 | 0.586 | 0.556 | 0.538 | 0.028661 | 0.652 | 0.6 | 0.574667 | 0.556 |
| 50 | 0.039911 | 0.64 | 0.582 | 0.564 | 0.549 | 0.040856 | 0.608 | 0.586 | 0.556 | 0.538 | 0.04093 | 0.652 | 0.6 | 0.574667 | 0.556 |
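
Absolute MAP differences this small are easier to read as relative changes. The sketch below recomputes them from the MAP columns of the table above. The `map_scores` dictionary simply copies those values, and `relative_change` is a hypothetical helper, not part of the notebook.

```python
# MAP values copied from the comparison table above
map_scores = {
    20: {"Phase 1": 0.020569, "Phase 2": 0.020552, "Phase 3": 0.021517},
    30: {"Phase 1": 0.027753, "Phase 2": 0.028369, "Phase 3": 0.028661},
    50: {"Phase 1": 0.039911, "Phase 2": 0.040856, "Phase 3": 0.04093},
}

def relative_change(new, old):
    """Percentage change of `new` with respect to `old`."""
    return (new - old) / old * 100.0

# Relative MAP change of Phase 3 over Phase 1 at each cutoff k
changes = {
    k: relative_change(scores["Phase 3"], scores["Phase 1"])
    for k, scores in map_scores.items()
}

for k, pct in changes.items():
    print(f"k = {k}: Phase 3 vs Phase 1 MAP change = {pct:+.2f}%")
```

These are averages over 50 queries, so a gain of a few percent should be read with caution unless it is backed by a per-query significance test.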
