# BM25 Document Search and Analysis

This information retrieval system indexes a corpus of documents and answers queries using the BM25 ranking model. To process large datasets efficiently, it builds the index with SPIMI (Single-Pass In-Memory Indexing).

## Features

- **Tokenizer**: Supports tokenization with options for case normalization, stopword removal, and stemming.
- **Indexing**: Indexes documents in fixed-size batches, so memory use stays bounded as the corpus grows.
- **Searching**: Implements the BM25 ranking model for retrieving relevant documents based on queries.

In [2]:
import sys
import os
import time
import ujson
from typing import List, Dict

# Add parent directory to path to import from src
sys.path.append('..')

# Import our custom modules
from src import data_processing as dp
from src import bm25_search as bm25
from src import evaluation as ndcg 

## Configuration

The cell below sets the file paths, tokenizer options, BM25 parameters, and indexing parameters. The tokenizer turns text into tokens suitable for indexing and searching, and supports the following options:

- **Case Normalization**: Converts all tokens to lowercase if `lowercase=True`.
- **Minimum Token Length**: Discards tokens shorter than `min_token_length` (default is 3).
- **Stopword Removal**: Removes common stopwords if a stopword list is provided.
- **Stemming**: Reduces tokens to their stem forms using the Snowball Stemmer if `stem=True`.
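
As a rough illustration of how these four options might interact, here is a minimal sketch. It is not the tokenizer from `src.data_processing`: the function `tokenize`, the regular expression, and the toy `strip_plural` stemmer are all assumptions made so the example runs without extra dependencies. In the setup described above, the `stemmer` argument would be a Snowball stemmer.

```python
import re
from typing import Callable, List, Optional, Set

# Compiled once and reused. The real tokenizer's pattern is not shown in
# this notebook, so this one (runs of letters and digits) is an assumption.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")


def tokenize(
    text: str,
    min_token_length: int = 3,
    lowercase: bool = True,
    stopwords: Optional[Set[str]] = None,
    stemmer: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Hypothetical tokenizer: normalize case, filter, then stem."""
    if lowercase:
        text = text.lower()
    tokens = []
    for token in TOKEN_PATTERN.findall(text):
        if len(token) < min_token_length:
            continue  # Too short to be useful as an index term
        if stopwords and token in stopwords:
            continue  # Common word carrying little meaning
        if stemmer is not None:
            token = stemmer(token)
        tokens.append(token)
    return tokens


def strip_plural(token: str) -> str:
    """Toy stand-in for a real stemmer: drops one trailing 's'."""
    return token[:-1] if token.endswith("s") else token


tokens = tokenize(
    "The Effects of Erenumab in the patients",
    stopwords={"the"},
    stemmer=strip_plural,
)
print(tokens)
```

Filtering before stemming keeps the stemmer from being called on tokens that would be discarded anyway, which is one plausible ordering among several.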

In [3]:
# Paths configuration
CORPUS_PATH = '../data/MEDLINE_2024_Baseline.jsonl'
OUTPUT_DIR = '../output'
QUESTIONS_PATH = '../data/questions.jsonl'  # Optional: for batch processing

# Tokenizer configuration
TOKENIZER_CONFIG = {
    'min_token_length': 3,
    'lowercase': True,
    'stem': True,
    'stopwords': None  # Can provide a set of stopwords
}

# BM25 parameters
BM25_PARAMS = {
    'k1': 1.2,  # Term frequency saturation
    'b': 0.75   # Document length normalization
}

# Indexing parameters
BATCH_SIZE = 10000
N_RESULTS = 100
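
To make the roles of `k1` and `b` concrete, the sketch below scores a single query term against a single document. The function name `bm25_term_score` and the statistics passed to it are hypothetical, and the IDF variant used (a non-negative form) is an assumption; `src.bm25_search` may use a different one.

```python
import math


def bm25_term_score(tf: int, df: int, doc_len: int, avg_doc_len: float,
                    n_docs: int, k1: float = 1.2, b: float = 0.75) -> float:
    """Contribution of one query term to one document's BM25 score."""
    # Non-negative IDF variant (an assumption; other variants exist).
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    # b scales how strongly longer-than-average documents are penalized.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences stop adding to the score.
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Hypothetical corpus statistics, chosen only to illustrate the behaviour.
stats = dict(df=50, avg_doc_len=150.0, n_docs=10_000)

# Term frequency saturation: each extra occurrence adds less than the last.
scores = [bm25_term_score(tf=tf, doc_len=150, **stats) for tf in (1, 2, 10, 100)]
for tf, score in zip((1, 2, 10, 100), scores):
    print(f"tf={tf:>3}  score={score:.3f}")

# Length normalization: same tf, but the shorter document scores higher.
short_doc = bm25_term_score(tf=2, doc_len=75, **stats)
long_doc = bm25_term_score(tf=2, doc_len=300, **stats)
print(f"short={short_doc:.3f}  long={long_doc:.3f}")
```

A document's full score for a query is the sum of this quantity over the query's terms. Setting `b=0` disables length normalization entirely, and the score approaches `idf * (k1 + 1)` as `tf` grows.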


## Corpus Indexing

Run this section to build a new index or to select an existing index file.

### Optimization Techniques in Indexing

- **SPIMI Algorithm**: Implements the Single-Pass In-Memory Indexing algorithm to efficiently handle large datasets by writing partial indexes to disk.
- **Batch Processing**: Indexes documents in batches (`batch_size=10000`) to manage memory usage.
- **Precompiled Regular Expressions**: Uses precompiled regex patterns in the tokenizer to improve tokenization speed.
- **Stemming Cache**: Caches stemmed tokens to avoid redundant computations during tokenization.
- **MessagePack Serialization**: Uses `MessagePack` for efficient binary serialization when writing partial and merged indexes to disk.
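
The batch-and-merge idea behind SPIMI can be sketched as follows. This is a simplified illustration, not the code in `src.data_processing`: the function names are hypothetical, JSON stands in for MessagePack so the example needs only the standard library, and the merge step accumulates everything in memory, whereas a merge designed for very large corpora would typically stream terms in sorted order.

```python
import json
import os
import tempfile
from typing import Dict, Iterable, List, Tuple

# term -> {doc_id -> [positions]}
Postings = Dict[str, Dict[str, List[int]]]


def write_partial(index: Postings, out_dir: str, part_no: int) -> str:
    """Write one in-memory block to disk with its terms sorted."""
    path = os.path.join(out_dir, f"partial_index_{part_no}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f, sort_keys=True)
    return path


def build_partial_indexes(docs: Iterable[Tuple[str, List[str]]],
                          batch_size: int, out_dir: str) -> List[str]:
    """Single pass over the corpus, flushing a block every batch_size docs."""
    paths: List[str] = []
    index: Postings = {}
    docs_in_batch = 0
    for doc_id, tokens in docs:
        for position, term in enumerate(tokens):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
        docs_in_batch += 1
        if docs_in_batch == batch_size:
            paths.append(write_partial(index, out_dir, len(paths)))
            index, docs_in_batch = {}, 0  # Release the block's memory
    if index:  # Flush the final, possibly smaller, block
        paths.append(write_partial(index, out_dir, len(paths)))
    return paths


def merge_partial_indexes(paths: List[str]) -> Postings:
    """Combine partial indexes; each doc_id appears in exactly one block."""
    merged: Postings = {}
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            partial = json.load(f)
        for term, postings in partial.items():
            merged.setdefault(term, {}).update(postings)
    return merged


# Three already-tokenized toy documents, indexed in batches of two.
docs = [
    ("d1", ["ethylhexyl", "phthalate", "ethylhexyl"]),
    ("d2", ["phthalate", "exposure"]),
    ("d3", ["ethylhexyl"]),
]
with tempfile.TemporaryDirectory() as tmp:
    partial_paths = build_partial_indexes(docs, batch_size=2, out_dir=tmp)
    merged = merge_partial_indexes(partial_paths)

print(len(partial_paths), merged["ethylhexyl"])
```

Because every document belongs to exactly one batch, the merge never has to reconcile two position lists for the same term and document, which keeps it a plain dictionary update.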

### Index File Format
- **Format**: The index is stored as a MessagePack (`.msgpack`) file.

- **Structure**:
  - **Index**: A dictionary where keys are terms and values are dictionaries mapping document IDs to lists of positions where the term occurs.
  - **Document Lengths**: A dictionary mapping document IDs to the total number of tokens in each document.

Because `.msgpack` is a binary format, its contents cannot be displayed directly.
A sample from the file, rendered here as JSON, is shown below:

```json
{
  "index": {
    "ethylhexyl": {
      "25153068": [7, 49],
      "32745991": [3, 26],
      "12437285": [8, 10],
      "15924484": [5, 27],
      "19555962": [150],
      "12270607": [8, 29],
      "23356645": [3, 22],
      "22041199": [9, 17],
      "8333024": [3, 25],
      "20453712": [14, 22],
      "30960727": [97],
      "37536456": [13],
      "14687758": [99],
      "35843048": [21],
      "16956469": [22],
      "34788783": [1, 39, 50, 52, 83, 117, 265, 281],
      "14998748": [14, 17, 28],
      "32610232": [0, 48],
      "14556481": [32],
      "35859238": [35],
      "28661659": [84, 92],
      "31033968": [12],
        .
        .
        .
    },
        .
        .
        .
  },
  "doc_lengths": {
    "2451706": 115,
    "35308048": 192,
    "7660250": 51,
    "28963802": 143,
    "25153068": 231,
    "874026": 101,
    "4001859": 137,
    "10149271": 92,
    "35267334": 190,
    "3656477": 128,
    "30818862": 217,
        .
        .
        .
  }
}
```
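
Given this layout, the statistics BM25 needs can be read straight off the two dictionaries. The sketch below uses a tiny excerpt of the sample above; the helper names are hypothetical and are not part of `src.bm25_search`.

```python
from typing import Dict, List

# A small excerpt of the sample shown above, already decoded into Python dicts.
sample = {
    "index": {
        "ethylhexyl": {
            "25153068": [7, 49],
            "19555962": [150],
        },
    },
    "doc_lengths": {
        "2451706": 115,
        "35308048": 192,
        "25153068": 231,
    },
}


def term_frequency(index: Dict[str, Dict[str, List[int]]],
                   term: str, doc_id: str) -> int:
    """Occurrences of term in doc_id: the length of its position list."""
    return len(index.get(term, {}).get(doc_id, []))


def document_frequency(index: Dict[str, Dict[str, List[int]]], term: str) -> int:
    """Number of documents containing term: the size of its postings dict."""
    return len(index.get(term, {}))


def average_doc_length(doc_lengths: Dict[str, int]) -> float:
    """Mean token count per document, or 0.0 for an empty collection."""
    return sum(doc_lengths.values()) / len(doc_lengths) if doc_lengths else 0.0


tf = term_frequency(sample["index"], "ethylhexyl", "25153068")
df = document_frequency(sample["index"], "ethylhexyl")
avgdl = average_doc_length(sample["doc_lengths"])
print(tf, df, round(avgdl, 2))
```

Storing positions rather than bare counts costs extra space, but it leaves room for position-aware features such as phrase matching while still giving term frequency for free.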

### Build Index

In [9]:
print(f"Building index from corpus: {CORPUS_PATH}")
print(f"Output directory: {OUTPUT_DIR}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Tokenizer config: {TOKENIZER_CONFIG}")
    
start_time = time.time()

# Index documents
index_path = dp.index_documents(
    corpus_path=CORPUS_PATH,
    output_dir=OUTPUT_DIR,
    batch_size=BATCH_SIZE,
    tokenizer_config=TOKENIZER_CONFIG
)

elapsed_time = time.time() - start_time
print(f"\nIndexing completed in {elapsed_time:.2f} seconds")
print(f"Index saved at: {index_path}")

Building index from corpus: ../data/MEDLINE_2024_Baseline.jsonl
Output directory: ../output
Batch size: 10000
Tokenizer config: {'min_token_length': 3, 'lowercase': True, 'stem': True, 'stopwords': None}
Merging partial index: ../output/partial_index_0.msgpack
Merging partial index: ../output/partial_index_1.msgpack
Merging partial index: ../output/partial_index_2.msgpack
Merging partial index: ../output/partial_index_3.msgpack
Merging partial index: ../output/partial_index_4.msgpack
Merging partial index: ../output/partial_index_5.msgpack
Merging partial index: ../output/partial_index_6.msgpack
Merging partial index: ../output/partial_index_7.msgpack
Merging partial index: ../output/partial_index_8.msgpack
Merging partial index: ../output/partial_index_9.msgpack
Merging partial index: ../output/partial_index_10.msgpack
Merging partial index: ../output/partial_index_11.msgpack
Merging partial index: ../output/partial_index_12.msgpack
Merging partial index: ../output/partial_index_13.ms

### Set Existing Index Path (if there's a `merged_index.msgpack` file already)

In [4]:
index_path = os.path.join(OUTPUT_DIR, 'merged_index.msgpack')
print(f"Using existing index: {index_path}")

Using existing index: ../output/merged_index.msgpack


## Load Tokenizer Configuration

In [5]:
# Load tokenizer configuration
config_path = os.path.join(OUTPUT_DIR, 'tokenizer_config.msgpack')

if os.path.exists(config_path):
    tokenizer_config = dp.load_tokenizer_config(config_path)
    print("Loaded tokenizer configuration:")
    for key, value in tokenizer_config.items():
        print(f"  {key}: {value}")
else:
    tokenizer_config = TOKENIZER_CONFIG
    print("Using default tokenizer configuration")

Loaded tokenizer configuration:
  min_token_length: 3
  lowercase: True
  stopwords: None
  stem: True


## Batch Query Processing

Process multiple queries from a file.

In [8]:
# Load queries from file
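def load_queries(file_path: str) -> List[Dict]:
    """Load queries from a JSONL file (one JSON object per line)."""
    queries = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # Skip blank lines, which would otherwise fail to parse
                queries.append(ujson.loads(line))
    return queries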

# Process batch queries if file exists
if os.path.exists(QUESTIONS_PATH):
    print(f"Loading queries from: {QUESTIONS_PATH}")
    queries = load_queries(QUESTIONS_PATH)
    print(f"Loaded {len(queries)} queries")
    
    # Process all queries
    start_time = time.time()
    batch_results = bm25.batch_search(
        queries=queries,
        index_path=index_path,
        tokenizer_config=tokenizer_config,
        n_results=N_RESULTS,
        k1=BM25_PARAMS['k1'],
        b=BM25_PARAMS['b']
    )
    batch_time = time.time() - start_time
    
    print(f"\nProcessed {len(batch_results)} queries in {batch_time:.2f} seconds")
    if batch_results:  # Avoid division by zero when no queries were processed
        print(f"Average time per query: {batch_time/len(batch_results):.3f} seconds")
    
    # Save results
    output_file = os.path.join(OUTPUT_DIR, 'ranked_questions.jsonl')
    with open(output_file, 'w', encoding='utf-8') as f:
        for entry in batch_results:
            f.write(ujson.dumps(entry) + '\n')
    
    print(f"\nResults saved to: {output_file}")
else:
    print(f"Questions file not found at: {QUESTIONS_PATH}")

Loading queries from: ../data/questions.jsonl
Loaded 100 queries
Search: Is erenumab effective for trigeminal neuralgia?
Done with this one...
Search: What is the first indication for lurasidone?
Done with this one...
Search: Can other vaccines be given with COVID-19 vaccine?
Done with this one...
Search: What is Sublocade?
Done with this one...
Search: Is music therapy effective for pain management in neonates?
Done with this one...
Search: What is the mechanisms of action of Gilteritinib?
Done with this one...
Search: What is the reason for N-acetylgalactosamine (GalNAc) conjugation of siRNAs?
Done with this one...
Search: What is synthetic lethality?
Done with this one...
Search: How many injections of CLS-TA did the patients participating in the PEACHTREE trial receive?
Done with this one...
Search: What are the most commonly used diagnostic tests for the diagnosis of Duchenne muscular dystrophy?
Done with this one...
Search: Is there any association between Tripe palms and cancer?

## Evaluate Retrieved Documents
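
The evaluation cell in this section reports nDCG@10 for each query. How `compute_average_ndcg` is implemented is not shown here, so the following is only a minimal sketch of the metric under two assumptions: relevance is binary (a document is either in the query's relevant set or not) and the discount is the usual `log2(rank + 1)`. The function names are hypothetical.

```python
import math
from typing import List, Set


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k ranks (rank 1 is index 0)."""
    return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids: List[str], relevant_ids: Set[str], k: int = 10) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_doc_ids]
    # Ideal ranking: every relevant document placed first, capped at k.
    ideal_gains = [1.0] * min(len(relevant_ids), k)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# One relevant document, retrieved at rank 2 instead of rank 1.
example = ndcg_at_k(["a", "b", "c"], relevant_ids={"b"}, k=10)
print(f"{example:.4f}")
```

A value of 1.0 means the relevant documents occupy the top ranks in the best possible order, and 0.0 means none of them appear in the top `k`.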

In [10]:
# Compute nDCG for the given results
ndcg.compute_average_ndcg(
    questions_file_path=QUESTIONS_PATH,
    results_file_path=output_file,
    k=10
)

Query ID: 63f73f1b33942b094c000008, nDCG@10: 1.0000
Query ID: 643d41e757b1c7a315000037, nDCG@10: 0.7000
Query ID: 643c88a257b1c7a315000030, nDCG@10: 0.2824
Query ID: 64403c4257b1c7a31500004f, nDCG@10: 0.6309
Query ID: 6441302d57b1c7a315000056, nDCG@10: 0.6625
Query ID: 63f042e2f36125a426000022, nDCG@10: 0.3801
Query ID: 64184483690f196b51000038, nDCG@10: 0.8226
Query ID: 643de76757b1c7a315000039, nDCG@10: 0.4993
Query ID: 64403ab057b1c7a31500004d, nDCG@10: 1.0000
Query ID: 64179139690f196b5100002f, nDCG@10: 0.0736
Query ID: 63f02b50f36125a426000014, nDCG@10: 0.2022
Query ID: 6411b678201352f04a000036, nDCG@10: 0.6180
Query ID: 643bc8f957b1c7a31500002b, nDCG@10: 0.5916
Query ID: 64403be357b1c7a31500004e, nDCG@10: 0.3155
Query ID: 644289c457b1c7a31500005e, nDCG@10: 0.5965
Query ID: 63f02ec1f36125a426000017, nDCG@10: 0.0000
Query ID: 641c516d690f196b5100003f, nDCG@10: 0.6326
Query ID: 64371c5957b1c7a31500002a, nDCG@10: 0.5972
Query ID: 6440396957b1c7a31500004b, nDCG@10: 0.6309
Query ID: 64