## Create BM25 Index
This notebook builds the BM25 index that serves as the sparse retrieval layer of our custom hybrid search engine.

#### Methodology
Some background:
- Relying on a single embedding of a document's description is brittle.
- Dense embedding search over a large corpus of documents is time-consuming.
- Some documents contain thousands of rows. A single dense embedding of such a document dilutes the context around individual keywords, so dense retrieval can fail to surface a relevant document even when the query matches its keywords exactly.

To counter this, BM25-based retrieval is used as the first layer. It relies on exact-term matching (on stemmed tokens), is efficient, and is less RAM-intensive than dense retrieval models. This lightweight layer reduces the search space from thousands of candidate documents to a few hundred "possibly" relevant ones.
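To make the first-layer idea concrete, here is a minimal pure-Python sketch of BM25 scoring. It is not the `bm25s` implementation: it uses one common BM25 variant (an idf with `+ 1` inside the log), and the function name `bm25_scores`, the parameters `k1=1.5` and `b=0.75`, and the three toy documents are assumptions made for illustration only.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query (illustrative BM25 variant)."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # only terms present in the document contribute
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalisation: long documents need more hits for the same score.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (made up for this sketch), already lower-cased and split on whitespace.
docs = [
    "mining and quarrying production index".split(),
    "population aged 15 years and over at work".split(),
    "retail sales index by sector".split(),
]
query = "mining production".split()

scores = bm25_scores(query, docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranked, scores)
```

Documents that share no term with the query score zero, which is what lets this layer discard most of the corpus cheaply before any dense model is involved.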

For more on sparse vs. dense retrieval, see [SPLADE for Sparse Vector Search Explained](https://www.pinecone.io/learn/splade/).

Let's get started.

### 1. Load the CSO dump

In [1]:
from pathlib import Path
import os

root = Path().absolute().parents[1]
os.chdir(str(root))

from src.helpers.json_stat_archive_db import JSONStatArchiveDB

In [7]:
db = JSONStatArchiveDB(compression_level=12)

cso_files = {}
for tid, ds, ts in db.read("artifacts/cso_bkp/cso_archive/jsonstat_archive.sqlite", table_id=None, with_labels=True):
    cso_files[tid] = {
        "data": ds,
        "timestamp": ts,
    }

In [9]:
len(cso_files)

12435

### 2. Create the bag-of-words for each document

In [10]:
def get_bag_of_words(cso_file: dict) -> dict:
    """Flatten a JSON-stat table into an {"id", "text"} record for BM25 indexing."""
    col_dist_vals_dic = {}

    col_ids = cso_file['id']
    label = cso_file['label']
    subject = cso_file['extension']['subject']['value']
    product = cso_file['extension']['product']['value']
    table_id = cso_file['extension']["matrix"]

    for col_id in col_ids:
        # Skip TLIST* (time-period) dimensions
        if col_id.startswith("TLIST"):
            continue
        col_name = cso_file['dimension'][col_id]['label']
        col_dist_vals_dic[col_name] = list(cso_file['dimension'][col_id]['category']['label'].values())

    # Combine the label, subject, product, column names and distinct column values into a single string
    parts = list(col_dist_vals_dic) + [" ".join(vals) for vals in col_dist_vals_dic.values()]
    combined_str = f"{label} - {subject} - {product} - {' '.join(parts)}"

    return {"id": table_id, "text": combined_str}

corpus = [get_bag_of_words(item["data"]) for item in cso_files.values()]
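To show what one corpus entry looks like, the sketch below runs the same flattening logic on a tiny hand-made table. `flatten_table` is a hypothetical, compact restatement of `get_bag_of_words`, and `toy_table` (including the id `TOY01` and all of its labels) is invented for illustration. It only mimics the fields the function reads and is not real CSO data.

```python
def flatten_table(table: dict) -> dict:
    """Compact restatement of the flattening step for a JSON-stat-like dict."""
    ext = table["extension"]
    columns = {}
    for col_id in table["id"]:
        if col_id.startswith("TLIST"):  # skip TLIST* dimensions, as in the notebook
            continue
        dim = table["dimension"][col_id]
        columns[dim["label"]] = list(dim["category"]["label"].values())
    # Column names first, then the values of each column.
    parts = list(columns) + [" ".join(vals) for vals in columns.values()]
    header = " - ".join([table["label"], ext["subject"]["value"], ext["product"]["value"]])
    return {"id": ext["matrix"], "text": header + " - " + " ".join(parts)}


# Hand-made toy table (hypothetical values).
toy_table = {
    "id": ["STATISTIC", "TLIST(A1)", "SEX"],
    "label": "Toy Production Index",
    "extension": {
        "matrix": "TOY01",
        "subject": {"value": "Industry"},
        "product": {"value": "Toy Product"},
    },
    "dimension": {
        "STATISTIC": {"label": "Statistic", "category": {"label": {"S1": "Production Index"}}},
        "TLIST(A1)": {"label": "Year", "category": {"label": {"2020": "2020", "2021": "2021"}}},
        "SEX": {"label": "Sex", "category": {"label": {"1": "Male", "2": "Female"}}},
    },
}

doc = flatten_table(toy_table)
print(doc["text"])
# Toy Production Index - Industry - Toy Product - Statistic Sex Production Index Male Female
```

The `Year` dimension is dropped, while every other column contributes both its name and its category labels to the searchable text.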

### 3. Create the BM25 index and save it for future use

In [None]:
import bm25s
import Stemmer

# Tokenize texts
stemmer = Stemmer.Stemmer("english")
texts = [d["text"] for d in corpus]
corpus_tokens = bm25s.tokenize(texts, stopwords="en", stemmer=stemmer)

# Index (attach corpus so save/load keeps it)
retriever = bm25s.BM25(corpus=corpus)
retriever.index(corpus_tokens)

# Save the indices (corpus.jsonl will contain the table-ids/texts)
retriever.save("artifacts/bm25")

Finding newlines for mmindex: 100%|██████████| 71.1M/71.1M [00:00<00:00, 1.53GB/s]


In [36]:
import json

# EXAMPLE USAGE:

# 1. load retriever (load is a classmethod, so call it on the class)
new_retriever = bm25s.BM25.load("artifacts/bm25")

# 2. load corpus
with open("artifacts/bm25/corpus.jsonl", "r", encoding="utf-8") as f:
    corpus_new = [json.loads(line) for line in f]

# 3. ask a question, tokenize it
question = "What's the mining and quarrying production in Ireland?"
query_tokens = bm25s.tokenize(question, stemmer=stemmer)

# 4. get docs + scores (passing corpus=corpus_new returns the documents themselves; no return_as needed)
docs, scores = new_retriever.retrieve(query_tokens, k=10, corpus=corpus_new)

for i in range(docs.shape[1]):
    doc, score = docs[0, i], scores[0, i]
    print(f"Rank {i+1} (score: {score:.2f}): id={doc['id']} text={doc['text']}")

                                                     

Rank 1 (score: 8.05): id=CD907 text=Population Aged 15 Years and Over at Work - Census 2011 - Profile 9 - What we Know - A study of Education and Skills in Ireland - Statistic Sex Highest Level of Education Completed Detailed Industrial Group Population Aged 15 Years and Over at Work Both sexes Male Female Total education ceased and not ceased Total whose full-time education has ceased No formal education Primary Lower secondary Upper secondary Technical/vocational Advanced certificate/completed apprenticeship Higher certificate Ordinary bachelor degree/professional qualification or both Honours bachelor degree/professional qualification or both Postgraduate diploma or degree Doctorate (Ph.D.) Not stated Total whose full-time education has not ceased Growing of perennial and non-perennial crops plant propagation (011,012,013) Farming  of animals mixed farming (0141, 0142, 0144 to 0150) Hunting and agricultural related activities  (016,017) Forestry and logging (02) Fishing and aquacult
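Since this layer only exists to narrow the search space, the next step usually needs just the table ids of the retrieved documents. Below is a minimal sketch of that hand-off. The helper `candidate_ids`, its `min_score` threshold, and the toy rows are assumptions for illustration and are not part of `bm25s`.

```python
def candidate_ids(doc_row, score_row, min_score=0.0):
    """Return ids of retrieved documents scoring above min_score, in rank order."""
    return [doc["id"] for doc, score in zip(doc_row, score_row) if score > min_score]


# Toy stand-ins for one row of retrieved documents and their scores (made-up values).
doc_row = [
    {"id": "T1", "text": "mining and quarrying production"},
    {"id": "T2", "text": "industrial production index"},
    {"id": "T3", "text": "population at work"},
]
score_row = [3.2, 1.1, 0.0]

print(candidate_ids(doc_row, score_row))
# ['T1', 'T2']
```

With the variables from the cell above, the equivalent call would be `candidate_ids(docs[0], scores[0])`. Dropping zero-score entries assumes that a score of zero means no query term matched the document.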

