<a href="https://colab.research.google.com/github/MichalSlowakiewicz/RAG_hybrid_search/blob/master/rag_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://colab.research.google.com/github/MichalSlowakiewicz/RAG_hybrid_search/blob/master/RAG_project_fixed_with_colab.ipynb" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


In [None]:
!pip install -q sentence-transformers datasets rank_bm25 scikit-learn nltk tqdm

In [None]:
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
from rank_bm25 import BM25Okapi
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from tqdm.auto import tqdm
import math
import multiprocessing
nltk.download('punkt')
nltk.download('punkt_tab')

We use ***sentence-transformers/all-MiniLM-L6-v2***, a lightweight, freely available model that produces text embeddings efficiently. Although it is not the most powerful model available, it offers a good trade-off between quality and computational cost.


Each document is divided into overlapping **passages of 200 tokens** (NLTK word tokens). This defines the granularity of retrieval: smaller chunks capture finer-grained context but may lose coherence, while larger ones can bury short relevant spans in unrelated text.


To prevent boundary effects where relevant text spans fall between two chunks, we introduce an **overlap of 50 tokens**. This ensures contextual continuity between adjacent passages and reduces information loss during segmentation.
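
As a rough illustration of the windowing described above, the sketch below lists the token offsets each chunk would cover. The helper name `chunk_spans` is hypothetical and not part of the notebook; it assumes the window advances by `chunk_size - overlap` tokens, which matches the chunking code later in this notebook.

```python
def chunk_spans(n_tokens, chunk_size=200, overlap=50):
    """Return (start, end) token offsets of overlapping windows (illustrative helper)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # 150 tokens with the defaults
    spans = []
    start = 0
    while True:
        end = min(start + chunk_size, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break
        start += stride
    return spans

# A 450-token context yields three windows; neighbours share 50 tokens.
print(chunk_spans(450))  # [(0, 200), (150, 350), (300, 450)]
```

A context shorter than one window stays a single chunk, e.g. `chunk_spans(120)` gives `[(0, 120)]`.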

We retrieve the **top-3** passages for each query. This parameter controls the number of candidates used for evaluation metrics such as Recall@K and MRR@K. Increasing it may improve recall but also increases computational cost.
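
To make the two metrics concrete, here is a minimal sketch on made-up ranks (the function name and the toy data are assumptions for illustration only). It follows the convention used by the evaluation code in this notebook: a query counts as a hit if any of its top-K passages contains an accepted answer, and its reciprocal rank is 1 divided by the rank of the first such passage (0 if there is none).

```python
def recall_and_mrr_at_k(first_hit_ranks, k=3):
    """first_hit_ranks: 1-based rank of the first relevant passage per query, or None."""
    n = len(first_hit_ranks)
    # keep only queries whose first relevant passage is within the top-k
    hits = [r for r in first_hit_ranks if r is not None and r <= k]
    recall = len(hits) / n
    mrr = sum(1.0 / r for r in hits) / n
    return recall, mrr

# Toy example: four queries, one of which has no relevant passage in the top 3.
recall, mrr = recall_and_mrr_at_k([1, 2, None, 1], k=3)
print(recall, mrr)  # 0.75 0.625
```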


To accelerate experiments, we restrict **evaluation to 200 randomly selected questions** from the validation set. Setting `EVAL_SAMPLE` to `None` runs the evaluation on the full dataset, yielding more stable but slower results.


We set the parameter **$\alpha$ to 0.65**, the same value as in the paper (https://aclanthology.org/2025.regnlp-1.5v2.pdf). $\alpha$ is the weight of the embedding similarity in the final score, while $1 - \alpha$ is the weight of the BM25 score. Later we vary $\alpha$ to look for the value that performs best.
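
The weighting can be sketched on toy numbers as follows. The helper `fuse` and the score values are invented for illustration; the min-max normalization mirrors what the hybrid retriever in this notebook does per query so that BM25 scores and cosine similarities share the same [0, 1] range before mixing.

```python
import numpy as np

def minmax(x):
    """Rescale scores to [0, 1]; constant vectors map to 0.5."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span < 1e-8:
        return np.full_like(x, 0.5)
    return (x - x.min()) / span

def fuse(em_scores, bm_scores, alpha=0.65):
    """Weighted sum: alpha * embedding score + (1 - alpha) * BM25 score."""
    return alpha * minmax(em_scores) + (1 - alpha) * minmax(bm_scores)

em = [0.2, 0.8, 0.5]    # toy cosine similarities for three passages
bm = [10.0, 0.0, 5.0]   # toy BM25 scores for the same passages
final = fuse(em, bm)
print(final)  # approximately [0.35, 0.65, 0.5]
```

With $\alpha = 0.65$ the second passage wins because the embedding score carries more weight; with `alpha=0.0` the ranking would follow BM25 alone.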

In [None]:
# Setting parameters
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
CHUNK_SIZE = 200
CHUNK_OVERLAP = 50
TOP_K = 3
EVAL_SAMPLE = 200
ALPHA = 0.65
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

In [None]:
# Loading dataset
ds = load_dataset("squad", split="validation")
print("Number of examples:", len(ds))

if EVAL_SAMPLE is not None and EVAL_SAMPLE < len(ds):
    ds = ds.shuffle(seed=RANDOM_SEED).select(range(EVAL_SAMPLE))
    print("Number of examples used in experiment:", len(ds))

Number of examples: 10570
Number of examples used in experiment: 200


In [None]:
# Examples
for i in range(5):
    print(f"Example {i+1}")
    print("Question:", ds[i]["question"])
    print("Answer:", ds[i]["answers"]["text"][0])
    print("Context:", ds[i]["context"])
    print("-" * 80)

Example 1
Question: In what year did Massachusetts first require children to be educated in schools?
Answer: 1852
Context: Private schooling in the United States has been debated by educators, lawmakers and parents, since the beginnings of compulsory education in Massachusetts in 1852. The Supreme Court precedent appears to favor educational choice, so long as states may set standards for educational accomplishment. Some of the most relevant Supreme Court case law on this is as follows: Runyon v. McCrary, 427 U.S. 160 (1976); Wisconsin v. Yoder, 406 U.S. 205 (1972); Pierce v. Society of Sisters, 268 U.S. 510 (1925); Meyer v. Nebraska, 262 U.S. 390 (1923).
--------------------------------------------------------------------------------
Example 2
Question: When were stromules discovered?
Answer: 1962
Context: The chloroplast membranes sometimes protrude out into the cytoplasm, forming a stromule, or stroma-containing tubule. Stromules are very rare in chloroplasts, and are much more comm

In [None]:
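def tokenize_for_chunking(text):
    # Word-level tokens from NLTK; chunk sizes are counted in these tokens
    return word_tokenize(text)

def chunk_context(context, chunk_size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Split a context into overlapping windows of at most chunk_size tokens."""
    if not 0 <= overlap < chunk_size:
        # otherwise the window would never advance
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    tokens = tokenize_for_chunking(context)
    if len(tokens) <= chunk_size:
        return [" ".join(tokens)]
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunks.append(" ".join(tokens[start:end]))
        if end >= len(tokens):
            break
        # step back by `overlap` tokens so adjacent chunks share context
        start = end - overlap
    return chunks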

# Building corpus of passages
passages = []
for i, ex in enumerate(tqdm(ds)):
    context = ex['context']
    chunks = chunk_context(context)
    for c in chunks:
        passages.append({"passage": c, "source_qid": i, "example": ex})
print("Number of passages:", len(passages))

Number of passages: 233


In [None]:
# Lowercase and tokenize each passage for BM25
tokenized_corpus = [word_tokenize(p['passage'].lower()) for p in passages]

# Build the BM25 index
bm25 = BM25Okapi(tokenized_corpus)

In [None]:
# Model for creating embedding vectors
model = SentenceTransformer(EMBEDDING_MODEL)

texts = [p['passage'] for p in passages]

# Creating embedding vectors
passage_embeddings = model.encode(texts, show_progress_bar=True, convert_to_numpy=True, batch_size=64)
print("Embeddings shape:", passage_embeddings.shape)

# index for finding nearest neighbours according to cosine distance
nn = NearestNeighbors(n_neighbors=TOP_K, metric='cosine', n_jobs=-1)
nn.fit(passage_embeddings)

Embeddings shape: (233, 384)


In [None]:
def retrieve_bm25(query, top_k=TOP_K):
    tokens = word_tokenize(query.lower())
    scores = bm25.get_scores(tokens)  # length = num passages
    top_idx = np.argsort(scores)[-top_k:][::-1]
    top_scores = scores[top_idx]
    return list(zip(top_idx.tolist(), top_scores.tolist()))

def retrieve_embedding(query, top_k=TOP_K):
    q_emb = model.encode([query], convert_to_numpy=True)
    # sklearn NearestNeighbors with metric='cosine' returns distances; similarity = 1 - distance
    dists, idxs = nn.kneighbors(q_emb, n_neighbors=top_k)
    dists = dists[0]
    idxs = idxs[0]
    sims = 1.0 - dists  # convert to similarity
    return list(zip(idxs.tolist(), sims.tolist()))

"""
def retrieve_hybrid(query, alpha=ALPHA, top_k=TOP_K):
    # get a larger candidate set (union of BM25 top_k and embedding top_k), then combine
    bm = retrieve_bm25(query, top_k=top_k)
    em = retrieve_embedding(query, top_k=top_k)
    candidate_ids = list({i for i,_ in bm} | {i for i,_ in em})
    bm_scores_all = []
    em_scores_all = []
    # collect raw scores
    for pid in candidate_ids:
        # BM25 score
        bm_score = bm25.get_scores(word_tokenize(query.lower()))[pid]
        bm_scores_all.append(bm_score)
        # emb score
        q_emb = model.encode([query], convert_to_numpy=True)
        emb_sim = cosine_similarity(q_emb, passage_embeddings[pid:pid+1])[0][0]
        em_scores_all.append(emb_sim)
    bm_scores = np.array(bm_scores_all)
    em_scores = np.array(em_scores_all)
    # normalize both to [0,1] per-query to make them comparable (min-max)
    def minmax(x):
        if len(x)==0:
            return x
        xmin, xmax = x.min(), x.max()
        if xmax - xmin < 1e-8:
            return np.ones_like(x)*0.5
        return (x - xmin) / (xmax - xmin)
    bm_norm = minmax(bm_scores)
    em_norm = minmax(em_scores)
    final_scores = alpha * em_norm + (1-alpha) * bm_norm
    # sort
    order = np.argsort(final_scores)[::-1]
    top_order = order[:top_k]
    results = [(candidate_ids[i], float(final_scores[i])) for i in top_order]
    return results
    """

In [None]:
def retrieve_hybrid(query, alpha=ALPHA, top_k=TOP_K):
    tokens = word_tokenize(query.lower())
    bm_scores = bm25.get_scores(tokens)
    q_emb = model.encode([query], convert_to_numpy=True)
    # cosine similarity between the query and every passage in one call
    em_scores = cosine_similarity(q_emb, passage_embeddings)[0]

    def minmax(x):
        xmin, xmax = x.min(), x.max()
        if xmax - xmin < 1e-8:
            return np.ones_like(x) * 0.5
        return (x - xmin) / (xmax - xmin)

    bm_norm = minmax(bm_scores)
    em_norm = minmax(em_scores)
    final_scores = alpha * em_norm + (1 - alpha) * bm_norm
    top_idx = np.argsort(final_scores)[-top_k:][::-1]
    results = [(int(i), float(final_scores[i])) for i in top_idx]
    return results


In [None]:
def normalize_text(s):
    return " ".join(word_tokenize(s.lower()))

def passage_contains_answer(passage_text, answers_list):
    p = normalize_text(passage_text)
    for a in answers_list:
        if normalize_text(a) in p:
            return True
    return False

In [None]:
def evaluate_retriever(retrieve_fn, queries_dataset, top_k=TOP_K):
    hits_at_k = 0
    rr_total = 0.0
    n = len(queries_dataset)
    for qi, ex in enumerate(tqdm(queries_dataset)):
        q_text = ex['question']
        answers = ex['answers']['text']  # list of acceptable answers
        retrieved = retrieve_fn(q_text, top_k=top_k)
        found = False
        rr = 0.0
        for rank, (pid, score) in enumerate(retrieved, start=1):
            if passage_contains_answer(passages[pid]['passage'], answers):
                found = True
                rr = 1.0 / rank
                break
        hits_at_k += int(found)
        rr_total += rr
    recall = hits_at_k / n
    mrr = rr_total / n
    return {"recall@{}".format(top_k): recall, "mrr@{}".format(top_k): mrr, "n": n}

# Evaluating for BM25, Embedding, Hybrid scoring
sample_ds = ds
print("Eval queries:", len(sample_ds))

res_bm25 = evaluate_retriever(retrieve_bm25, sample_ds, top_k=TOP_K)
res_emb  = evaluate_retriever(retrieve_embedding, sample_ds, top_k=TOP_K)
res_hyb  = evaluate_retriever(lambda q, top_k: retrieve_hybrid(q, alpha=ALPHA, top_k=top_k), sample_ds, top_k=TOP_K)

print("BM25:", res_bm25)
print("Embedding:", res_emb)
print("Hybrid (alpha={}):".format(ALPHA), res_hyb)

Eval queries: 200


BM25: {'recall@3': 0.905, 'mrr@3': 0.8250000000000002, 'n': 200}
Embedding: {'recall@3': 0.93, 'mrr@3': 0.8566666666666667, 'n': 200}
Hybrid (alpha=0.65): {'recall@3': 0.955, 'mrr@3': 0.91, 'n': 200}


In [None]:
alphas = [0.0, 0.25, 0.5, 0.65, 0.8, 1.0]  # 0.0 = pure BM25, 1.0 = pure semantic
results_by_alpha = {}
for a in alphas:
    res = evaluate_retriever(lambda q, top_k: retrieve_hybrid(q, alpha=a, top_k=top_k), sample_ds, top_k=TOP_K)
    results_by_alpha[a] = res
    print("alpha", a, res)

alpha 0.0 {'recall@3': 0.905, 'mrr@3': 0.8250000000000002, 'n': 200}
alpha 0.25 {'recall@3': 0.94, 'mrr@3': 0.8766666666666668, 'n': 200}
alpha 0.5 {'recall@3': 0.96, 'mrr@3': 0.9050000000000001, 'n': 200}
alpha 0.65 {'recall@3': 0.955, 'mrr@3': 0.91, 'n': 200}
alpha 0.8 {'recall@3': 0.955, 'mrr@3': 0.915, 'n': 200}
alpha 1.0 {'recall@3': 0.93, 'mrr@3': 0.8566666666666667, 'n': 200}


In [None]:

import csv
with open("hybrid_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["alpha", "recall@{}".format(TOP_K), "mrr@{}".format(TOP_K)])
    for a, r in results_by_alpha.items():
        writer.writerow([a, r["recall@{}".format(TOP_K)], r["mrr@{}".format(TOP_K)]])
print("Saved hybrid_results.csv")

Saved hybrid_results.csv
