# Hybrid Search

Hybrid search combines traditional keyword-based (sparse) search with semantic (dense) search to return more accurate and relevant results. In the RAG application, it helps find research articles that match a user query both by exact keywords and by meaning, which is particularly useful for complex queries involving nuanced concepts, synonyms, and related ideas.

![Hybrid Search](images/Hybrid_Search.png)


This notebook walks through the implementation of hybrid search in the RAG application: building a sparse (keyword) index and a dense (semantic) index, merging their results, and using the merged results to generate a reply.

Here are the steps:
* [Loading chunked dataset](#loading-the-chunks-from-the-previous-steps)
* [Sparse Index](#hybrid-search---sparse-index)
* [Dense Index](#hybrid-search---dense-index)
* [Merging Results](#hybrid-search---merging-results)
* [Generating a reply with merged results](#using-merged-results-to-generate-a-reply)



In [1]:
import warnings

# Suppress warnings
warnings.filterwarnings('ignore')

## Hybrid Search - Sparse Index

We will use a BM25-based sparse index to complement the semantic search provided by the vector database.

In [5]:
import bm25s
from bm25s.tokenization import Tokenizer, Tokenized
import Stemmer  # optional: for stemming

### Loading the chunks from the previous steps

We will use the chunks from the AI Arxiv dataset that we used before. These chunks were split using semantic chunking and enriched with context.

In [7]:
import json
# Load the enriched chunks and close the file when done
with open('../data/corpus.json') as f:
    corpus_json = json.load(f)

### Creating the Sparse Index

We will use an in-memory index using BM25. Many (vector) databases support BM25 natively, and many others support indexing and searching on calculated sparse vectors.

In this example, we will also define a stemmer and stop-words to clean up the text and better select the tokens/terms that will be indexed in the sparse index.

In [8]:
corpus_text = [doc["text"] for doc in corpus_json]

# optional: create a stemmer
english_stemmer = Stemmer.Stemmer("english")

# Initialize the Tokenizer with the stemmer
sparse_tokenizer = Tokenizer(
    stemmer=english_stemmer,
    lower=True, # lowercase the tokens
    stopwords="english",  # or pass a list of stopwords
    splitter=r"\w+",  # by default r"(?u)\b\w\w+\b", can also be a function
)

In [9]:
print(sparse_tokenizer.stopwords)

('a', 'an', 'and', 'are', 'as', 'at', 'be', 'but', 'by', 'for', 'if', 'in', 'into', 'is', 'it', 'no', 'not', 'of', 'on', 'or', 'such', 'that', 'the', 'their', 'then', 'there', 'these', 'they', 'this', 'to', 'was', 'will', 'with')


In [10]:
# Tokenize the corpus and only keep the ids (faster and saves memory)
corpus_sparse_tokens = (
    sparse_tokenizer
    .tokenize(
        corpus_text, 
        update_vocab=True, # update the vocab as we tokenize
        return_as="ids"
    )
)

# Create the BM25 retriever and attach your corpus_json to it
sparse_index = bm25s.BM25(corpus=corpus_json)
# Now, index the corpus_tokens (the corpus_json is not used yet)
sparse_index.index(corpus_sparse_tokens)
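
To make the scoring behind the sparse index more concrete, here is a minimal pure-Python sketch of a textbook BM25 formula. It is not the `bm25s` implementation, whose exact variant and defaults may differ; the helper name `bm25_score`, the toy corpus, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a tokenized query (textbook BM25)."""
    num_docs = len(corpus)
    avg_doc_len = sum(len(d) for d in corpus) / num_docs
    term_freqs = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        docs_with_term = sum(1 for d in corpus if term in d)
        tf = term_freqs[term]
        if docs_with_term == 0 or tf == 0:
            continue  # the term contributes nothing to this document
        # Rare terms get a higher inverse document frequency
        idf = math.log(1 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
        # Term frequency saturates with k1 and is normalized by document length via b
        length_norm = 1 - b + b * len(doc_terms) / avg_doc_len
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


# Toy, already-tokenized corpus (hypothetical data, not the notebook's chunks)
toy_corpus = [
    ["mixtral", "context", "size", "32k", "tokens"],
    ["llama", "benchmark", "results"],
    ["context", "length", "perplexity"],
]
toy_query = ["context", "size", "mixtral"]
toy_bm25_scores = [bm25_score(toy_query, doc, toy_corpus) for doc in toy_corpus]
print(toy_bm25_scores)
```

The first toy document matches all three query terms and scores highest, while the second matches none and scores zero.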

                                                            

In [11]:
vocab_dict = sparse_tokenizer.get_vocab_dict()
print(f"The tokenizer vocabulary includes {len(vocab_dict)} tokens/terms")

focus_token = 'context'
focus_token_index = vocab_dict.get(focus_token)
print(f"The index of the {focus_token} is {focus_token_index}")

The tokenizer vocabulary includes 1690 tokens/terms
The index of the context is 128


The tokenizer can encode (convert the text into ids) and decode (convert the ids back into text).

In [12]:
print(sparse_tokenizer.decode([[focus_token_index]]))

[['context']]


### Exploring the Sparse Index

In [13]:
print(sparse_index.scores)

{'data': array([0.73858595, 0.75115275, 1.0805569 , ..., 1.6571839 , 1.6571839 ,
       1.6571839 ], shape=(4039,), dtype=float32), 'indices': array([ 0,  9, 10, ..., 45, 45, 45], shape=(4039,), dtype=int32), 'indptr': array([   0,    0,   12, ..., 4037, 4038, 4039],
      shape=(1691,), dtype=int32), 'num_docs': 46}


For each token, the index holds the list of documents (chunks) that include it, and the score of that token in that document (chunk).
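
The printed `scores` dictionary appears to follow a compressed sparse layout: `indptr[t]` and `indptr[t + 1]` delimit the slice of `indices` (document ids) and `data` (scores) that belongs to token `t`. The sketch below reproduces that lookup on a tiny hand-made structure; the values and the helper names `postings_for_token` and `score_query` are made up for illustration and are not part of `bm25s`.

```python
# Toy index with 3 tokens and 3 documents (hypothetical values)
toy_index = {
    "data": [0.5, 1.2, 0.7, 0.9],   # score of a token in a document
    "indices": [0, 2, 1, 2],        # document id for each score
    "indptr": [0, 2, 3, 4],         # slice boundaries, one entry per token plus one
    "num_docs": 3,
}


def postings_for_token(index, token_id):
    """Return the (doc_id, score) pairs stored for one token."""
    start = index["indptr"][token_id]
    end = index["indptr"][token_id + 1]
    return list(zip(index["indices"][start:end], index["data"][start:end]))


def score_query(index, query_token_ids):
    """Sum the per-token scores of every document for a tokenized query."""
    totals = [0.0] * index["num_docs"]
    for token_id in query_token_ids:
        for doc_id, score in postings_for_token(index, token_id):
            totals[doc_id] += score
    return totals


print(postings_for_token(toy_index, 0))
print(score_query(toy_index, [0, 2]))
```

Summing the per-token slices in this way is the core of a sparse query: only documents that contain at least one query token receive a non-zero score.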

In [15]:
from IPython.display import Markdown

In [22]:
token_index = vocab_dict.get(focus_token)
print(f"Index of the token `{focus_token}` in the BM25 retriever: {token_index}")
score_index = sparse_index.scores.get('indptr')[token_index]
next_score_index = sparse_index.scores.get('indptr')[token_index+1]


print("Document ID", "Score")
max_score = max(sparse_index.scores['data'][score_index:next_score_index])

for i in range(score_index, next_score_index):
    doc_id = sparse_index.scores['indices'][i]
    doc_score = sparse_index.scores['data'][i]
    if doc_score == max_score:
        display(Markdown(f"{doc_id}: **{doc_score}**"))
    else:
        print(doc_id, doc_score)

Index of the token `context` in the BM25 retriever: 128
Document ID Score
0 0.4434834
2 0.7355118
3 0.5371069
4 0.90847206
13 0.5116058
14 0.9670208


15: **1.105641484260559**

30 0.7444794
37 0.6708645
41 0.7355118


### Searching the Sparse Index

As we will do for the dense index, we need to tokenize and encode the query text:

In [23]:
# Tokenize the query with the same tokenizer, without updating the vocabulary
query = "What is context size of Mixtral?"
query_tokens = (
    sparse_tokenizer
    .tokenize(
        [query], 
        update_vocab=False, 
        return_as="ids"
    )
)

print(query_tokens)

                                                     

[[128, 129, 16]]




And use the encoded query to search the sparse index:

In [24]:
# Query the corpus
sparse_results, sparse_scores = sparse_index.retrieve(query_tokens, k=10)

for i in range(sparse_results.shape[1]):
    doc, score = sparse_results[0, i], sparse_scores[0, i]
    print(f"Rank {i+1} (score: {score:.2f}): {doc}")

                                                     

Rank 1 (score: 1.99): {'id': 2, 'text': 'expertsâ ) to process the token and combine their output additively. This technique increases the number of parameters of a model while controlling cost and latency, as the model only uses a fraction of the total set of parameters per token. Mixtral is pretrained with multilingual data using a context size of 32k tokens. It either matches or exceeds the performance of Llama 2 70B and GPT-3.5, over several benchmarks. In particular, Mixture of Experts Layer i gating inputs af outputs router expert\n\nThis chunk describes the key architectural details of the Mixtral model, a sparse mixture-of-experts language model that outperforms larger models like Llama 2 70B and GPT-3.5 on various benchmarks.', 'metadata': {'title': 'Mixtral of Experts', 'arxiv_id': '2401.04088', 'references': ['1905.07830']}}
Rank 2 (score: 1.86): {'id': 14, 'text': "Active Params French Arc-c HellaS MMLU German Arc-c HellaS MMLU Spanish Arc-c HellaS MMLU Italian Arc-c HellaS



## Hybrid Search - Dense Index

For the hybrid search, we also need a dense index in the vector database, as we used in the previous steps.

### Creating the Dense Index

In [25]:
from qdrant_client import QdrantClient
from qdrant_client.http import models
from sentence_transformers import SentenceTransformer

qdrant_client = QdrantClient(
    ":memory:"
) 

# Create the embedding encoder
dense_encoder = SentenceTransformer('all-MiniLM-L6-v2') # Model to create embeddings

In [26]:
collection_name = "hybrid_search"

dense_index = qdrant_client.recreate_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=dense_encoder.get_sentence_embedding_dimension(),  # Vector size is defined by the model used
        distance=models.Distance.COSINE
    )
)
print(dense_index)

True


In [27]:
# vectorize!
qdrant_client.upload_points(
    collection_name=collection_name,
    points=[
        models.PointStruct(
            id=idx,
            vector=dense_encoder.encode(doc["text"]).tolist(),
            payload=doc
        ) for idx, doc in enumerate(corpus_json) # corpus_json is the variable holding all the enriched texts
    ]
)

### Searching the Dense Index

We start by encoding the query with the dense encoder:

In [28]:
query_vector = dense_encoder.encode(query).tolist()

And use the encoded query to search the dense index:

In [29]:
dense_results = qdrant_client.search(
    collection_name=collection_name,
    query_vector=query_vector,
    limit=10
)
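
Because the collection was created with `Distance.COSINE`, the dense search ranks chunks by the cosine similarity between the query vector and each chunk vector. The following is a brute-force pure-Python sketch of that ranking on toy three-dimensional vectors; the helper names and the vectors are illustrative assumptions, and a vector database would typically use a more efficient index than this exhaustive loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query_vec, doc_vecs, limit=2):
    """Score every stored vector and return the `limit` most similar ids."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# Toy embeddings (hypothetical values, far smaller than real model embeddings)
toy_vectors = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.7, 0.7, 0.0],
    "c": [0.0, 0.0, 1.0],
}
toy_hits = brute_force_search([1.0, 0.1, 0.0], toy_vectors)
print(toy_hits)
```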

In [32]:
for result in dense_results:
    print("id: ", result.id)
    print("text: ", result.payload["text"])
    # Dense results only carry the dense (cosine similarity) score
    print("dense_score: ", result.score)
    print("--------------------------------")

id:  15
text:  3.8 â Mixtral_8x7B 3.5 32 > $3.0 i] 228 fos a 2.0 0 5k 10k 15k 20k 25k 30k Context length Passkey Performance ry 3.8 â Mixtral_8x7B 3.5 0.8 32 > 0.6 $3.0 i] 228 04 fos 0.2 a 2.0 0.0 OK 4K 8K 12K 16K 20K 24K 28K 0 5k 10k 15k 20k 25k 30k Seq Len Context length Figure 4: Long range performance of Mixtral. (Left) Mixtral has 100% retrieval accuracy of the Passkey task regardless of the location of the passkey and length of the input sequence. (Right) The perplexity of Mixtral on the proof-pile dataset decreases monotonically as the context length increases.

The chunk discusses the long-range performance of the Mixtral model, demonstrating its ability to retrieve a passkey regardless of its location in a long input sequence, and showing that the model's perplexity on the proof-pile dataset decreases as the context length increases.
dense_score:  0.618097593406871
--------------------------------
id:  4
text:  Instruct under the Apache 2.0 lic

## Hybrid Search - Merging Results

There are a few options to merge the results from the two methods (sparse and dense). In this notebook, we will use a simple weighted average.

In [33]:
documents_with_scores = []
for hit in dense_results:
    doc_id = hit.payload["id"]
    doc_text = next((doc for doc in corpus_json if doc["id"] == doc_id), None)["text"]
    doc_dense_score = hit.score
    documents_with_scores.append({
        "id": doc_id,
        "text": doc_text,
        "dense_score": doc_dense_score
    })

# Attach the sparse score to documents that also appear in the dense results
for result, doc_sparse_score in zip(sparse_results[0], sparse_scores[0]):
    doc_id = result["id"]
    for doc in documents_with_scores:
        if doc["id"] == doc_id:
            doc["sparse_score"] = doc_sparse_score
            break
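
Note that the merge above starts from the dense hits, so a chunk that appears only in the sparse top-k is not added to `documents_with_scores`. If sparse-only hits should also compete, one option is to merge by id over the union of both result lists. The sketch below shows the idea on toy `(id, score)` pairs; `merge_by_id` is a hypothetical helper and the scores are illustrative.

```python
def merge_by_id(dense_hits, sparse_hits):
    """Union of two result lists keyed by id; each hit is an (id, score) pair."""
    merged = {}
    for doc_id, score in dense_hits:
        merged.setdefault(doc_id, {"id": doc_id})["dense_score"] = score
    for doc_id, score in sparse_hits:
        merged.setdefault(doc_id, {"id": doc_id})["sparse_score"] = score
    return list(merged.values())


# Toy results (hypothetical scores)
toy_dense = [(15, 0.62), (4, 0.55)]
toy_sparse = [(2, 1.99), (15, 1.33)]
toy_merged = merge_by_id(toy_dense, toy_sparse)
print(toy_merged)
```

Documents found by only one index simply lack the other score, which a later step can treat as zero with `doc.get(..., 0)`.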




In [36]:
documents_with_scores[-1]

{'id': 11,
 'text': 'Mistral 78 % 2681 Mistral 78 3 3 s0 5 = A % 66 50 g 4 45 64 78 138 348708 78 138 348708 78 138 348 70B S66 Mixtral 8x7B 50 Mixtral 8x7B 5 = 564 340 g al Mistral 78 ee Mistral 78 3 5 Â§ 30 5 eo â = Mistral Â° 20 â e LlaMA2 78 (138 348 70B 7B (138 348 708 7B Â«13B 34B 708 Active Params Active Params Active Params Figure 3: Results on MMLU, commonsense reasoning, world knowledge and reading comprehension, math and code for Mistral (7B/8x7B) vs Llama 2 (7B/13B/70B). Mixtral largely outperforms Llama 2 70B on all benchmarks, except on reading comprehension benchmarks while using 5x lower active parameters. It is also vastly superior to Llama 2 70B on code and math. Detailed results for Mixtral, Mistral 7B and Llama 2 7B/13B/70B and Llama 1 34B2 are reported in Table 2. Figure 2 compares the performance of Mixtral with the Llama models in different categories. Mixtral surpasses Llama 2 70B across most metrics. In particular, Mixtral displays a superior performance in cod

In [39]:
len(documents_with_scores)

10

In [42]:
for doc in documents_with_scores:
    print("id: ", doc["id"])
    print("text: ", doc["text"])
    print("dense_score: ", doc["dense_score"])
    # Only documents that also appeared in the sparse top-k have a sparse score
    if "sparse_score" in doc:
        print("sparse_score: ", doc["sparse_score"])
    print("--------------------------------")

id:  15
text:  3.8 â Mixtral_8x7B 3.5 32 > $3.0 i] 228 fos a 2.0 0 5k 10k 15k 20k 25k 30k Context length Passkey Performance ry 3.8 â Mixtral_8x7B 3.5 0.8 32 > 0.6 $3.0 i] 228 04 fos 0.2 a 2.0 0.0 OK 4K 8K 12K 16K 20K 24K 28K 0 5k 10k 15k 20k 25k 30k Seq Len Context length Figure 4: Long range performance of Mixtral. (Left) Mixtral has 100% retrieval accuracy of the Passkey task regardless of the location of the passkey and length of the input sequence. (Right) The perplexity of Mixtral on the proof-pile dataset decreases monotonically as the context length increases.

The chunk discusses the long-range performance of the Mixtral model, demonstrating its ability to retrieve a passkey regardless of its location in a long input sequence, and showing that the model's perplexity on the proof-pile dataset decreases as the context length increases.
dense_score:  0.618097593406871
sparse_score:  1.333729
--------------------------------
id:  4
text:  Instruct under the Apache 2.0 license1, fr

We will normalize the scores of each index and then calculate a weighted score that gives more weight (0.8) to the dense index.

In [43]:
import numpy as np

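# Collect the two types of scores; documents missing from the sparse top-k get a sparse score of 0.
# Separate names keep the `sparse_scores` array returned by the retriever intact.
dense_score_values = np.array([doc.get("dense_score", 0) for doc in documents_with_scores])
sparse_score_values = np.array([doc.get("sparse_score", 0) for doc in documents_with_scores])

# Min-max normalize each score type to the [0, 1] range
dense_scores_normalized = (dense_score_values - np.min(dense_score_values)) / (np.max(dense_score_values) - np.min(dense_score_values))
sparse_scores_normalized = (sparse_score_values - np.min(sparse_score_values)) / (np.max(sparse_score_values) - np.min(sparse_score_values))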

# Calculate a weighted score with alpha of 0.2 to the sparse score
alpha = 0.2
weighted_scores = (1 - alpha) * dense_scores_normalized + alpha * sparse_scores_normalized

# Pick up the top 3 documents with the weighted score
top_docs = sorted(
    zip(
        documents_with_scores, 
        weighted_scores
    ), 
    key=lambda x: x[1], 
    reverse=True
)[:3]
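
A weighted average is only one of the merging options mentioned at the start of this section. Another widely used option is reciprocal rank fusion (RRF), which ignores the raw scores and combines ranks instead, so no score normalization is needed. The sketch below is not used elsewhere in this notebook; the function name, the constant `k=60` (a commonly used default), and the toy rankings are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; each list is ordered best first."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Toy rankings of document ids (hypothetical)
toy_dense_ranking = [15, 4, 2, 11]
toy_sparse_ranking = [2, 15, 14]
toy_fused = reciprocal_rank_fusion([toy_dense_ranking, toy_sparse_ranking])
print([doc_id for doc_id, _ in toy_fused])
```

Documents ranked well by both indexes rise to the top, while documents found by only one index still receive a smaller score.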



In [46]:
top_docs

[({'id': 15,
   'text': "3.8 â Mixtral_8x7B 3.5 32 > $3.0 i] 228 fos a 2.0 0 5k 10k 15k 20k 25k 30k Context length Passkey Performance ry 3.8 â Mixtral_8x7B 3.5 0.8 32 > 0.6 $3.0 i] 228 04 fos 0.2 a 2.0 0.0 OK 4K 8K 12K 16K 20K 24K 28K 0 5k 10k 15k 20k 25k 30k Seq Len Context length Figure 4: Long range performance of Mixtral. (Left) Mixtral has 100% retrieval accuracy of the Passkey task regardless of the location of the passkey and length of the input sequence. (Right) The perplexity of Mixtral on the proof-pile dataset decreases monotonically as the context length increases.\n\nThe chunk discusses the long-range performance of the Mixtral model, demonstrating its ability to retrieve a passkey regardless of its location in a long input sequence, and showing that the model's perplexity on the proof-pile dataset decreases as the context length increases.",
   'dense_score': 0.618097593406871,
   'sparse_score': np.float32(1.333729)},
  np.float64(0.933914139866595)),
 ({'id': 4,
   'te

## Using merged results to generate a reply

We can now take the merged results and call the LLM to generate the reply to the user's query.

In [47]:
# define a variable to hold the search results for the generation model
search_results = [doc[0]['text'] for doc in top_docs]

In [48]:
from dotenv import load_dotenv

load_dotenv()

True

In [49]:
# Now time to connect to the large language model
from openai import OpenAI

client = OpenAI()
model_name = "gpt-4.1-mini-2025-04-14"
completion = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "system", "content": "You are a chatbot and a research expert. Your top priority is to help guide users to understand research papers."},
        {"role": "user", "content": query},
        {"role": "assistant", "content": str(search_results)}
    ]
)

response_text = str(completion.choices[0].message.content)
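
The call above passes the retrieved chunks as an `assistant` message. A common alternative is to place them in the user message, clearly separated from the question. The sketch below only builds such a message list and does not call any API; `build_rag_messages` and the prompt wording are hypothetical.

```python
def build_rag_messages(question, chunks):
    """Build chat messages that present numbered context chunks before the question."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return [
        {
            "role": "system",
            "content": "Answer using only the provided context. Say so if the context is insufficient.",
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        },
    ]


toy_messages = build_rag_messages(
    "What is context size of Mixtral?",
    ["Mixtral is pretrained with multilingual data using a context size of 32k tokens."],
)
print(toy_messages[1]["content"])
```

Numbering the chunks makes it easier to ask the model to cite which chunk supported its answer.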

In [51]:
print("hybrid search reply to ", query,":", response_text)

hybrid search reply to  What is context size of Mixtral? : The context size of Mixtral is 32,000 tokens (32k tokens).


Finally, we save the retrieved documents so they can be used in the next notebook on reranking, which demonstrates a more advanced method to merge hybrid search results.

In [52]:
import json

with open('../data/dense_results.json', 'w') as f:
    json.dump([dense_result.payload for dense_result in dense_results], f, default=str)

with open('../data/sparse_results.json', 'w') as f:
    json.dump([sparse_result for sparse_result in sparse_results[0]], f, default=str)

