# MILVUS Demo - Multi Vector Hybrid Search with Reranking using MilvusClient

# Hybrid (dense+sparse embedding) Search with Milvus in watsonx.data

## Disclaimers
- Use only Projects and Spaces that are available in the watsonx context.

## Overview
### Audience
This notebook demonstrates how to implement advanced search techniques using a hybrid search approach combined with reranking strategies.
The scenario presented in this notebook assumes you have a dataset requiring effective search capabilities, where a combination of traditional search (e.g., keyword-based) and modern embedding-based techniques is leveraged to improve retrieval quality. The reranking process further enhances the relevance of results by applying a secondary scoring mechanism to the initial search output.
This notebook is designed for people interested in building or improving search systems. 

Some familiarity with Python programming, search algorithms, and basic machine learning concepts is recommended. The code runs with Python 3.10 or later.
### Learning goal
This notebook demonstrates similarity search in watsonx.data using a hybrid approach that combines dense and sparse embeddings. It introduces commands for:
- Connecting to Milvus
- Creating collections
- Creating indexes
- Generating dense embeddings
- Generating sparse embeddings
- Ingesting data
- Retrieving data


### About Milvus 

Milvus is an open-source vector database designed specifically for scalable similarity search and AI applications. It's a powerful platform that enables efficient storage, indexing, and retrieval of vector embeddings, which are crucial in modern machine learning and artificial intelligence tasks. To learn more, visit the [Milvus documentation](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=components-milvus).

### Milvus: Three Fundamental Steps

#### 1. Data Preparation
Collect and convert your data into high-dimensional vector embeddings. These vectors are typically generated using machine learning models like neural networks, which transform text, images, audio, or other data types into dense numerical representations that capture semantic meaning and relationships.

#### 2. Vector Insertion
Load the dense and sparse vector embeddings into Milvus collections or partitions within a database. Milvus builds indexes to optimize subsequent search operations and supports various indexing algorithms, such as IVF_FLAT and HNSW, depending on the index definition.

#### 3. Similarity Search
Perform vector similarity searches by providing query vectors and reranking weights. Milvus rapidly returns the most similar vectors from the collection or partitions, based on the configured metric (cosine similarity, Euclidean distance, or inner product) and the reranking weights.
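
As a small illustration of the three metrics named above, the following pure-Python sketch computes inner product, Euclidean (L2) distance, and cosine similarity for two toy vectors. The function names and the vectors are made up for this example; Milvus computes these metrics internally on the indexed vectors.

```python
import math

def inner_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    """Euclidean distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Inner product of the two vectors after scaling each to unit length."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)

# Toy vectors for illustration only.
query = [1.0, 0.0, 1.0]
doc = [0.5, 0.5, 1.0]

print("IP:", inner_product(query, doc))
print("L2:", l2_distance(query, doc))
print("cosine:", cosine_similarity(query, doc))
```

Note that the direction differs between metrics: for inner product and cosine similarity a higher value is a better match, while for L2 distance a lower value is better.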

### Why Hybrid search?
Hybrid vector search leverages the combined strengths of sparse and dense vectors to deliver more accurate and relevant results. Dense vectors excel at capturing the deep semantic meaning of data, enabling better understanding and context-aware retrieval. Sparse vectors, on the other hand, are effective for capturing specific features or exact matches, such as keyword occurrences. By integrating both semantic context and feature matching, hybrid search can outperform either method used alone, making it particularly powerful for real-world search scenarios.

### Key Workflow

1. **Definition** (once)
2. **Ingestion** (once)
3. **Retrieve relevant passage(s)** (for every user query)

## Contents

- Environment Setup
- Install packages
- Document data loading
- Create connection
- Ingest data
- Retrieve relevant data

## Environment Setup

Before using the sample code in this notebook, complete the following setup tasks:

- Create a watsonx.data instance (a free plan is offered)
  - Information about creating a watsonx.data instance can be found [here](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.0.x)



## Install required packages

In [1]:
%%capture
!pip install numpy


In [2]:
%%capture
!pip install torch

In [3]:
%%capture
!pip install sentence-transformers


### Install Pymilvus SDK

In [None]:
# !pip install pymilvus
# Restart the kernel after installing
!pip show pymilvus

Name: pymilvus
Version: 2.5.3
Summary: Python Sdk for Milvus
Home-page: 
Author: 
Author-email: Milvus Team <milvus-team@zilliz.com>
License: 
Location: /opt/homebrew/lib/python3.11/site-packages
Requires: grpcio, milvus-lite, pandas, protobuf, python-dotenv, setuptools, ujson
Required-by: 
Note: you may need to restart the kernel to use updated packages.


### Post pymilvus installations

In [5]:

import psutil
import sys
import numpy as np
import time
import pandas as pd
from pymilvus import MilvusClient, DataType, CollectionSchema, FieldSchema, utility, connections, Collection


In [6]:
%%capture
%pip install "pymilvus[model]"


In [7]:
%%capture
%pip install tensorflow


In [8]:
%%capture
%pip install --upgrade transformers


#### Setting Up BM25 and SentenceTransformer for Text Analysis

In [9]:
from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer
from pymilvus.model.sparse import BM25EmbeddingFunction





In [10]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

In [11]:

analyzer = build_default_analyzer(language="en")

bm25_ef = BM25EmbeddingFunction(analyzer)


## Data preparation

In [12]:
import pandas as pd


# Create the DataFrame from the data
data = [
    {
        "id": 1,
        "text": "Python is a high-level programming language. It is known for simplicity and readability.",
        "context": "Developed by Guido van Rossum, this language emerged in 1991 and has since gained widespread adoption. It focuses on clarity and conciseness, favoring a straightforward approach to programming. The design emphasizes readable code and features a broad standard library. The language supports various programming styles and paradigms, making it versatile for different types of projects."
    },
    {
        "id": 2,
        "text": "Python supports object-oriented programming. Classes and objects are fundamental concepts.",
        "context": "The language allows for defining custom types through classes. This approach enables the creation of new data structures with associated methods. Encapsulation, inheritance, and polymorphism are core principles, facilitating modular and reusable code. Objects represent instances of classes and can encapsulate both data and behavior, promoting organized and efficient software design."
    },
    {
        "id": 3,
        "text": "Python has a comprehensive standard library. It provides modules for various tasks.",
        "context": "Included with the language, this extensive library offers tools and modules for a wide range of tasks, from file handling to internet protocols. It covers numerous domains such as data manipulation, networking, and system operations. This built-in support reduces the need for external packages and simplifies development by providing robust and tested components."
    },
    {
        "id": 4,
        "text": "Python features dynamic typing. Variables are not bound to a specific type.",
        "context": "In this language, types are determined at runtime rather than compile-time. This flexibility allows variables to hold different types of data during execution. Type checking is performed dynamically, and the language can adapt to various data types and structures as needed, simplifying coding but requiring careful management to avoid type-related errors."
    },
    {
        "id": 5,
        "text": "Python uses significant whitespace. Indentation affects code structure.",
        "context": "Code organization relies on indentation levels to define blocks of code, such as loops and conditionals. This approach enforces a uniform style and readability, making the structure of the code visually apparent. Proper indentation is essential for defining the scope and flow of control structures, promoting clean and maintainable code."
    },
    {
        "id": 6,
        "text": "Python supports multiple programming paradigms. It includes procedural and functional styles.",
        "context": "The language accommodates various styles of programming, allowing developers to choose the approach that best fits their needs. Procedural programming focuses on procedures and functions, while functional programming emphasizes the use of pure functions and immutability. This versatility supports diverse programming techniques and allows for flexible solutions to different problems."
    },
    {
        "id": 7,
        "text": "Python features automatic memory management. Garbage collection is used to manage resources.",
        "context": "The language includes a built-in mechanism for managing memory allocation and deallocation. Unused objects are automatically reclaimed through garbage collection, which helps prevent memory leaks and optimize resource usage. This feature simplifies memory management for developers and reduces the likelihood of errors related to manual memory handling."
    },
    {
        "id": 8,
        "text": "Python is interpreted, not compiled. Code is executed line by line.",
        "context": "The language processes and executes code directly without a separate compilation step. Each line of code is executed sequentially, allowing for interactive development and immediate feedback. This approach contrasts with compiled languages, where code is translated into machine language before execution. The interpreted nature facilitates debugging and rapid prototyping."
    },
    {
        "id": 9,
        "text": "Python is widely used in data science. Libraries such as NumPy and pandas are popular.",
        "context": "The language has become a leading choice in the field of data analysis and scientific computing. It offers robust libraries and tools designed for handling and analyzing large datasets. NumPy provides support for numerical operations and array manipulations, while pandas offers data structures and functions for data analysis and manipulation. These libraries contribute to Python's strong presence in data science."
    },
    {
        "id": 10,
        "text": "Python supports web development frameworks. Django and Flask are notable examples.",
        "context": "The language provides several frameworks to streamline the development of web applications. Django is a high-level framework that follows the 'batteries-included' philosophy, offering built-in features for rapid development. Flask, on the other hand, is a lightweight framework that provides more flexibility and control. Both frameworks facilitate web development by providing tools and structures for building robust web applications."
    }
]

df = pd.DataFrame(data)



# Save the DataFrame to a CSV file
df.to_csv('milvus_input_data.csv', index=False)

print("DataFrame saved to 'milvus_input_data.csv'")


DataFrame saved to 'milvus_input_data.csv'


In [13]:
df

Unnamed: 0,id,text,context
0,1,Python is a high-level programming language. I...,"Developed by Guido van Rossum, this language e..."
1,2,Python supports object-oriented programming. C...,The language allows for defining custom types ...
2,3,Python has a comprehensive standard library. I...,"Included with the language, this extensive lib..."
3,4,Python features dynamic typing. Variables are ...,"In this language, types are determined at runt..."
4,5,Python uses significant whitespace. Indentatio...,Code organization relies on indentation levels...
5,6,Python supports multiple programming paradigms...,The language accommodates various styles of pr...
6,7,Python features automatic memory management. G...,The language includes a built-in mechanism for...
7,8,"Python is interpreted, not compiled. Code is e...",The language processes and executes code direc...
8,9,Python is widely used in data science. Librari...,The language has become a leading choice in th...
9,10,Python supports web development frameworks. Dj...,The language provides several frameworks to st...


### Connect to Milvus

In [14]:
from pymilvus import MilvusClient
# Replace the placeholder values (<>) with the values of your provisioned Milvus service.

uri = "https://<host>:<port>"  # Construct URI from host and port
user = "<>"
password = "<>"
# Create an instance of the MilvusClient class with the new configuration.
# Uncomment exactly one of the two blocks below, depending on your deployment.

# On-prem
# milvus_client = MilvusClient(
#     uri=uri,
#     user=user,
#     password=password,
#     secure=True,
#     server_pem_path="<>",
#     server_name="<>",
# )

# SaaS
# milvus_client = MilvusClient(
#     uri=uri,
#     user=user,
#     password=password,
#     secure=True,
#     server_name="<>",
# )

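One way to avoid typing credentials into the notebook is to read them from environment variables before building the URI. The sketch below shows the idea; the variable names `MILVUS_HOST`, `MILVUS_PORT`, `MILVUS_USER`, and `MILVUS_PASSWORD` and the helper `build_milvus_uri` are hypothetical and not part of pymilvus or watsonx.data.

```python
import os

def build_milvus_uri(host, port):
    """Return the https URI string in the same form as the connection cell."""
    return f"https://{host}:{port}"

# Hypothetical environment variable names; adjust them to your own setup.
# The fallbacks are the same placeholders used in the connection cell.
host = os.environ.get("MILVUS_HOST", "<host>")
port = os.environ.get("MILVUS_PORT", "<port>")
user = os.environ.get("MILVUS_USER", "<>")
password = os.environ.get("MILVUS_PASSWORD", "<>")

uri = build_milvus_uri(host, port)
print("Connecting to:", uri)
```

The resulting `uri`, `user`, and `password` can then be passed to `MilvusClient` exactly as in the connection cell.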

In [15]:
COLLECTION_NAME = "Milvus_test_hybrid"
DIMENSION = 384
BATCH_SIZE = 2
TOPK = 1
fmt = "=== {:30} ==="
search_latency_fmt = "search latency = {:.4f}s"

In [16]:
if milvus_client.has_collection(collection_name=COLLECTION_NAME):
    milvus_client.drop_collection(collection_name=COLLECTION_NAME)

In [17]:
milvus_client.has_collection(collection_name=COLLECTION_NAME)

False

## Create Milvus schema 
[more about schema](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=milvus-connecting-service#taskconctmilvus__postreq__1)


In [18]:
# Create schema
schema = milvus_client.create_schema(
    auto_id=False,
    enable_dynamic_field=True,
)

# Add fields to schema
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=65535)
schema.add_field(field_name="context", datatype=DataType.VARCHAR, max_length=65535)
schema.add_field(field_name="text_embedding", datatype=DataType.SPARSE_FLOAT_VECTOR)
schema.add_field(field_name="context_embedding", datatype=DataType.FLOAT_VECTOR, dim=DIMENSION)

{'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'text', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 65535}}, {'name': 'context', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 65535}}, {'name': 'text_embedding', 'description': '', 'type': <DataType.SPARSE_FLOAT_VECTOR: 104>}, {'name': 'context_embedding', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 384}}], 'enable_dynamic_field': True}
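
Before inserting data, it can help to check that each row matches the schema above. The following is a minimal sketch of such a check, assuming rows are plain dictionaries shaped like the ones built later in the ingestion step. `validate_row` is a hypothetical helper, not a pymilvus function, and Milvus performs its own validation on insert.

```python
def validate_row(row, dim=384):
    """Return a list of problems found in one row; an empty list means it looks fine."""
    problems = []
    if not isinstance(row.get("id"), int):
        problems.append("id must be an int")
    for name in ("text", "context"):
        value = row.get(name)
        # The VARCHAR limit is checked against the UTF-8 encoded length here.
        if not isinstance(value, str) or len(value.encode("utf-8")) > 65535:
            problems.append(f"{name} must be a string of at most 65535 bytes")
    if not isinstance(row.get("text_embedding"), dict):
        problems.append("text_embedding must be an {index: value} dict")
    dense = row.get("context_embedding")
    if not isinstance(dense, list) or len(dense) != dim:
        problems.append(f"context_embedding must be a list of {dim} floats")
    return problems

# Toy rows for illustration only.
good_row = {
    "id": 1,
    "text": "Python is a high-level programming language.",
    "context": "Developed by Guido van Rossum.",
    "text_embedding": {0: 1.0758263},
    "context_embedding": [0.0] * 384,
}
bad_row = dict(good_row, context_embedding=[0.0] * 3)

print(validate_row(good_row))
print(validate_row(bad_row))
```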

## Create Index 
[more on indexes](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=milvus-connecting-service#taskconctmilvus__postreq__1)


In [19]:
# Create index parameters
index_params = milvus_client.prepare_index_params()

# Add first index for text_embedding
index_params.add_index(
    field_name="text_embedding",
    index_type="SPARSE_INVERTED_INDEX",
    metric_type="IP",
    params={"drop_ratio_build": 0.2}
)

# Add second index for context_embedding
index_params.add_index(
    field_name="context_embedding",
    index_type="IVF_SQ8",
    metric_type="L2",
    params={"nlist": 128}
)
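
The `IVF_SQ8` index type combines an inverted file (the `nlist` clusters) with 8-bit scalar quantization. As rough intuition for the quantization half only, the sketch below maps floats in a fixed range to integer codes in 0..255 and back. The fixed range `lo`/`hi` and both function names are assumptions made for this example; how Milvus chooses ranges and trains the index may differ.

```python
def sq8_encode(vector, lo, hi):
    """Map each float in [lo, hi] to an integer code in 0..255."""
    scale = (hi - lo) / 255.0
    return [min(255, max(0, round((x - lo) / scale))) for x in vector]

def sq8_decode(codes, lo, hi):
    """Map integer codes back to approximate float values."""
    scale = (hi - lo) / 255.0
    return [lo + c * scale for c in codes]

# Toy values in the same rough range as the dense embeddings; illustration only.
vector = [-0.0249, 0.0215, 0.0873, -0.0756]
lo, hi = -0.1, 0.1

codes = sq8_encode(vector, lo, hi)
restored = sq8_decode(codes, lo, hi)
max_error = max(abs(a - b) for a, b in zip(vector, restored))

print("codes:", codes)
print("max reconstruction error:", max_error)
```

Each value is stored in one byte instead of four, at the cost of a small reconstruction error bounded by half a quantization step.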


## Create Collection and Load Data 

In [20]:
# Create the collection with its schema and indexes
milvus_client.create_collection(
    collection_name=COLLECTION_NAME,
    schema=schema,
    index_params=index_params
)

# Load the collection
milvus_client.load_collection(collection_name=COLLECTION_NAME)

## Generate Embeddings

We generate two types of embeddings. The first uses a sentence transformer, which is more context-aware (semantic). The second uses BM25, which produces sparse vector embeddings with TF-IDF-style term weighting and is focused on keyword search. To learn more, see [Dense](https://github.ibm.com/Aldrin-Dennis1/milvus-enhanced-documentation/blob/main/in-memmory-indexes-and-similarity-metrics.md) and [Sparse](https://github.ibm.com/Aldrin-Dennis1/milvus-enhanced-documentation/blob/main/in-memmory-indexes-and-similarity-metrics-sparse-embeddings.md) vector embeddings, or refer to [In-Memory Index](https://milvus.io/docs/index.md?tab=floating).

### 1. Dense embeddings - Sentence Transformer Embeddings - semantic 

In [21]:


# Generate embeddings
df['context_embedding'] = df['context'].apply(lambda x: model.encode(x).tolist())

# Show the DataFrame with embeddings
df


Unnamed: 0,id,text,context,context_embedding
0,1,Python is a high-level programming language. I...,"Developed by Guido van Rossum, this language e...","[-0.024910371750593185, 0.021515710279345512, ..."
1,2,Python supports object-oriented programming. C...,The language allows for defining custom types ...,"[-0.04435064271092415, 0.04121103510260582, -0..."
2,3,Python has a comprehensive standard library. I...,"Included with the language, this extensive lib...","[-0.04649648442864418, 0.023093894124031067, -..."
3,4,Python features dynamic typing. Variables are ...,"In this language, types are determined at runt...","[0.016168083995580673, 0.04423769935965538, -0..."
4,5,Python uses significant whitespace. Indentatio...,Code organization relies on indentation levels...,"[-0.04521821811795235, 0.03502519428730011, -0..."
5,6,Python supports multiple programming paradigms...,The language accommodates various styles of pr...,"[0.010998032987117767, 0.06036316603422165, -0..."
6,7,Python features automatic memory management. G...,The language includes a built-in mechanism for...,"[0.00335799902677536, 0.08728592842817307, -0...."
7,8,"Python is interpreted, not compiled. Code is e...",The language processes and executes code direc...,"[-0.04446705803275108, 0.04414321854710579, -0..."
8,9,Python is widely used in data science. Librari...,The language has become a leading choice in th...,"[-0.0573277585208416, 0.01097246166318655, -0...."
9,10,Python supports web development frameworks. Dj...,The language provides several frameworks to st...,"[-0.07559645920991898, -0.04762483015656471, -..."


### 2. Sparse Embeddings - BM25 Embeddings - keyword based

In [22]:
corpus = df['text'].tolist()
corpus

['Python is a high-level programming language. It is known for simplicity and readability.',
 'Python supports object-oriented programming. Classes and objects are fundamental concepts.',
 'Python has a comprehensive standard library. It provides modules for various tasks.',
 'Python features dynamic typing. Variables are not bound to a specific type.',
 'Python uses significant whitespace. Indentation affects code structure.',
 'Python supports multiple programming paradigms. It includes procedural and functional styles.',
 'Python features automatic memory management. Garbage collection is used to manage resources.',
 'Python is interpreted, not compiled. Code is executed line by line.',
 'Python is widely used in data science. Libraries such as NumPy and pandas are popular.',
 'Python supports web development frameworks. Django and Flask are notable examples.']

In [23]:

tokens = []
for i in corpus:
    tokens.append(analyzer(i))
print("tokens:", tokens)

tokens: [['python', 'high-level', 'program', 'languag', 'known', 'simplic', 'readabl'], ['python', 'support', 'object-ori', 'program', 'class', 'object', 'fundament', 'concept'], ['python', 'comprehens', 'standard', 'librari', 'provid', 'modul', 'various', 'task'], ['python', 'featur', 'dynam', 'type', 'variabl', 'bound', 'specif', 'type'], ['python', 'use', 'signific', 'whitespac', 'indent', 'affect', 'code', 'structur'], ['python', 'support', 'multipl', 'program', 'paradigm', 'includ', 'procedur', 'function', 'style'], ['python', 'featur', 'automat', 'memori', 'manag', 'garbag', 'collect', 'use', 'manag', 'resourc'], ['python', 'interpret', 'compil', 'code', 'execut', 'line', 'line'], ['python', 'wide', 'use', 'data', 'scienc', 'librari', 'numpi', 'panda', 'popular'], ['python', 'support', 'web', 'develop', 'framework', 'django', 'flask', 'notabl', 'exampl']]


In [24]:
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"
bm25_ef = BM25EmbeddingFunction(analyzer)

bm25_ef.fit(corpus)


In [25]:
# Create embeddings for the documents
docs_embeddings = bm25_ef.encode_documents(corpus)

# Print embeddings
print("Embeddings:", docs_embeddings)
# Since the output embeddings are in a 2D csr_array format, we convert them to a list for easier manipulation.
#print("Sparse dim:", bm25_ef.dim, list(docs_embeddings)[0].shape)

Embeddings:   (0, 0)	1.0758262872695923
  (0, 1)	1.0758262872695923
  (0, 2)	1.0758262872695923
  (0, 3)	1.0758262872695923
  (0, 4)	1.0758262872695923
  (0, 5)	1.0758262872695923
  (0, 6)	1.0758262872695923
  (1, 0)	1.0165339708328247
  (1, 2)	1.0165339708328247
  (1, 7)	1.0165339708328247
  (1, 8)	1.0165339708328247
  (1, 9)	1.0165339708328247
  (1, 10)	1.0165339708328247
  (1, 11)	1.0165339708328247
  (1, 12)	1.0165339708328247
  (2, 0)	1.0165339708328247
  (2, 13)	1.0165339708328247
  (2, 14)	1.0165339708328247
  (2, 15)	1.0165339708328247
  (2, 16)	1.0165339708328247
  (2, 17)	1.0165339708328247
  (2, 18)	1.0165339708328247
  (2, 19)	1.0165339708328247
  (3, 0)	1.0165339708328247
  (3, 20)	1.0165339708328247
  :	:
  (6, 44)	0.9156094789505005
  (7, 0)	1.0758262872695923
  (7, 31)	1.0758262872695923
  (7, 45)	1.0758262872695923
  (7, 46)	1.0758262872695923
  (7, 47)	1.0758262872695923
  (7, 48)	1.5043045282363892
  (8, 0)	0.9634358882904053
  (8, 15)	0.9634358882904053
  (8, 26)	0.

In [26]:
docs_embeddings.shape

(10, 62)
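
To see where the document-side values above may come from, the sketch below applies the standard BM25 term-frequency saturation formula to two of the tokenized documents. The parameters `k1=1.5` and `b=0.75` are assumptions: they reproduce the printed values, but the notebook does not state which settings `BM25EmbeddingFunction` uses. `bm25_doc_weights` is a hypothetical helper written for this illustration.

```python
from collections import Counter

def bm25_doc_weights(tokens, avgdl, k1=1.5, b=0.75):
    """Return {token: saturated term-frequency weight} for one document."""
    dl = len(tokens)
    # Length normalization: longer-than-average documents get lower weights.
    norm = k1 * (1 - b + b * dl / avgdl)
    return {t: tf * (k1 + 1) / (tf + norm) for t, tf in Counter(tokens).items()}

# Token counts per document, read off the analyzer output printed earlier.
doc_lengths = [7, 8, 8, 8, 8, 9, 10, 7, 9, 9]
avgdl = sum(doc_lengths) / len(doc_lengths)

doc0 = ['python', 'high-level', 'program', 'languag', 'known', 'simplic', 'readabl']
doc3 = ['python', 'featur', 'dynam', 'type', 'variabl', 'bound', 'specif', 'type']

print(bm25_doc_weights(doc0, avgdl)["python"])  # about 1.0758, as in row 0
print(bm25_doc_weights(doc3, avgdl)["type"])    # about 1.4454, the repeated token in row 3
```

Under these assumptions, every token that occurs once in a document gets the same weight, and the weight depends only on the document length. This matches the repeated values within each row of the output.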

In [27]:
# Convert into the format that Milvus expects
import numpy as np

def csr_to_dict_list(csr_array):
    result = []
    indptr = csr_array.indptr
    indices = csr_array.indices
    data = csr_array.data

    for i in range(csr_array.shape[0]):
        start, end = indptr[i], indptr[i+1]
        row_indices = indices[start:end]
        row_data = data[start:end]
        row_dict = dict(zip(row_indices, row_data))
        result.append(row_dict)
    
    return result

# Use the function
converted_data = csr_to_dict_list(docs_embeddings)
converted_data

[{0: 1.0758263,
  1: 1.0758263,
  2: 1.0758263,
  3: 1.0758263,
  4: 1.0758263,
  5: 1.0758263,
  6: 1.0758263},
 {0: 1.016534,
  2: 1.016534,
  7: 1.016534,
  8: 1.016534,
  9: 1.016534,
  10: 1.016534,
  11: 1.016534,
  12: 1.016534},
 {0: 1.016534,
  13: 1.016534,
  14: 1.016534,
  15: 1.016534,
  16: 1.016534,
  17: 1.016534,
  18: 1.016534,
  19: 1.016534},
 {0: 1.016534,
  20: 1.016534,
  21: 1.016534,
  22: 1.4453635,
  23: 1.016534,
  24: 1.016534,
  25: 1.016534},
 {0: 1.016534,
  26: 1.016534,
  27: 1.016534,
  28: 1.016534,
  29: 1.016534,
  30: 1.016534,
  31: 1.016534,
  32: 1.016534},
 {0: 0.9634359,
  2: 0.9634359,
  7: 0.9634359,
  33: 0.9634359,
  34: 0.9634359,
  35: 0.9634359,
  36: 0.9634359,
  37: 0.9634359,
  38: 0.9634359},
 {0: 0.9156095,
  20: 0.9156095,
  26: 0.9156095,
  39: 0.9156095,
  40: 0.9156095,
  41: 1.3403311,
  42: 0.9156095,
  43: 0.9156095,
  44: 0.9156095},
 {0: 1.0758263,
  31: 1.0758263,
  45: 1.0758263,
  46: 1.0758263,
  47: 1.0758263,
  48: 
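
The conversion relies on how the CSR (compressed sparse row) layout stores a matrix in three flat arrays. The sketch below repeats the same logic on plain Python lists so the role of `indptr`, `indices`, and `data` is easier to see. The toy arrays and the name `csr_rows_to_dicts` are made up for this illustration.

```python
def csr_rows_to_dicts(indptr, indices, data):
    """Convert CSR arrays into one {column: value} dict per row."""
    rows = []
    # indptr[i]:indptr[i + 1] is the slice of indices/data that belongs to row i.
    for start, end in zip(indptr[:-1], indptr[1:]):
        rows.append(dict(zip(indices[start:end], data[start:end])))
    return rows

# Two rows: row 0 has values in columns 0 and 2, row 1 has one value in column 1.
indptr = [0, 2, 3]
indices = [0, 2, 1]
data = [1.5, 2.5, 3.5]

print(csr_rows_to_dicts(indptr, indices, data))
# [{0: 1.5, 2: 2.5}, {1: 3.5}]
```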

In [28]:
df['text_embedding'] = converted_data
df = df[['id','text','context','text_embedding','context_embedding']]
df

Unnamed: 0,id,text,context,text_embedding,context_embedding
0,1,Python is a high-level programming language. I...,"Developed by Guido van Rossum, this language e...","{0: 1.0758263, 1: 1.0758263, 2: 1.0758263, 3: ...","[-0.024910371750593185, 0.021515710279345512, ..."
1,2,Python supports object-oriented programming. C...,The language allows for defining custom types ...,"{0: 1.016534, 2: 1.016534, 7: 1.016534, 8: 1.0...","[-0.04435064271092415, 0.04121103510260582, -0..."
2,3,Python has a comprehensive standard library. I...,"Included with the language, this extensive lib...","{0: 1.016534, 13: 1.016534, 14: 1.016534, 15: ...","[-0.04649648442864418, 0.023093894124031067, -..."
3,4,Python features dynamic typing. Variables are ...,"In this language, types are determined at runt...","{0: 1.016534, 20: 1.016534, 21: 1.016534, 22: ...","[0.016168083995580673, 0.04423769935965538, -0..."
4,5,Python uses significant whitespace. Indentatio...,Code organization relies on indentation levels...,"{0: 1.016534, 26: 1.016534, 27: 1.016534, 28: ...","[-0.04521821811795235, 0.03502519428730011, -0..."
5,6,Python supports multiple programming paradigms...,The language accommodates various styles of pr...,"{0: 0.9634359, 2: 0.9634359, 7: 0.9634359, 33:...","[0.010998032987117767, 0.06036316603422165, -0..."
6,7,Python features automatic memory management. G...,The language includes a built-in mechanism for...,"{0: 0.9156095, 20: 0.9156095, 26: 0.9156095, 3...","[0.00335799902677536, 0.08728592842817307, -0...."
7,8,"Python is interpreted, not compiled. Code is e...",The language processes and executes code direc...,"{0: 1.0758263, 31: 1.0758263, 45: 1.0758263, 4...","[-0.04446705803275108, 0.04414321854710579, -0..."
8,9,Python is widely used in data science. Librari...,The language has become a leading choice in th...,"{0: 0.9634359, 15: 0.9634359, 26: 0.9634359, 4...","[-0.0573277585208416, 0.01097246166318655, -0...."
9,10,Python supports web development frameworks. Dj...,The language provides several frameworks to st...,"{0: 0.9634359, 7: 0.9634359, 55: 0.9634359, 56...","[-0.07559645920991898, -0.04762483015656471, -..."


## Ingestion

In [29]:
# Prepare data for insertion as an array of dictionaries
data_to_insert = [
    {
        "id": int(row["id"]),
        "text": str(row["text"]),
        "context": str(row["context"]),
        "text_embedding": row["text_embedding"],
        "context_embedding": row["context_embedding"],
    }
    for _, row in df.iterrows()
]

# Insert data into the Milvus collection
res = milvus_client.insert(
    collection_name=COLLECTION_NAME,
    data=data_to_insert
)

print("Data inserted successfully:", res)


Data inserted successfully: {'insert_count': 10, 'ids': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
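
The ten rows above fit in a single insert call. For larger datasets, rows are often sent in batches, which is presumably what the `BATCH_SIZE` constant defined earlier is meant for. Below is a minimal sketch of such batching, using a stand-in list instead of `data_to_insert` and leaving the actual insert call as a comment. The helper name `batched` is hypothetical.

```python
def batched(rows, batch_size):
    """Yield consecutive slices of rows, each at most batch_size long."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# Stand-in for data_to_insert; illustration only.
rows = [{"id": i} for i in range(1, 6)]

for batch in batched(rows, batch_size=2):
    # In the notebook, this is where
    # milvus_client.insert(collection_name=COLLECTION_NAME, data=batch)
    # would be called for each batch.
    print([row["id"] for row in batch])
```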


## Query

- The first query is closer to a keyword search, where we expect BM25 to give better results, while the second query is a contextual search, where the sentence transformer should do better. We adjust the reranking weights accordingly for each query and observe how the results vary.

In [30]:
query1 = ["Does Python supports object-oriented programming?"]
query2 = ["Why is python so picky about tabs and spaces?"]


### Query 1

In [31]:

#With query1 - keyword search
bm25_query_embedding = bm25_ef.encode_queries(query1)

print("Embeddings:", bm25_query_embedding)


Embeddings:   (0, 0)	0.4211035966873169
  (0, 2)	0.7621400356292725
  (0, 7)	0.7621400356292725
  (0, 8)	1.8458267450332642


In [32]:
# Reuse csr_to_dict_list, defined earlier, to convert the query embedding
# into the {index: value} format that Milvus expects.
bm25_query_embedding = csr_to_dict_list(bm25_query_embedding)
print("Query text BM25 Embeddings:", bm25_query_embedding)

Query text BM25 Embeddings: [{0: 0.4211036, 2: 0.76214004, 7: 0.76214004, 8: 1.8458267}]
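
The query-side values can be reproduced with an inverse document frequency (IDF) computed over the tokenized corpus printed earlier. The sketch below assumes the formula `log(N - n + 0.5) - log(n + 0.5)` and replaces negative IDF values (terms that occur in most documents, such as `python`) with `epsilon` times the average IDF, using `epsilon=0.25`. These choices are assumptions that happen to match the printed output; the notebook does not confirm how `BM25EmbeddingFunction` computes them.

```python
import math
from collections import Counter

# Tokenized corpus, copied from the analyzer output printed earlier.
corpus_tokens = [
    ['python', 'high-level', 'program', 'languag', 'known', 'simplic', 'readabl'],
    ['python', 'support', 'object-ori', 'program', 'class', 'object', 'fundament', 'concept'],
    ['python', 'comprehens', 'standard', 'librari', 'provid', 'modul', 'various', 'task'],
    ['python', 'featur', 'dynam', 'type', 'variabl', 'bound', 'specif', 'type'],
    ['python', 'use', 'signific', 'whitespac', 'indent', 'affect', 'code', 'structur'],
    ['python', 'support', 'multipl', 'program', 'paradigm', 'includ', 'procedur', 'function', 'style'],
    ['python', 'featur', 'automat', 'memori', 'manag', 'garbag', 'collect', 'use', 'manag', 'resourc'],
    ['python', 'interpret', 'compil', 'code', 'execut', 'line', 'line'],
    ['python', 'wide', 'use', 'data', 'scienc', 'librari', 'numpi', 'panda', 'popular'],
    ['python', 'support', 'web', 'develop', 'framework', 'django', 'flask', 'notabl', 'exampl'],
]

def bm25_idf(corpus_tokens, epsilon=0.25):
    """Return {token: IDF}, with negative values replaced by epsilon * average IDF."""
    n_docs = len(corpus_tokens)
    # Document frequency: in how many documents each token appears.
    df = Counter(t for doc in corpus_tokens for t in set(doc))
    idf = {t: math.log(n_docs - n + 0.5) - math.log(n + 0.5) for t, n in df.items()}
    floor = epsilon * sum(idf.values()) / len(idf)
    return {t: (v if v >= 0 else floor) for t, v in idf.items()}

idf = bm25_idf(corpus_tokens)
for token in ["python", "program", "support", "object-ori"]:
    print(token, round(idf[token], 4))
```

Rare tokens such as `object-ori` receive the largest weight, which is why the sparse request favors the document that contains that exact term.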


In [33]:

#With query1 - semantic search
st_query_embeddings = model.encode(query1)
print("Query text ST Embeddings:", st_query_embeddings.shape)



Query text ST Embeddings: (1, 384)


In [34]:
from pymilvus import AnnSearchRequest


search_param_1 = {
    "data": bm25_query_embedding, # Query vector
    "anns_field": "text_embedding", # Vector field name
    "param": {
        "metric_type": "IP", # Must match the metric type of the index on this field
        "params": {"drop_ratio_search": 0.2}
    },
    "limit": 2 # Number of search results to return in this AnnSearchRequest
}
request_1 = AnnSearchRequest(**search_param_1)

search_param_2 = {
    "data": st_query_embeddings, # Query vector
    "anns_field": "context_embedding", # Vector field name
    "param": {
        "metric_type": "L2", # Must match the metric type of the index on this field
        "params": {"nprobe": 10}
    },
    "limit": 2 # Number of search results to return in this AnnSearchRequest
}
request_2 = AnnSearchRequest(**search_param_2)

reqs = [request_1,request_2]


In [35]:
from pymilvus import WeightedRanker
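# Weights follow the order of reqs: (BM25 sparse request, sentence-transformer dense request)
rerank1 = WeightedRanker(0.2, 0.8)  # Favors the dense (semantic) request
rerank2 = WeightedRanker(0.8, 0.2)  # Favors the sparse (BM25) request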
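
To build intuition for weighted reranking, the sketch below merges two result sets by normalizing each raw score into a common range and taking a weighted sum. The arctan-based normalization, the raw scores, and all function names are assumptions made for this illustration; this is not a description of how `WeightedRanker` is implemented in Milvus.

```python
import math

def normalize_ip(score):
    """Squash an inner-product score into (0, 1); higher stays better."""
    return 0.5 + math.atan(score) / math.pi

def normalize_l2(distance):
    """Map a non-negative L2 distance into (0, 1]; a smaller distance gives a higher score."""
    return 1.0 - 2.0 * math.atan(distance) / math.pi

def weighted_fusion(sparse_hits, dense_hits, w_sparse, w_dense):
    """Combine two {doc_id: raw_score} result sets into one ranked list."""
    combined = {}
    for doc_id, score in sparse_hits.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + w_sparse * normalize_ip(score)
    for doc_id, dist in dense_hits.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + w_dense * normalize_l2(dist)
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Made-up raw scores for illustration only.
sparse_hits = {2: 3.2, 6: 1.4}   # BM25 inner products
dense_hits = {2: 0.9, 1: 1.1}    # L2 distances

print(weighted_fusion(sparse_hits, dense_hits, w_sparse=0.8, w_dense=0.2))
print(weighted_fusion(sparse_hits, dense_hits, w_sparse=0.2, w_dense=0.8))
```

In this toy example, a document found by both requests stays on top under either weighting, while the order of the remaining documents flips depending on which request carries more weight.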



- For the query "Does Python supports object-oriented programming?", we use rerank2, which gives a higher weight to BM25 and a lower weight to the sentence transformer.

In [36]:
# Load the collection
milvus_client.load_collection(
    collection_name=COLLECTION_NAME
)

res = milvus_client.hybrid_search(
    collection_name=COLLECTION_NAME,
    output_fields=["id","text","context"],
    reqs=reqs, # List of AnnSearchRequests created in step 1
    ranker=rerank2, # Reranking strategy specified in step 2
    limit=2, # Number of final search results to return   
)

print("Retrieved Results:")
for hit in res[0]:
    entity = hit['entity']
    print(f"\nId: {entity['id']}\nText: {entity['text']}\nContext: {entity['context']}")
     

Retrieved Results:

Id: 2
Text: Python supports object-oriented programming. Classes and objects are fundamental concepts.
Context: The language allows for defining custom types through classes. This approach enables the creation of new data structures with associated methods. Encapsulation, inheritance, and polymorphism are core principles, facilitating modular and reusable code. Objects represent instances of classes and can encapsulate both data and behavior, promoting organized and efficient software design.

Id: 6
Text: Python supports multiple programming paradigms. It includes procedural and functional styles.
Context: The language accommodates various styles of programming, allowing developers to choose the approach that best fits their needs. Procedural programming focuses on procedures and functions, while functional programming emphasizes the use of pure functions and immutability. This versatility supports diverse programming techniques and allows for flexible solutions to 


- For the same query, "Does Python supports object-oriented programming?", we now use rerank1, which gives a higher weight to the sentence transformer.

In [None]:
# Load the collection
milvus_client.load_collection(
    collection_name=COLLECTION_NAME
)

res = milvus_client.hybrid_search(
    collection_name=COLLECTION_NAME,
    output_fields=["id","text","context"],
    reqs=reqs, # List of AnnSearchRequests created in step 1
    ranker=rerank1, # Reranking strategy specified in step 2
    limit=2, # Number of final search results to return   
)


In [38]:
print("Retrieved Results:")
for hit in res[0]:
    entity = hit['entity']
    print(f"\nId: {entity['id']}\nText: {entity['text']}\nContext: {entity['context']}")


Retrieved Results:

Id: 10
Text: Python supports web development frameworks. Django and Flask are notable examples.
Context: The language provides several frameworks to streamline the development of web applications. Django is a high-level framework that follows the 'batteries-included' philosophy, offering built-in features for rapid development. Flask, on the other hand, is a lightweight framework that provides more flexibility and control. Both frameworks facilitate web development by providing tools and structures for building robust web applications.

Id: 9
Text: Python is widely used in data science. Libraries such as NumPy and pandas are popular.
Context: The language has become a leading choice in the field of data analysis and scientific computing. It offers robust libraries and tools designed for handling and analyzing large datasets. NumPy provides support for numerical operations and array manipulations, while pandas offers data structures and functions for data analysis an

### Query 2

In [39]:
#With query2 - keyword search
bm25_query_embedding = bm25_ef.encode_queries(query2)

print("Embeddings:", bm25_query_embedding)


Embeddings:   (0, 0)	0.4211035966873169


In [40]:
bm25_query_embedding = csr_to_dict_list(bm25_query_embedding)
print("Query text BM25 Embeddings:", bm25_query_embedding)

Query text BM25 Embeddings: [{0: 0.4211036}]
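
The dictionary form above maps a vocabulary index to its BM25 weight. When a sparse field is searched with the IP metric, a match is scored by the inner product over the indices that the query and a document have in common. The following is a minimal sketch of that computation in plain Python. The helper `sparse_inner_product` and the document vectors are hypothetical and only illustrate the idea; this is not how Milvus implements it internally.

```python
def sparse_inner_product(query_vec, doc_vec):
    """Inner product of two sparse vectors stored as {index: weight} dicts."""
    # Iterate over the smaller dict so the loop touches fewer entries.
    if len(query_vec) > len(doc_vec):
        query_vec, doc_vec = doc_vec, query_vec
    # Only indices present in both vectors contribute to the score.
    return sum(w * doc_vec[i] for i, w in query_vec.items() if i in doc_vec)


query_vec = {0: 0.4211036}      # same shape as the BM25 query embedding printed above
doc_vec = {0: 1.5, 7: 0.8}      # hypothetical document embedding (assumption)
no_overlap = {3: 2.0}           # hypothetical document sharing no terms with the query

score = sparse_inner_product(query_vec, doc_vec)
print("Score with shared index:", score)
print("Score without shared index:", sparse_inner_product(query_vec, no_overlap))
```

A document that shares no index with the query scores zero, which is why a keyword-style sparse search only surfaces documents containing at least one query term.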


In [41]:

#With query2 - semantic search
st_query_embeddings = model.encode(query2)
print("Query text ST Embeddings:", st_query_embeddings.shape)



Query text ST Embeddings: (1, 384)


In [42]:
from pymilvus import AnnSearchRequest


search_param_1 = {
    "data": bm25_query_embedding, # Query vector
    "anns_field": "text_embedding", # Vector field name
    "param": {
        "metric_type": "IP", # This parameter value must be identical to the one used in the collection schema
        "params": {"drop_ratio_search": 0.2} # Numeric ratio, not a string
    },
    "limit": 2 # Number of search results to return in this AnnSearchRequest
}
request_1 = AnnSearchRequest(**search_param_1)

search_param_2 = {
    "data": st_query_embeddings, # Query vector
    "anns_field": "context_embedding", # Vector field name
    "param": {
        "metric_type": "L2", # This parameter value must be identical to the one used in the collection schema
        "params": {"nprobe": 10}
    },
    
    "limit": 2 # Number of search results to return in this AnnSearchRequest
}
request_2 = AnnSearchRequest(**search_param_2)

reqs = [request_1,request_2]



- For the query "Why is python so picky about tabs and spaces?", we gave a higher weight to the sentence transformer than to BM25 using rerank1.

In [None]:
# Load the collection
milvus_client.load_collection(
    collection_name=COLLECTION_NAME
)

res = milvus_client.hybrid_search(
    collection_name=COLLECTION_NAME,
    output_fields=["id","text","context"],
    reqs=reqs, # List of AnnSearchRequests created in step 1
    ranker=rerank1, # Reranking strategy specified in step 2
    limit=2, # Number of final search results to return   
)

print("Retrieved Results:")
for hit in res[0]:
    entity = hit['entity']
    print(f"\nId: {entity['id']}\nText: {entity['text']}\nContext: {entity['context']}")



Retrieved Results:

Id: 5
Text: Python uses significant whitespace. Indentation affects code structure.
Context: Code organization relies on indentation levels to define blocks of code, such as loops and conditionals. This approach enforces a uniform style and readability, making the structure of the code visually apparent. Proper indentation is essential for defining the scope and flow of control structures, promoting clean and maintainable code.

Id: 9
Text: Python is widely used in data science. Libraries such as NumPy and pandas are popular.
Context: The language has become a leading choice in the field of data analysis and scientific computing. It offers robust libraries and tools designed for handling and analyzing large datasets. NumPy provides support for numerical operations and array manipulations, while pandas offers data structures and functions for data analysis and manipulation. These libraries contribute to Python's strong presence in data science.


- For the query "Why is python so picky about tabs and spaces?", we gave a higher weight to BM25 using rerank2.

In [None]:
# Load the collection
milvus_client.load_collection(
    collection_name=COLLECTION_NAME
)

res = milvus_client.hybrid_search(
    collection_name=COLLECTION_NAME,
    output_fields=["id","text","context"],
    reqs=reqs, # List of AnnSearchRequests created in step 1
    ranker=rerank2, # Reranking strategy specified in step 2
    limit=2, # Number of final search results to return   
)

print("Retrieved Results:")
for hit in res[0]:
    entity = hit['entity']
    print(f"\nId: {entity['id']}\nText: {entity['text']}\nContext: {entity['context']}")



Retrieved Results:

Id: 1
Text: Python is a high-level programming language. It is known for simplicity and readability.
Context: Developed by Guido van Rossum, this language emerged in 1991 and has since gained widespread adoption. It focuses on clarity and conciseness, favoring a straightforward approach to programming. The design emphasizes readable code and features a broad standard library. The language supports various programming styles and paradigms, making it versatile for different types of projects.

Id: 8
Text: Python is interpreted, not compiled. Code is executed line by line.
Context: The language processes and executes code directly without a separate compilation step. Each line of code is executed sequentially, allowing for interactive development and immediate feedback. This approach contrasts with compiled languages, where code is translated into machine language before execution. The interpreted nature facilitates debugging and rapid prototyping.


## Conclusion

The test results above show that some queries benefit from contextual (semantic) search, while others are better served by keyword-based search. Depending on the nature of the query, the weights assigned to the dense and sparse embeddings can be adjusted to improve retrieval accuracy.
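
To make the effect of the weights concrete, the sketch below fuses the scores of a sparse search and a dense search with a weighted sum and returns the top ids. It is a simplified illustration: `weighted_fusion` is a hypothetical helper, the scores are made up and assumed to be already normalized to the range 0 to 1, and the weights 0.2 and 0.8 are placeholders rather than the values used for rerank1 and rerank2 in this notebook. It does not reproduce the score normalization that Milvus applies internally.

```python
def weighted_fusion(sparse_scores, dense_scores, sparse_weight, dense_weight, limit=2):
    """Fuse two {id: score} mappings with a weighted sum and return the top ids."""
    ids = set(sparse_scores) | set(dense_scores)
    fused = {
        doc_id: sparse_weight * sparse_scores.get(doc_id, 0.0)
        + dense_weight * dense_scores.get(doc_id, 0.0)
        for doc_id in ids
    }
    # Sort by fused score (descending); break ties by id for a stable order.
    ranked = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ranked[:limit]]


# Hypothetical normalized scores (assumption), keyed by document id.
sparse_scores = {1: 0.9, 8: 0.7}   # hits from the keyword (BM25) request
dense_scores = {5: 0.8, 9: 0.6}    # hits from the semantic (sentence transformer) request

dense_heavy = weighted_fusion(sparse_scores, dense_scores, sparse_weight=0.2, dense_weight=0.8)
sparse_heavy = weighted_fusion(sparse_scores, dense_scores, sparse_weight=0.8, dense_weight=0.2)

print("Dense-weighted ranking:", dense_heavy)
print("Sparse-weighted ranking:", sparse_heavy)
```

With the same candidate hits, shifting the weight from one embedding type to the other changes which documents reach the final result list, which mirrors the difference seen between the two reranking runs for the second query.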