# MILVUS Demo - Grouping Search

# Grouping Search with Milvus in watsonx.data

## Disclaimers
- Use only Projects and Spaces that are available in the watsonx context.

This notebook covers the Milvus grouping search capability, available from Milvus 2.4.0.
In Milvus, grouping search results by a specific field avoids returning several results that share the same value in that field.

## Overview
### Audience
The scenario presented in this notebook:
- Consider a collection of products, where each product has several reviews.
- Each review is represented by one vector embedding and belongs to one product.
- To find relevant products instead of similar reviews, include the `group_by_field` argument in the `search()` operation to group results by the product ID.
- This returns the most relevant unique products, rather than separate reviews of the same product.

Some familiarity with Python programming, search algorithms, and basic machine learning concepts is recommended. The code runs with Python 3.10 or later.

### Learning goal
This notebook demonstrates similarity search support in watsonx.data using grouping search, introducing commands for:
- Connecting to Milvus
- Creating collections
- Creating indexes
- Generating embeddings
- Ingesting data
- Data retrieval

### About Milvus 

Milvus is an open-source vector database designed specifically for scalable similarity search and AI applications. It enables efficient storage, indexing, and retrieval of vector embeddings, which are crucial in modern machine learning and artificial intelligence tasks. To learn more, see the [Milvus documentation](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=components-milvus).

### Milvus: Three Fundamental Steps

#### 1. Data Preparation
Collect and convert your data into high-dimensional vector embeddings. These vectors are typically generated using machine learning models like neural networks, which transform text, images, audio, or other data types into dense numerical representations that capture semantic meaning and relationships.

#### 2. Vector Insertion
Load the dense vector embeddings and sparse vector embeddings into Milvus collections or partitions within a database. Milvus creates indexes to optimize subsequent search operations and supports various indexing algorithms, such as IVF_FLAT and HNSW, depending on the index definition.

#### 3. Similarity Search
Perform vector similarity searches by providing a query vector and a reranking weight. Milvus rapidly returns the most similar vectors from the collection or partitions, based on the defined metric (such as cosine similarity, Euclidean distance, or inner product) and the reranking weight.
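
To make the three metrics named above concrete, here is a minimal pure-Python sketch of how each one is computed for two small vectors. It is an illustration only and says nothing about how Milvus computes them internally; the function names and sample vectors are made up for this example.

```python
import math


def inner_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance between two points (smaller = more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Inner product of the two vectors after normalizing their lengths
    (larger = more similar)."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


# Hypothetical 2-dimensional vectors; real embeddings have hundreds of dimensions.
query = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 3.0]

cos_same = cosine_similarity(query, same_direction)
cos_orth = cosine_similarity(query, orthogonal)
l2_same = euclidean_distance(query, same_direction)
ip_same = inner_product(query, same_direction)
print(cos_same, cos_orth, l2_same, ip_same)
```

Note that cosine similarity ignores vector length (the two parallel vectors score 1.0 even though they differ in magnitude), while Euclidean distance and inner product do not.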

### Why grouping search?

When entities in the search results share the same value in a scalar field, they are similar in that attribute, and the repetition can make the results less useful. A grouping search lets Milvus group search results by the values in a specified field, aggregating the data at a higher level.
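
As a rough illustration of the idea, the following pure-Python sketch keeps only the best-scoring hit for each distinct value of a grouping field. It is a conceptual model, not Milvus's implementation; the function `group_hits` and the sample scores are hypothetical.

```python
def group_hits(hits, group_by_field, limit):
    """Keep the highest-scoring hit for each distinct value of group_by_field.

    `hits` is a list of dicts holding a "distance" score (higher = more
    similar, as with the COSINE metric) and the scalar field to group on.
    """
    ranked = sorted(hits, key=lambda h: h["distance"], reverse=True)
    seen = set()
    grouped = []
    for hit in ranked:
        key = hit[group_by_field]
        if key in seen:
            continue  # a better hit from this group was already kept
        seen.add(key)
        grouped.append(hit)
        if len(grouped) == limit:
            break
    return grouped


# Hypothetical hits: three reviews of product 2 and two reviews of product 1.
hits = [
    {"id": 12, "product_id": 2, "distance": 0.43},
    {"id": 6, "product_id": 1, "distance": 0.38},
    {"id": 18, "product_id": 2, "distance": 0.37},
    {"id": 2, "product_id": 1, "distance": 0.34},
    {"id": 11, "product_id": 2, "distance": 0.33},
]

top_products = group_hits(hits, group_by_field="product_id", limit=10)
print([h["id"] for h in top_products])
```

Five review-level hits collapse into two product-level results, one per `product_id`.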

### Key Workflow

1. **Definition** (once)
2. **Ingestion** (once)
3. **Retrieve relevant passage(s)** (for every user query)

## Contents

- Environment Setup
- Install packages
- Document data loading
- Create connection
- Ingest data
- Retrieve relevant data

## Environment Setup

Before using the sample code in this notebook, complete the following setup tasks:

- Create a watsonx.data instance (a free plan is offered).
  - For information about creating a watsonx.data instance, see the [watsonx.data documentation](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.0.x).


## Import Libraries

This notebook uses a sentence transformer model to generate vector embeddings.

In [2]:
!pip show pymilvus




In [3]:
%%capture
!pip install transformers


In [4]:
%%capture
!pip install sentence-transformers


In [5]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')



## Load Data

The dataset consists of 5 products with 10 reviews each.
- The first product, 'BestProductA', has only positive reviews.
- The second product, 'GoodProductB', has mostly positive reviews and very few negative or neutral ones.
- The third product, 'AverageProductC', has mostly mixed or neutral reviews.
- The fourth product, 'BadProductD', has mostly negative reviews and very few positive or neutral ones.
- The fifth product, 'WorstProductE', has only negative reviews.

The product names were chosen to hint at the overall product quality. They have no effect on the similarity calculation.

In [6]:
import pandas as pd

# Define the products and their reviews
products = {
    1: ("BestProductA", ["This product is amazing!", "I love it!", "Highly recommend!", "Excellent quality!", 
                         "Works perfectly!", "Very satisfied!", "Exceeded my expectations!", "Top-notch product!", 
                         "Will buy again!", "Best purchase ever!"]),
    2: ("GoodProductB", ["Good product.", "Quite satisfied.", "Meets my needs.", "Decent quality.", "Nothing special.", 
                         "Reliable.", "Does the job.", "Overall good.", "Works as expected.", "Satisfied."]),
    3: ("AverageProductC", ["It's okay.", "Average quality.", "Complete waste of money!", "Exceeded my expectations!", "It's fine.", 
                            "Nothing special.", "Does the job.","Very disappointed!", "Just alright.", "Not impressed."]),
    4: ("BadProductD", ["Not great.", "Disappointed.", "Happy with the purchase.", "Works perfectly!", "Not worth the price.", 
                        "Subpar quality.", "Does the job.", "Expected more.", "Mediocre.", "Good product."]),
    5: ("WorstProductE", ["Terrible product!", "Hate it!", "Do not recommend.", "Awful quality!", 
                          "Doesn't work.", "Very disappointed!", "Complete waste of money!", "Worst product ever!", 
                          "Never buying again!", "Extremely unsatisfied!"])
}

# Generate the DataFrame
data = []

for product_id, (product_name, reviews) in products.items():
    for review in reviews:
        data.append({"ProductId": product_id, "ProductName": product_name, "Review": review})

df = pd.DataFrame(data)

# Print the DataFrame
df.head(5)


| | ProductId | ProductName | Review |
|---|---|---|---|
| 0 | 1 | BestProductA | This product is amazing! |
| 1 | 1 | BestProductA | I love it! |
| 2 | 1 | BestProductA | Highly recommend! |
| 3 | 1 | BestProductA | Excellent quality! |
| 4 | 1 | BestProductA | Works perfectly! |


In [7]:
# Add an id column (1..N) to serve as the primary key
df['id'] = list(range(1, len(df) + 1))
df.head(5)

| | ProductId | ProductName | Review | id |
|---|---|---|---|---|
| 0 | 1 | BestProductA | This product is amazing! | 1 |
| 1 | 1 | BestProductA | I love it! | 2 |
| 2 | 1 | BestProductA | Highly recommend! | 3 |
| 3 | 1 | BestProductA | Excellent quality! | 4 |
| 4 | 1 | BestProductA | Works perfectly! | 5 |


## Generate vectors

In [8]:
# Generate embeddings for each review and add to a new column
df['Embeddings'] = df['Review'].apply(lambda x: model.encode(x).tolist())
df = df[["id","ProductId","ProductName","Review","Embeddings"]]
df.head(5)

| | id | ProductId | ProductName | Review | Embeddings |
|---|---|---|---|---|---|
| 0 | 1 | 1 | BestProductA | This product is amazing! | [-0.07711850851774216, 0.011897865682840347, 0... |
| 1 | 2 | 1 | BestProductA | I love it! | [-0.02365611121058464, 0.016240587458014488, 0... |
| 2 | 3 | 1 | BestProductA | Highly recommend! | [-0.058093734085559845, -0.014794943854212761,... |
| 3 | 4 | 1 | BestProductA | Excellent quality! | [-0.0684385746717453, 0.060429755598306656, -0... |
| 4 | 5 | 1 | BestProductA | Works perfectly! | [-0.04870240017771721, -0.008712533861398697, ... |


## Connect to Milvus

In [9]:
fmt = "\n=== {:30} ===\n"
num_entities, dim = 50, 384  # 50 reviews; all-MiniLM-L6-v2 produces 384-dimensional embeddings
BATCH_SIZE = 5
collection_name="Product_Reviews_Collection"

In [10]:
from pymilvus import MilvusClient, DataType, CollectionSchema, FieldSchema,utility

In [None]:
from pymilvus import MilvusClient, DataType

# Replace the placeholders with your own Milvus connection details
SERVER_ADDR = "https://<username>:<password>@<host>:<port>"
client = MilvusClient(
    uri=SERVER_ADDR,
    secure=True
)
print("Connected")

Connected
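
To avoid hard-coding credentials in the URI, one option is to read them from the environment and pass them through the `user` and `password` arguments of `MilvusClient`. The sketch below only builds the keyword arguments and does not connect to anything; the helper `milvus_connection_kwargs` and the `MILVUS_*` variable names are assumptions made for this example, not part of the original notebook.

```python
import os


def milvus_connection_kwargs(env):
    """Build keyword arguments for MilvusClient from a mapping of settings.

    The MILVUS_* names are hypothetical; use whatever your deployment defines.
    """
    return {
        "uri": f"https://{env['MILVUS_HOST']}:{env['MILVUS_PORT']}",
        "user": env["MILVUS_USER"],
        "password": env["MILVUS_PASSWORD"],
        "secure": True,
    }


# Demonstration with placeholder values instead of real environment variables.
example_env = {
    "MILVUS_HOST": "milvus.example.com",
    "MILVUS_PORT": "8080",
    "MILVUS_USER": "my_user",
    "MILVUS_PASSWORD": "my_password",
}
kwargs = milvus_connection_kwargs(example_env)

# Mask the password before printing.
print({k: ("***" if k == "password" else v) for k, v in kwargs.items()})

# In the notebook, the connection would then be created with:
# client = MilvusClient(**milvus_connection_kwargs(os.environ))
```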


In [12]:
# Drop the collection only if it already exists
if client.has_collection(collection_name):
    print("Collection Exists. Dropping collection.")
    client.drop_collection(collection_name)

Collection Exists. Dropping collection.


## Create schema

In [13]:
schema = MilvusClient.create_schema(
    auto_id=False,
    enable_dynamic_field=True,
)

# Add fields to the schema
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="product_id", datatype=DataType.INT64)
schema.add_field(field_name="product_name", datatype=DataType.VARCHAR, max_length=15)
schema.add_field(field_name="reviews", datatype=DataType.VARCHAR, max_length=100)
schema.add_field(field_name="embeddings", datatype=DataType.FLOAT_VECTOR, dim=dim)

{'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'product_id', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'product_name', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 15}}, {'name': 'reviews', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 100}}, {'name': 'embeddings', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 384}}], 'enable_dynamic_field': True}

In [14]:
index_params = client.prepare_index_params()

index_params.add_index(
    field_name="id",
    index_type="STL_SORT"
)

index_params.add_index(
    field_name="embeddings", 
    index_type="IVF_FLAT",
    metric_type="COSINE",
    params={ "nlist": 128 }
)
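
The index above uses `IVF_FLAT` with an `nlist` parameter, and the grouped search later passes `nprobe`. As a toy illustration of what these two numbers mean, the sketch below assigns vectors to the inverted list of their nearest centroid (`nlist` lists in total) and then scans only the `nprobe` lists closest to the query. It is a simplified model under stated assumptions: the centroids are picked by hand rather than learned by clustering, Euclidean distance is used for simplicity, and all names are hypothetical. It is not Milvus's implementation.

```python
import math


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, centroids):
    """Assign each vector index to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: dist(vec, centroids[c]))
        lists[nearest].append(idx)
    return lists


def ivf_search(query, vectors, centroids, lists, nprobe, limit):
    """Scan only the nprobe inverted lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: dist(query, centroids[c]))
    candidates = [idx for c in order[:nprobe] for idx in lists[c]]
    return sorted(candidates, key=lambda idx: dist(query, vectors[idx]))[:limit]


# Five hypothetical 2-dimensional vectors and three hand-picked centroids (nlist = 3).
vectors = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [9.9, 0.1]]
centroids = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]

lists = build_ivf(vectors, centroids)
result = ivf_search([0.1, 0.1], vectors, centroids, lists, nprobe=1, limit=2)
print(lists)
print(result)
```

A larger `nprobe` scans more lists, which tends to improve recall at the cost of more distance computations.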


## Create collection

In [15]:
client.create_collection(
    collection_name=collection_name,
    schema=schema,
    index_params=index_params
)

# Check that the collection is loaded and ready for search
res = client.get_load_state(
    collection_name=collection_name
)

print(res)



{'state': <LoadState: Loaded>}


In [16]:
# Rename the DataFrame columns to match the schema field names
df_new = df.rename(columns={
    "ProductId": "product_id",
    "ProductName": "product_name",
    "Review": "reviews",
    "Embeddings": "embeddings",
})

# Convert the DataFrame into a list of row dictionaries for insertion
data_list = df_new.to_dict(orient="records")



## Insert Data

In [18]:
res = client.insert(
    collection_name=collection_name,
    data=data_list
)

print(res)

{'insert_count': 50, 'ids': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]}
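
The notebook defines `BATCH_SIZE = 5` earlier but inserts all 50 rows in a single call. For larger datasets, inserting in batches is a common pattern. The sketch below shows one way to split a list of records into batches; the `batched` helper and the stand-in records are hypothetical, and the `client.insert` call is shown only as a comment because it needs a live connection.

```python
def batched(records, batch_size):
    """Yield consecutive slices of `records` with at most `batch_size` items."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Stand-in for data_list: 12 hypothetical records.
records = [{"id": i} for i in range(1, 13)]

batch_sizes = []
for batch in batched(records, batch_size=5):
    # In the notebook, each batch would be inserted with:
    # client.insert(collection_name=collection_name, data=batch)
    batch_sizes.append(len(batch))

print(batch_sizes)
```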


## Query 

Tested with query texts:
- wow. superb
- worst experience in my life
- lovely
- terrible
- yukk
- falling in love with it
- was ok

NOTE: The results may vary depending on the quality of the embedding model used, the length of the text, and the diversity of the data.

In [19]:
question_text = "lovely"
question_vector = model.encode(question_text).tolist()

## Without Grouping search

We use COSINE similarity as the metric. With this metric, a higher distance value means the data points are more similar.

In [20]:
# Load the collection into memory before searching
client.load_collection(collection_name)

# Search without `group_by_field`
res = client.search(
    collection_name=collection_name,
    data=[question_vector],  # Query vector
    anns_field="embeddings",  # Vector field to search on
    search_params={
        "metric_type": "COSINE",
        "params": {},
    },
    limit=10,  # Max. number of search results to return
    # group_by_field="product_id",  # Left out here, so results are not grouped
    output_fields=["reviews", "product_name"]
)

# Collect the returned entity (review text and product name) of each hit
entities = [result['entity'] for result in res[0]]

print("\n")
for group in res:
    print("\nResults:")
    for hit in group:
        print(hit)
    print("\n")




Results:
{'id': 12, 'distance': 0.4272212088108063, 'entity': {'reviews': 'Quite satisfied.', 'product_name': 'GoodProductB'}}
{'id': 6, 'distance': 0.37505069375038147, 'entity': {'reviews': 'Very satisfied!', 'product_name': 'BestProductA'}}
{'id': 18, 'distance': 0.3703482449054718, 'entity': {'reviews': 'Overall good.', 'product_name': 'GoodProductB'}}
{'id': 2, 'distance': 0.3400030732154846, 'entity': {'reviews': 'I love it!', 'product_name': 'BestProductA'}}
{'id': 4, 'distance': 0.3392608165740967, 'entity': {'reviews': 'Excellent quality!', 'product_name': 'BestProductA'}}
{'id': 30, 'distance': 0.33884724974632263, 'entity': {'reviews': 'Not impressed.', 'product_name': 'AverageProductC'}}
{'id': 40, 'distance': 0.33581191301345825, 'entity': {'reviews': 'Good product.', 'product_name': 'BadProductD'}}
{'id': 11, 'distance': 0.33581191301345825, 'entity': {'reviews': 'Good product.', 'product_name': 'GoodProductB'}}
{'id': 33, 'distance': 0.33286672830581665, 'entity': {'r

We can see that there is repetition (redundancy) in the products. 'GoodProductB' occurs 4 times in the top 10 results, making it the most relevant product for the query text.

## With grouping search

In [21]:
# Load the collection into memory before searching
client.load_collection(collection_name)

# Search with `group_by_field`
res = client.search(
    collection_name=collection_name,
    data=[question_vector],  # Query vector
    search_params={
        "metric_type": "COSINE",
        "params": {"nprobe": 10},
    },
    limit=10,  # Max. number of search results to return
    group_by_field="product_id",  # Group results by product ID
    output_fields=["product_name", "reviews", "product_id"]
)

# Collect the product name of each returned hit
product_names = [result['entity']['product_name'] for result in res[0]]

for group in res:
    print("\nGroup:")
    for hit in group:
        print(hit)
print("\n")


Group:
{'id': 12, 'distance': 0.4272212088108063, 'entity': {'reviews': 'Quite satisfied.', 'product_id': 2, 'product_name': 'GoodProductB'}}
{'id': 6, 'distance': 0.37505069375038147, 'entity': {'reviews': 'Very satisfied!', 'product_id': 1, 'product_name': 'BestProductA'}}
{'id': 30, 'distance': 0.33884724974632263, 'entity': {'reviews': 'Not impressed.', 'product_id': 3, 'product_name': 'AverageProductC'}}
{'id': 40, 'distance': 0.33581191301345825, 'entity': {'reviews': 'Good product.', 'product_id': 4, 'product_name': 'BadProductD'}}
{'id': 44, 'distance': 0.27378007769584656, 'entity': {'reviews': 'Awful quality!', 'product_id': 5, 'product_name': 'WorstProductE'}}




As expected, 'GoodProductB' is at the top of the result list, making it the most relevant product, and there is no repetition in the search results.

Here, the similarity search runs on the vector field (`embeddings`), while the search results are grouped by the `product_id` field, returning the product names that are relevant to the search query.
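
As a small follow-up, the grouped hits can be turned into a ranked product list. The sketch below works on a hard-coded copy of the grouped output shown above (distances rounded to four decimals, review text omitted) so that it runs without a Milvus connection; the helper `ranked_products` is hypothetical.

```python
# Copy of the grouped search output above, with distances rounded.
grouped_hits = [
    {"id": 12, "distance": 0.4272, "entity": {"product_id": 2, "product_name": "GoodProductB"}},
    {"id": 6, "distance": 0.3751, "entity": {"product_id": 1, "product_name": "BestProductA"}},
    {"id": 30, "distance": 0.3388, "entity": {"product_id": 3, "product_name": "AverageProductC"}},
    {"id": 40, "distance": 0.3358, "entity": {"product_id": 4, "product_name": "BadProductD"}},
    {"id": 44, "distance": 0.2738, "entity": {"product_id": 5, "product_name": "WorstProductE"}},
]


def ranked_products(hits):
    """Return (rank, product_name, distance) tuples in result order."""
    return [
        (rank, hit["entity"]["product_name"], hit["distance"])
        for rank, hit in enumerate(hits, start=1)
    ]


ranking = ranked_products(grouped_hits)
for rank, name, distance in ranking:
    print(f"{rank}. {name} (distance={distance:.4f})")

# In a grouped result, each product_id appears at most once.
product_ids = [hit["entity"]["product_id"] for hit in grouped_hits]
is_unique = len(product_ids) == len(set(product_ids))
print(is_unique)
```

In the notebook itself, the same helper could be applied to `res[0]` from the grouped search.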