# Milvus Grouping Search
This notebook covers the Grouping Search capabilities available from Milvus 2.4.0.

In Milvus, grouping search results by a specific field avoids returning several entities that share the same value of that field.
* Consider a collection of product reviews, where each product has several reviews.
* Each review is represented by one vector embedding and belongs to one product.
* To find relevant products instead of similar reviews, pass the `group_by_field` argument to the `search()` operation to group results by the product ID field (`product_id` in this notebook).
* The search then returns the most relevant distinct products rather than several reviews of the same product.
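
The effect of grouping can be sketched in plain Python. This is only an illustration of the idea (keep the best-scoring hit for each group), not how Milvus implements it; the `hits` list and the `group_top_hits` helper are hypothetical.

```python
def group_top_hits(hits, group_key, limit):
    """Keep the best-scoring hit for each distinct value of `group_key`."""
    seen = set()
    grouped = []
    # Sort by score, highest first, as for a similarity metric such as COSINE
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        if hit[group_key] in seen:
            continue  # A better hit from this group was already kept
        seen.add(hit[group_key])
        grouped.append(hit)
        if len(grouped) == limit:
            break
    return grouped


# Hypothetical review hits: review id, product id, similarity score
hits = [
    {"id": 1, "product_id": 1, "score": 0.9},
    {"id": 2, "product_id": 1, "score": 0.8},
    {"id": 3, "product_id": 2, "score": 0.7},
    {"id": 4, "product_id": 2, "score": 0.6},
    {"id": 5, "product_id": 3, "score": 0.5},
]

top_products = group_top_hits(hits, group_key="product_id", limit=3)
print([hit["id"] for hit in top_products])
```

Each product contributes at most one review to the result, which is the behavior the rest of this notebook demonstrates with `group_by_field`.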

## Import Libraries  
This notebook uses the `sentence-transformers` library to generate vector embeddings.

In [1]:
!pip3 show pymilvus


In [2]:
%%capture
!pip3 install transformers

In [3]:
%%capture
!pip3 install sentence-transformers

In [4]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')



## Load Data
The dataset consists of 5 products and 10 reviews for each product.

* The first product, named 'BestProductA', has only positive reviews.
* The second product, named 'GoodProductB', has mostly positive reviews, with very few negative or neutral reviews.
* The third product named 'AverageProductC' has mostly mixed or neutral reviews.
* The fourth product named 'BadProductD' has mostly negative reviews, very few positive or neutral reviews.
* The fifth product named 'WorstProductE' has negative reviews with very few neutral reviews.

The product names were chosen only to hint at the overall product quality. They have no effect on the similarity calculation, because only the review text is embedded.

In [5]:
import pandas as pd

# Define the products and their reviews
products = {
    1: ("BestProductA", ["This product is amazing!", "I love it!", "Highly recommend!", "Excellent quality!", 
                         "Works perfectly!", "Very satisfied!", "Exceeded my expectations!", "Top-notch product!", 
                         "Will buy again!", "Best purchase ever!"]),
    2: ("GoodProductB", ["Good product.", "Quite satisfied.", "Meets my needs.", "Decent quality.", "Nothing special.", 
                         "Reliable.", "Does the job.", "Overall good.", "Works as expected.", "Satisfied."]),
    3: ("AverageProductC", ["It's okay.", "Average quality.", "Complete waste of money!", "Exceeded my expectations!", "It's fine.", 
                            "Nothing special.", "Does the job.","Very disappointed!", "Just alright.", "Impressed with the product."]),
    4: ("BadProductD", ["Not great.", "Disappointed.", "Happy with the purchase.", "Works perfectly!", "Not worth the price.", 
                        "Subpar quality.", "Does the job.", "Expected more.", "Mediocre.", "Good product."]),
    5: ("WorstProductE", ["Terrible product!", "Hate it!", "Do not recommend.", "Not Bad", 
                          "Okayish product", "Very disappointed!", "Complete waste of money!", "Worst product ever!", 
                          "Never buying again!", "Extremely unsatisfied!"])
}

# Generate the DataFrame
data = []

for product_id, (product_name, reviews) in products.items():
    for review in reviews:
        data.append({"ProductId": product_id, "ProductName": product_name, "Review": review})

df = pd.DataFrame(data)

# Print the DataFrame
df

Unnamed: 0,ProductId,ProductName,Review
0,1,BestProductA,This product is amazing!
1,1,BestProductA,I love it!
2,1,BestProductA,Highly recommend!
3,1,BestProductA,Excellent quality!
4,1,BestProductA,Works perfectly!
5,1,BestProductA,Very satisfied!
6,1,BestProductA,Exceeded my expectations!
7,1,BestProductA,Top-notch product!
8,1,BestProductA,Will buy again!
9,1,BestProductA,Best purchase ever!


In [6]:
# Add an id column (1..N) to use as the primary key
df['id'] = list(range(1, len(df) + 1))
df

Unnamed: 0,ProductId,ProductName,Review,id
0,1,BestProductA,This product is amazing!,1
1,1,BestProductA,I love it!,2
2,1,BestProductA,Highly recommend!,3
3,1,BestProductA,Excellent quality!,4
4,1,BestProductA,Works perfectly!,5
5,1,BestProductA,Very satisfied!,6
6,1,BestProductA,Exceeded my expectations!,7
7,1,BestProductA,Top-notch product!,8
8,1,BestProductA,Will buy again!,9
9,1,BestProductA,Best purchase ever!,10


## Generate Vectors

In [7]:
# Generate embeddings for each review and add to a new column
df['Embeddings'] = df['Review'].apply(lambda x: model.encode(x).tolist())
df = df[["id","ProductId","ProductName","Review","Embeddings"]]
df

Unnamed: 0,id,ProductId,ProductName,Review,Embeddings
0,1,1,BestProductA,This product is amazing!,"[-0.07711853086948395, 0.011897844262421131, 0..."
1,2,1,BestProductA,I love it!,"[-0.023656103760004044, 0.01624063029885292, 0..."
2,3,1,BestProductA,Highly recommend!,"[-0.058093756437301636, -0.014794892631471157,..."
3,4,1,BestProductA,Excellent quality!,"[-0.0684385746717453, 0.060429785400629044, -0..."
4,5,1,BestProductA,Works perfectly!,"[-0.0487024150788784, -0.008712494745850563, -..."
5,6,1,BestProductA,Very satisfied!,"[-0.0765625387430191, 0.027355119585990906, 0...."
6,7,1,BestProductA,Exceeded my expectations!,"[-0.026856569573283195, 0.00038697515265084803..."
7,8,1,BestProductA,Top-notch product!,"[-0.06905435770750046, 0.013683732599020004, 0..."
8,9,1,BestProductA,Will buy again!,"[-0.06660202145576477, -0.048435360193252563, ..."
9,10,1,BestProductA,Best purchase ever!,"[-0.10550551116466522, 0.05549986660480499, -0..."


## Connect to Milvus 

In [8]:
fmt = "\n=== {:30} ===\n"
num_entities, dim = 50, 384  # 50 reviews; all-MiniLM-L6-v2 produces 384-dimensional vectors
BATCH_SIZE = 5
collection_name="Product_Reviews_Collection"
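
The `dim` value should match the length of the vectors produced by the embedding model, because the vector field is created with that dimension. A minimal sketch of a pre-insert check follows; the `check_dimensions` helper and the toy vectors are hypothetical and not part of pymilvus.

```python
def check_dimensions(vectors, expected_dim):
    """Return the indices of vectors whose length differs from expected_dim."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dim]


# Hypothetical toy vectors; in this notebook the vectors would be
# df['Embeddings'].tolist() and expected_dim would be dim (384).
toy_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8]]
bad_rows = check_dimensions(toy_vectors, expected_dim=3)
print(bad_rows)
```

An empty list means every vector has the expected length.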

In [9]:
from pymilvus import MilvusClient, DataType

In [10]:
uri = "http://<hostname>:<port>"  # Replace with your Milvus host and port
user = "<username>"
password = "<password>"

# Connect with TLS enabled; the certificate path and server name are placeholders
client = MilvusClient(
    uri=uri,
    user=user,
    password=password,
    secure=True,
    server_pem_path='<path/to/the/cert>',
    server_name='<hostname>',
)
print("Connected")

Connected


In [11]:
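# Drop the collection if it already exists so the notebook can be re-run
if client.has_collection(collection_name):
    print("Collection Exists. Dropping collection.")
    client.drop_collection(collection_name)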

Collection Exists. Dropping collection.


## Create Schema

In [12]:
schema = MilvusClient.create_schema(
    auto_id=False,
    enable_dynamic_field=True,
)

# Add fields to the schema
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True),
schema.add_field(field_name="product_id", datatype=DataType.INT64),
schema.add_field(field_name="product_name", datatype=DataType.VARCHAR, max_length=15),
schema.add_field(field_name="reviews", datatype=DataType.VARCHAR, max_length=100),
schema.add_field(field_name="embeddings", datatype=DataType.FLOAT_VECTOR, dim=dim),

({'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'product_id', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'product_name', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 15}}, {'name': 'reviews', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 100}}, {'name': 'embeddings', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 384}}], 'enable_dynamic_field': True},)

In [13]:
index_params = client.prepare_index_params()

index_params.add_index(
    field_name="id",
    index_type="STL_SORT"
)

index_params.add_index(
    field_name="embeddings", 
    index_type="IVF_FLAT",
    metric_type="COSINE",
    params={ "nlist": 128 }
)

## Create Collection

In [14]:
client.create_collection(
    collection_name="Product_Reviews_Collection",
    schema=schema,
    index_params=index_params
)


res = client.get_load_state(
    collection_name="Product_Reviews_Collection"
)

print(res)

{'state': <LoadState: Loaded>}


In [15]:
data = {}
data["id"]= df['id'].tolist()
data["product_id"]= df['ProductId'].tolist()
data["product_name"]= df['ProductName'].tolist()
data["reviews"]= df['Review'].tolist()
data["embeddings"] = df['Embeddings'].tolist()

df_new = pd.DataFrame(data)

# Convert the DataFrame into the desired format
data_list = []
for index, row in df_new.iterrows():
    data_list.append({
        "id": row['id'],
        "product_id": row['product_id'],
        "product_name": row['product_name'],
        "reviews": row['reviews'],
        "embeddings": row['embeddings']
    })
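
The loop above turns column-oriented data into the row-oriented list of dictionaries passed to `client.insert()`. The same reshaping can be sketched without pandas; the `columns_to_rows` helper and the two-row sample below are hypothetical and only illustrate the target format.

```python
def columns_to_rows(columns):
    """Turn {"field": [v1, v2, ...]} into [{"field": v1}, {"field": v2}, ...]."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


# Hypothetical two-row sample with shortened vectors
sample = {
    "id": [1, 2],
    "product_id": [1, 1],
    "reviews": ["This product is amazing!", "I love it!"],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
}

rows = columns_to_rows(sample)
print(rows[0])
```

Each dictionary has one key per schema field, which is the shape `data_list` has in this notebook.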

## Insert Data

In [16]:
res = client.insert(
    collection_name="Product_Reviews_Collection",
    data=data_list
)

print(res)

{'insert_count': 50, 'ids': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50], 'cost': 0}
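
`BATCH_SIZE` is defined earlier in the notebook, but the insert above sends all 50 rows in a single call. For larger datasets, one option is to split the rows into batches first. The sketch below shows only the splitting; the `batched` helper and the stand-in rows are hypothetical, and the insert call is left as a comment because it needs a live connection.

```python
def batched(rows, batch_size):
    """Yield consecutive slices of `rows` with at most `batch_size` items."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Hypothetical stand-in for data_list: 12 rows split into batches of 5
rows = [{"id": i} for i in range(1, 13)]
batch_sizes = [len(batch) for batch in batched(rows, batch_size=5)]
print(batch_sizes)

# With a live connection, each batch could be passed to the same call
# used above (assumption: one insert call per batch):
# for batch in batched(data_list, BATCH_SIZE):
#     client.insert(collection_name=collection_name, data=batch)
```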


## Query 

Tested with query texts:

* superb
* worst experience in my life
* lovely
* terrible
* yuck
* falling in love with it
* was ok
  
NOTE: Results may vary with the quality of the embedding model, the length of the text, and the diversity of the data.

In [17]:
question_text = "Superb"
question_vector = model.encode(question_text).tolist()

## Without Grouping Search 
COSINE is used as the metric type. With this metric, the `distance` value returned by Milvus is the cosine similarity, so a higher value means the data points are more similar.
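
To make the scale concrete, cosine similarity can be computed by hand for a few toy vectors. This is a plain-Python sketch of the formula, not the Milvus implementation; the vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 2-dimensional vectors
query = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 3.0]
opposite = [-1.0, 0.0]

for name, vec in [("same direction", same_direction),
                  ("orthogonal", orthogonal),
                  ("opposite", opposite)]:
    print(name, cosine_similarity(query, vec))
```

The value depends only on the angle between the vectors, not on their length, and ranges from -1 (opposite) to 1 (same direction).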

In [18]:
client.load_collection(collection_name)  # Make sure the collection is loaded

# Search without `group_by_field`
res = client.search(
    collection_name=collection_name,
    anns_field="embeddings",  # Vector field to search
    data=[question_vector],  # Query vector
    search_params={
        "metric_type": "COSINE",
        "params": {},
    },
    limit=10,  # Max. number of search results to return
    output_fields=["reviews", "product_name"],
)

# Print every hit; `res` holds one list of hits per query vector
print("\n")
for hits in res:
    print("\nResults:")
    for hit in hits:
        print(hit)
    print("\n")




Results:
{'id': 6, 'distance': 0.48640212416648865, 'entity': {'reviews': 'Very satisfied!', 'product_name': 'BestProductA'}}
{'id': 20, 'distance': 0.4796062111854553, 'entity': {'reviews': 'Satisfied.', 'product_name': 'GoodProductB'}}
{'id': 12, 'distance': 0.47872641682624817, 'entity': {'reviews': 'Quite satisfied.', 'product_name': 'GoodProductB'}}
{'id': 18, 'distance': 0.436907023191452, 'entity': {'reviews': 'Overall good.', 'product_name': 'GoodProductB'}}
{'id': 4, 'distance': 0.4205872714519501, 'entity': {'reviews': 'Excellent quality!', 'product_name': 'BestProductA'}}
{'id': 14, 'distance': 0.418302059173584, 'entity': {'reviews': 'Decent quality.', 'product_name': 'GoodProductB'}}
{'id': 30, 'distance': 0.3950171172618866, 'entity': {'reviews': 'Impressed with the product.', 'product_name': 'AverageProductC'}}
{'id': 34, 'distance': 0.362601637840271, 'entity': {'reviews': 'Works perfectly!', 'product_name': 'BadProductD'}}
{'id': 5, 'distance': 0.362601637840271, 'e

## With Grouping Search

In [19]:
# Make sure the collection is loaded
client.load_collection(collection_name)

# Search with `group_by_field`
res = client.search(
    collection_name=collection_name,
    data=[question_vector],  # Query vector
    search_params={
        "metric_type": "COSINE",
        "params": {"nprobe": 10},
    },
    limit=10,  # Max. number of search results to return
    group_by_field="product_id",  # Group results by product ID
    output_fields=["product_name", "reviews", "product_id"],
)

# Product name of the top hit in each group
product_names = [hit['entity']['product_name'] for hit in res[0]]

# Print every hit; `res` holds one list of hits per query vector
for hits in res:
    print("\nGroup:")
    for hit in hits:
        print(hit)
print("\n")


Group:
{'id': 6, 'distance': 0.48640212416648865, 'entity': {'product_name': 'BestProductA', 'reviews': 'Very satisfied!', 'product_id': 1}}
{'id': 20, 'distance': 0.4796062111854553, 'entity': {'product_name': 'GoodProductB', 'reviews': 'Satisfied.', 'product_id': 2}}
{'id': 30, 'distance': 0.3950171172618866, 'entity': {'product_name': 'AverageProductC', 'reviews': 'Impressed with the product.', 'product_id': 3}}
{'id': 34, 'distance': 0.362601637840271, 'entity': {'product_name': 'BadProductD', 'reviews': 'Works perfectly!', 'product_id': 4}}
{'id': 44, 'distance': 0.35716643929481506, 'entity': {'product_name': 'WorstProductE', 'reviews': 'Not Bad', 'product_id': 5}}




As expected, 'BestProductA' is at the top of the result list, making it the most relevant product, and no product is repeated in the search results.
The similarity search runs on the vector field (`embeddings`), while the results are grouped by the `product_id` field and return the names of the products relevant to the search query.
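
The grouped response is a list with one list of hits per query vector, so a ranked product list can be read straight from it. The sketch below uses a hand-copied sample of the first three hits printed above in place of a live `res`; the variable names are hypothetical.

```python
# Sample shaped like the grouped search response above
res_sample = [[
    {"id": 6, "distance": 0.48640212416648865,
     "entity": {"product_name": "BestProductA", "reviews": "Very satisfied!", "product_id": 1}},
    {"id": 20, "distance": 0.4796062111854553,
     "entity": {"product_name": "GoodProductB", "reviews": "Satisfied.", "product_id": 2}},
    {"id": 30, "distance": 0.3950171172618866,
     "entity": {"product_name": "AverageProductC", "reviews": "Impressed with the product.", "product_id": 3}},
]]

# Hits for the first (and only) query vector, already ordered by similarity
ranked_products = [hit["entity"]["product_name"] for hit in res_sample[0]]

for rank, name in enumerate(ranked_products, start=1):
    print(rank, name)
```

Because of the grouping, each product name appears once, so no client-side deduplication is needed.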

Copyright © IBM Corp. 2024. This notebook and its source code are released under the terms of the MIT License.