# MILVUS Demo - Scalar filtering 

# Scalar filtering during Search with Milvus in watsonx.data

## Disclaimers
- Use only Projects and Spaces that are available in watsonx context.

## Overview
### Audience
This notebook demonstrates how to implement scalar filtering.
To enhance the relevance of search results, you can filter metadata associated with vector embeddings before performing a search. When Milvus processes a search request that includes a filtering condition, it narrows the search to entities that meet the specified criteria, optimizing the search scope. 
The scenario in this notebook filters product details to suggest products that match a user's needs.

Some familiarity with Python programming, search algorithms, and basic machine learning concepts is recommended. The code runs with Python 3.10 or later.
### Learning goal
This notebook demonstrates Milvus scalar filtering support in watsonx.data, introducing commands for:
- Connecting to Milvus
- Creating collections
- Creating indexes
- Generating embeddings
- Ingesting data
- Data retrieval


### About Milvus 

Milvus is an open-source vector database designed for scalable similarity search and AI applications. It enables efficient storage, indexing, and retrieval of vector embeddings, which are central to modern machine learning and artificial intelligence tasks. To learn more, see the [Milvus documentation](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=components-milvus).

### Milvus: Three Fundamental Steps

#### 1. Data Preparation
Collect and convert your data into high-dimensional vector embeddings. These vectors are typically generated using machine learning models like neural networks, which transform text, images, audio, or other data types into dense numerical representations that capture semantic meaning and relationships.

#### 2. Vector Insertion
Load the dense and sparse vector embeddings into Milvus collections or partitions within a database. Milvus builds indexes to speed up subsequent searches and supports several index types, such as IVF_FLAT and HNSW, depending on the index definition.

#### 3. Similarity Search
Perform vector similarity searches by providing a query vector and a reranking weight. Milvus will rapidly return the most similar vectors from the collection or partitions based on the defined metrics like cosine similarity, Euclidean distance, or inner product and the reranking weight.
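
The three metrics named above can be illustrated with a minimal pure-Python sketch. The vectors and function names are hypothetical and chosen only for illustration; Milvus computes these metrics internally over indexed data.

```python
import math

def inner_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line (L2) distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Inner product of the length-normalised vectors, in [-1, 1]."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)

# Hypothetical 2-d vectors standing in for real embeddings.
query = [1.0, 0.0]
candidate = [3.0, 4.0]

print(inner_product(query, candidate))       # 3.0
print(euclidean_distance(query, candidate))  # sqrt(20), about 4.472
print(cosine_similarity(query, candidate))   # 3 / (1 * 5) = 0.6
```

Note that the metrics rank differently: inner product and cosine similarity treat larger values as closer, while Euclidean distance treats smaller values as closer.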

### Why use Scalar Filtering
Scalar filtering during search is a technique that restricts the search space by applying filters on scalar attributes (for example, numerical, categorical, or boolean data) associated with each record or vector. The filter is applied before the similarity search, such as an Approximate Nearest Neighbor (ANN) search, is executed. Narrowing the search scope in this way improves search relevance and search performance.
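
The idea can be sketched in plain Python: discard records that fail the scalar condition, then rank only the survivors by vector distance. This is a conceptual illustration with hypothetical toy records and an exact brute-force search, not a description of how Milvus implements filtering over its ANN indexes.

```python
import math

# Hypothetical toy records: scalar fields plus a 2-d embedding.
products = [
    {"p_id": 1, "price": 2500.0, "avg_rating": 4.2, "vector": [0.9, 0.1]},
    {"p_id": 2, "price": 5200.0, "avg_rating": 4.6, "vector": [1.0, 0.0]},
    {"p_id": 3, "price": 1800.0, "avg_rating": 3.1, "vector": [0.8, 0.2]},
    {"p_id": 4, "price": 2100.0, "avg_rating": 3.9, "vector": [0.1, 0.9]},
]

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def filtered_search(records, query, predicate, limit):
    """Keep records that satisfy the predicate, then return the nearest ones."""
    candidates = [r for r in records if predicate(r)]
    candidates.sort(key=lambda r: l2(r["vector"], query))
    return candidates[:limit]

hits = filtered_search(
    products,
    query=[1.0, 0.0],
    predicate=lambda r: r["price"] < 3000 and r["avg_rating"] > 3.5,
    limit=2,
)
print([r["p_id"] for r in hits])  # [1, 4]
```

Product 2 is the closest vector to the query, but it is excluded because its price fails the filter, so it never competes in the ranking.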

### Key Workflow

1. **Definition** (once)
2. **Ingestion** (once)
3. **Retrieve relevant passage(s)** (for every user query)

## Contents

- Environment Setup
- Install packages
- Document data loading
- Create connection
- Ingest data
- Retrieve relevant data

## Environment Setup

Before using the sample code in this notebook, complete the following setup tasks:

- Create a watsonx.data instance (a free plan is offered)
  - Information about creating a watsonx.data instance can be found [here](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.0.x)


## Install required packages

In [1]:
%%capture
!pip install numpy

In [2]:
%%capture
!pip install torch

In [3]:
%%capture
!pip install sentence-transformers

### Install Pymilvus SDK

In [4]:
# !pip install pymilvus
# Restart the kernel after installing
!pip show pymilvus



### Post pymilvus installations

In [5]:
import psutil
import sys
import numpy as np
import time
import pandas as pd
from pymilvus import MilvusClient, DataType, CollectionSchema, FieldSchema, utility, connections, Collection

In [6]:
%%capture
%pip install "pymilvus[model]"

In [7]:
%%capture
%pip install tensorflow

In [8]:
%%capture
%pip install --upgrade transformers

## Preparing data

In [9]:
%%capture
%pip install bs4

You can create your own dataset or source one from a reputable online repository. Make sure it includes the columns used in this notebook: `p_id`, `name`, `price`, `colour`, `avg_rating`, and `description`.

In [10]:
from bs4 import BeautifulSoup

# Load the CSV file
file_path = './data/Fashion Dataset.csv'  # Replace with your CSV file path
columns_to_extract = ['p_id', 'name', 'price', 'colour','avg_rating','description']  # Replace with your desired column names

# Read the CSV file and extract selected columns
df = pd.read_csv(file_path, usecols=columns_to_extract, keep_default_na=False).replace('', 0).head(50)


In [11]:

def clean_html(html):
    if not isinstance(html, str):  # Handle non-string values
        html = str(html)
    return BeautifulSoup(html, "html.parser").get_text()

# Apply the function to the 'description' column
df['cleaned_description'] = df['description'].apply(clean_html)

# Drop the 'description' column
df = df.drop(columns=['description'])

# Save the extracted data to a new CSV (optional)
output_file_path = 'extracted_columns.csv'
df.to_csv(output_file_path, index=False)  


In [12]:
df.head(5)

| | p_id | name | price | colour | avg_rating | cleaned_description |
|---|---|---|---|---|---|---|
| 0 | 17048614 | Khushal K Women Black Ethnic Motifs Printed Ku... | 5099 | Black | 4.4183989385227775 | Black printed Kurta with Palazzos with dupatta... |
| 1 | 16524740 | InWeave Women Orange Solid Kurta with Palazzos... | 5899 | Orange | 4.119333950046253 | Orange solid Kurta with Palazzos with dupattaK... |
| 2 | 16331376 | Anubhutee Women Navy Blue Ethnic Motifs Embroi... | 4899 | Navy Blue | 4.161529680365297 | Navy blue embroidered Kurta with Trousers with... |
| 3 | 14709966 | Nayo Women Red Floral Printed Kurta With Trous... | 3699 | Red | 4.088986141502553 | Red printed kurta with trouser and dupattaKurt... |
| 4 | 11056154 | AHIKA Women Black & Green Printed Straight Kurta | 1350 | Black | 3.978377362038169 | Black and green printed straight kurta, has a ... |


#### Setting Up BM25 and SentenceTransformer for Text Analysis

In [13]:
from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer
from pymilvus.model.sparse import BM25EmbeddingFunction



In [14]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

In [15]:

analyzer = build_default_analyzer(language="en")

bm25_ef = BM25EmbeddingFunction(analyzer)

### Connect to Milvus

In [16]:
from pymilvus import MilvusClient
# Replace the placeholder values (<>) with the values of your provisioned Milvus service.

uri = "https://<host>:<port>"  # Construct URI from host and port
user = "<>"
password = "<>"
# Create a MilvusClient instance. Use ONE of the two variants below.
# (Previously both calls sat inside a string literal, so milvus_client was never defined.)

# On-premises: also pass the server certificate path.
# milvus_client = MilvusClient(
#     uri=uri,
#     user=user,
#     password=password,
#     secure=True,
#     server_pem_path='<>',
#     server_name='<>',
# )

# SaaS
milvus_client = MilvusClient(
    uri=uri,
    user=user,
    password=password,
    secure=True,
    server_name='<>',
)

In [17]:
COLLECTION_NAME = "Milvus_test_scalar_filter"
DIMENSION = 384
BATCH_SIZE = 2
TOPK = 1
fmt = "=== {:30} ==="
search_latency_fmt = "search latency = {:.4f}s"

In [18]:
if milvus_client.has_collection(collection_name=COLLECTION_NAME):
    milvus_client.drop_collection(collection_name=COLLECTION_NAME)

In [19]:
milvus_client.has_collection(collection_name=COLLECTION_NAME)

False

## Create Milvus schema 
[More about schemas](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=milvus-connecting-service#taskconctmilvus__postreq__1)

In [20]:
# Create schema
schema = milvus_client.create_schema(
    auto_id=False,
    enable_dynamic_field=True,
)
# Fields: [p_id, name, cleaned_description, price, avg_rating, colour, name_embedding, description_embedding]
# Add fields to schema
schema.add_field(field_name="p_id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="name", datatype=DataType.VARCHAR, max_length=65535)
schema.add_field(field_name="cleaned_description", datatype=DataType.VARCHAR, max_length=65535)
schema.add_field(field_name="price", datatype=DataType.FLOAT)
schema.add_field(field_name="avg_rating", datatype=DataType.FLOAT, nullable=True)
schema.add_field(field_name="colour", datatype=DataType.VARCHAR, max_length=25)
schema.add_field(field_name="name_embedding", datatype=DataType.SPARSE_FLOAT_VECTOR)
schema.add_field(field_name="description_embedding", datatype=DataType.FLOAT_VECTOR, dim=DIMENSION)

{'auto_id': False, 'description': '', 'fields': [{'name': 'p_id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'name', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 65535}}, {'name': 'cleaned_description', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 65535}}, {'name': 'price', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'avg_rating', 'description': '', 'type': <DataType.FLOAT: 10>, 'nullable': True}, {'name': 'colour', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 25}}, {'name': 'name_embedding', 'description': '', 'type': <DataType.SPARSE_FLOAT_VECTOR: 104>}, {'name': 'description_embedding', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 384}}], 'enable_dynamic_field': True}

## Create Index 
[More on indexes](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=milvus-connecting-service#taskconctmilvus__postreq__1)

In [21]:
# Create index parameters
index_params = milvus_client.prepare_index_params()

# Add the first index, for the sparse name_embedding field
index_params.add_index(
    field_name="name_embedding",
    index_type="SPARSE_INVERTED_INDEX",
    metric_type="IP",
    params={"drop_ratio_build": 0.2}
)

# Add the second index, for the dense description_embedding field
index_params.add_index(
    field_name="description_embedding",
    index_type="IVF_SQ8",
    metric_type="L2",
    params={"nlist": 128}
)

## Create Collection and Load Data 

In [22]:
# Create index and load collection
milvus_client.create_collection(
    collection_name=COLLECTION_NAME,
    schema=schema,
    index_params=index_params
)

# Load the collection
milvus_client.load_collection(collection_name=COLLECTION_NAME)

## Generate Embeddings

We generate two types of embeddings. The first uses a sentence transformer, which produces dense, context-aware (semantic) vectors. The second uses BM25, which produces sparse vectors from TF-IDF-style term weights and is suited to keyword search. To learn more, see [Dense](https://github.ibm.com/Aldrin-Dennis1/milvus-enhanced-documentation/blob/main/in-memmory-indexes-and-similarity-metrics.md) and [Sparse](https://github.ibm.com/Aldrin-Dennis1/milvus-enhanced-documentation/blob/main/in-memmory-indexes-and-similarity-metrics-sparse-embeddings.md) vector embeddings, or refer to [In-Memory Index](https://milvus.io/docs/index.md?tab=floating).
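
To make the keyword-based side more concrete, here is a minimal sketch of standard BM25 weighting in plain Python. The mini-corpus, the `K1` and `B` values, and the function names are assumptions for illustration; this is not the internal code of `BM25EmbeddingFunction`, only the general formula it is named after.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of already-tokenised product names.
corpus = [
    ["women", "black", "print", "kurta"],
    ["women", "red", "floral", "print", "kurta"],
    ["women", "navy", "blue", "cotton", "kurta", "palazzo"],
]

K1, B = 1.5, 0.75  # commonly used BM25 constants (assumed here)
avg_len = sum(len(doc) for doc in corpus) / len(corpus)
doc_freq = Counter(term for doc in corpus for term in set(doc))

def idf(term):
    """Rare terms get a higher weight than terms found in most documents."""
    n = doc_freq.get(term, 0)
    return math.log((len(corpus) - n + 0.5) / (n + 0.5) + 1.0)

def sparse_doc_vector(doc):
    """Map each term of a document to its length-normalised frequency weight."""
    tf = Counter(doc)
    norm = K1 * (1 - B + B * len(doc) / avg_len)
    return {term: freq * (K1 + 1) / (freq + norm) for term, freq in tf.items()}

def bm25_score(query_terms, doc):
    """Sum idf * document weight over the query terms."""
    weights = sparse_doc_vector(doc)
    return sum(idf(term) * weights.get(term, 0.0) for term in query_terms)

scores = [bm25_score(["floral", "kurta"], doc) for doc in corpus]
best = max(range(len(corpus)), key=scores.__getitem__)
print(best)  # 1: the only document that contains "floral"
```

Each document becomes a mapping from term to weight with most terms absent, which is why the result is stored as a sparse vector. The term "kurta" appears in every document, so it contributes little; "floral" is rare and dominates the score.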

### 1. Dense embeddings - Sentence Transformer Embeddings - semantic

In [23]:

# Generate embeddings
# print(type(df['description_embedding']))
df['description_embedding'] = df['cleaned_description'].apply(lambda x: model.encode(x).tolist())

# Show the DataFrame with embeddings
df.head(5)


| | p_id | name | price | colour | avg_rating | cleaned_description | description_embedding |
|---|---|---|---|---|---|---|---|
| 0 | 17048614 | Khushal K Women Black Ethnic Motifs Printed Ku... | 5099 | Black | 4.4183989385227775 | Black printed Kurta with Palazzos with dupatta... | [-0.022137021645903587, 0.08476054668426514, -... |
| 1 | 16524740 | InWeave Women Orange Solid Kurta with Palazzos... | 5899 | Orange | 4.119333950046253 | Orange solid Kurta with Palazzos with dupattaK... | [-0.018848655745387077, 0.065238818526268, -0.... |
| 2 | 16331376 | Anubhutee Women Navy Blue Ethnic Motifs Embroi... | 4899 | Navy Blue | 4.161529680365297 | Navy blue embroidered Kurta with Trousers with... | [-0.06157408282160759, 0.08424308151006699, 0.... |
| 3 | 14709966 | Nayo Women Red Floral Printed Kurta With Trous... | 3699 | Red | 4.088986141502553 | Red printed kurta with trouser and dupattaKurt... | [-0.05737128108739853, 0.07336778193712234, -0... |
| 4 | 11056154 | AHIKA Women Black & Green Printed Straight Kurta | 1350 | Black | 3.978377362038169 | Black and green printed straight kurta, has a ... | [-0.044882744550704956, 0.088098905980587, 0.0... |


### 2. Sparse Embeddings - BM25 Embeddings - keyword based

In [24]:
corpus = df['name'].tolist()

In [25]:
tokens = []
for i in corpus:
    tokens.append(analyzer(i))
print("tokens:", tokens)

tokens: [['khushal', 'k', 'women', 'black', 'ethnic', 'motif', 'print', 'kurta', 'palazzo', 'dupatta'], ['inweav', 'women', 'orang', 'solid', 'kurta', 'palazzo', 'floral', 'print', 'dupatta'], ['anubhute', 'women', 'navi', 'blue', 'ethnic', 'motif', 'embroid', 'thread', 'work', 'kurta', 'trouser', 'dupatta'], ['nayo', 'women', 'red', 'floral', 'print', 'kurta', 'trouser', 'dupatta'], ['ahika', 'women', 'black', 'green', 'print', 'straight', 'kurta'], ['soch', 'women', 'red', 'thread', 'work', 'georgett', 'anarkali', 'kurta'], ['liba', 'women', 'navi', 'blue', 'pure', 'cotton', 'floral', 'print', 'kurta', 'palazzo', 'dupatta'], ['ahalyaa', 'women', 'beig', 'floral', 'print', 'regular', 'got', 'ta', 'patti', 'kurta', 'palazzo', 'dupatta'], ['anouk', 'women', 'yellow', 'white', 'print', 'kurta', 'palazzo'], ['khushal', 'k', 'women', 'green', 'pink', 'print', 'pure', 'cotton', 'kurta', 'palazzo', 'dupatta'], ['liba', 'floral', 'bliss', 'side', 'pocket', 'cotton', 'kurta', 'set'], ['varanga

In [26]:
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"
bm25_ef = BM25EmbeddingFunction(analyzer)

bm25_ef.fit(corpus)

In [27]:
# Create embeddings for the documents
docs_embeddings = bm25_ef.encode_documents(corpus)

# print("Embeddings:", docs_embeddings)
# Since the output embeddings are in a 2D csr_array format, we convert them to a list for easier manipulation.
#print("Sparse dim:", bm25_ef.dim, list(docs_embeddings)[0].shape)

In [28]:
docs_embeddings.shape

(50, 106)

In [29]:
# Convert the CSR array into the list-of-dicts format that Milvus expects for sparse vectors
import numpy as np

def csr_to_dict_list(csr_array):
    result = []
    indptr = csr_array.indptr
    indices = csr_array.indices
    data = csr_array.data

    for i in range(csr_array.shape[0]):
        start, end = indptr[i], indptr[i+1]
        row_indices = indices[start:end]
        row_data = data[start:end]
        row_dict = dict(zip(row_indices, row_data))
        result.append(row_dict)
    
    return result

# Use the function
converted_data = csr_to_dict_list(docs_embeddings)
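
For readers unfamiliar with the CSR layout that the function above walks through, the following sketch shows the same row-slicing logic on plain Python lists. The matrix values are hypothetical, and the explicit `int`/`float` casts are an optional assumption added for clarity.

```python
# A 2 x 4 sparse matrix in CSR form (hypothetical values):
#   row 0 has non-zeros at columns 0 and 3, row 1 at column 1.
indptr = [0, 2, 3]      # row i occupies positions indptr[i]:indptr[i + 1]
indices = [0, 3, 1]     # column index of each stored value
data = [0.5, 1.2, 0.7]  # the stored non-zero values

rows = []
for i in range(len(indptr) - 1):
    start, end = indptr[i], indptr[i + 1]
    # Pair each column index with its value for this row only.
    rows.append({int(c): float(v) for c, v in zip(indices[start:end], data[start:end])})

print(rows)  # [{0: 0.5, 3: 1.2}, {1: 0.7}]
```

Each resulting dictionary maps a dimension index to its weight, which matches the `{index: value}` shape shown in the `name_embedding` column.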

In [30]:
df['name_embedding'] = converted_data
df = df[['p_id','name','cleaned_description','price','avg_rating','colour','name_embedding','description_embedding']]
df.head(5)

| | p_id | name | cleaned_description | price | avg_rating | colour | name_embedding | description_embedding |
|---|---|---|---|---|---|---|---|---|
| 0 | 17048614 | Khushal K Women Black Ethnic Motifs Printed Ku... | Black printed Kurta with Palazzos with dupatta... | 5099 | 4.4183989385227775 | Black | {0: 0.982535, 1: 0.982535, 2: 0.982535, 3: 0.9... | [-0.022137021645903587, 0.08476054668426514, -... |
| 1 | 16524740 | InWeave Women Orange Solid Kurta with Palazzos... | Orange solid Kurta with Palazzos with dupattaK... | 5899 | 4.119333950046253 | Orange | {2: 1.0298684, 6: 1.0298684, 7: 1.0298684, 8: ... | [-0.018848655745387077, 0.065238818526268, -0.... |
| 2 | 16331376 | Anubhutee Women Navy Blue Ethnic Motifs Embroi... | Navy blue embroidered Kurta with Trousers with... | 4899 | 4.161529680365297 | Navy Blue | {2: 0.8998223, 4: 0.8998223, 5: 0.8998223, 7: ... | [-0.06157408282160759, 0.08424308151006699, 0.... |
| 3 | 14709966 | Nayo Women Red Floral Printed Kurta With Trous... | Red printed kurta with trouser and dupattaKurt... | 3699 | 4.088986141502553 | Red | {2: 1.081993, 6: 1.081993, 7: 1.081993, 9: 1.0... | [-0.05737128108739853, 0.07336778193712234, -0... |
| 4 | 11056154 | AHIKA Women Black & Green Printed Straight Kurta | Black and green printed straight kurta, has a ... | 1350 | 3.978377362038169 | Black | {2: 1.1396754, 3: 1.1396754, 6: 1.1396754, 7: ... | [-0.044882744550704956, 0.088098905980587, 0.0... |


## Ingestion

In [31]:
# Prepare data for insertion as an array of dictionaries
data_to_insert = [
    {
        "p_id": int(row["p_id"]),
        "name": str(row["name"]),
        "cleaned_description": str(row["cleaned_description"]),
        "price": float(row["price"]),
        "avg_rating": float(row["avg_rating"]),
        "colour": row["colour"],
        "name_embedding": row["name_embedding"],
        "description_embedding": row["description_embedding"],
    }
    for _, row in df.iterrows()
]

# Insert data into the Milvus collection
res = milvus_client.insert(
    collection_name=COLLECTION_NAME,
    data=data_to_insert
)

print("Data inserted successfully:", res)

Data inserted successfully: {'insert_count': 50, 'ids': [17048614, 16524740, 16331376, 14709966, 11056154, 18704418, 14046594, 14951330, 13791594, 17048604, 10356859, 12413214, 16600750, 9867983, 14399798, 18372852, 15241816, 10808284, 14376546, 19240164, 15055304, 14649544, 17581924, 13810898, 15150620, 19240246, 12766966, 14023594, 14346084, 18829920, 11459658, 16522612, 15150642, 16875218, 10808290, 12246214, 9438657, 16331336, 7763575, 19357994, 18957838, 18425500, 14925332, 18811288, 15646472, 18503778, 15561680, 16857670, 13437328, 19181470]}
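
The notebook inserts all 50 rows in a single call. For larger datasets it may be preferable to insert in chunks, for example using the `BATCH_SIZE` constant defined earlier. The helper below is a hypothetical sketch of that pattern; `batched` and `sample_rows` are illustrative names, and the actual insert call is left as a comment so that the snippet runs without a Milvus connection.

```python
def batched(rows, batch_size):
    """Yield consecutive slices of `rows` with at most `batch_size` items."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

sample_rows = [{"p_id": i} for i in range(5)]  # stand-in for data_to_insert

for batch in batched(sample_rows, batch_size=2):
    # In the notebook, each batch would be sent with:
    # milvus_client.insert(collection_name=COLLECTION_NAME, data=batch)
    print(len(batch))  # prints 2, 2, 1
```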


## Querying with scalar filtering

With scalar filtering, Milvus applies the metadata filter during the search, which reduces the search scope from the whole collection to only the entities that match the filter conditions.

For more details, see [scalar filtering](https://milvus.io/docs/filtered-search.md#Filtered-Search).
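
A filter is passed to Milvus as a boolean expression string over scalar fields. When the conditions come from user input, it can help to assemble that string programmatically. The helper below is a hypothetical sketch: `build_filter` is not part of pymilvus, it assumes the `price`, `avg_rating`, and `colour` fields from this notebook's schema, and it does not escape quotes inside colour values.

```python
def build_filter(max_price=None, min_rating=None, colours=None):
    """Join the supplied conditions with 'and'; return '' if none are given."""
    clauses = []
    if max_price is not None:
        clauses.append(f"price < {max_price}")
    if min_rating is not None:
        clauses.append(f"avg_rating > {min_rating}")
    if colours:
        quoted = ", ".join(f'"{c}"' for c in colours)
        clauses.append(f"colour in [{quoted}]")
    return " and ".join(clauses)

expr = build_filter(max_price=3000, min_rating=3.5, colours=["Red", "Pink"])
print(expr)  # price < 3000 and avg_rating > 3.5 and colour in ["Red", "Pink"]
```

The resulting string can be supplied as the `filter` argument of `milvus_client.search`.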

In [35]:
query1 = ["show me cloths that are flowy and with flower designs"]

In [33]:
#semantic search
st_query_embeddings = model.encode(query1)
print("Query text ST Embeddings:", st_query_embeddings.shape)


Query text ST Embeddings: (1, 384)


In [34]:
query_vector = st_query_embeddings
res = milvus_client.search(
    collection_name=COLLECTION_NAME,
    data=query_vector,
    limit=3,
    filter='price < 3000 and avg_rating > 3.5',  # scalar filter: price below 3000 and rating above 3.5
    output_fields=["name", "price", "cleaned_description"],
    anns_field="description_embedding"
)

for hit in res[0]:
    entity = hit['entity']
    print(f"\nname: {entity['name']}\nPrice: {entity['price']}\ncleaned_description: {entity['cleaned_description']}")


name: KALINI Women Pink Floral Embroidered Layered Gotta Patti Kurti with Sharara & With Dupatta
Price: 2999.0
cleaned_description: Pink embroidered Kurti with Sharara with dupattaKurti design: Floral embroideredA-line shapeLayered styleKeyhole neck, three quarter regular sleevesNa pockets gotta patti detailAbove knee length with curved hemViscose rayon machine weave fabricSharara design: Self design ShararaElasticated waistbandZip closureThe model (height 5'8) is wearing a size SMachine Wash

name: Anouk Women Red Floral Print A-Line Kurta
Price: 1699.0
cleaned_description: Red Floral Print A-line kurta, has a round neck, short sleeves, flared hem, has gathers along the waistThe model (height 5'8") is wearing a size S100% viscoseHand-wash

name: Vishudh Women Navy Blue Floral Printed Regular Pure Cotton Kurta with Palazzos
Price: 2149.0
cleaned_description: Navy blue printed Kurta with Palazzos    Kurta design:     Floral printed   Straight shape   Regular style   Mandarin collar,  t