# MILVUS Demo - BitMap Index

## BitMap Indexing for scalar filtering with Milvus in watsonx.data
### Overview
This notebook demonstrates how to implement and use BitMap indexing for querying in Milvus. For this scenario we have created a dataset of companies with the fields Id, Name, Category, Public, and Description.

Some familiarity with Python programming, search algorithms, and basic machine learning concepts is recommended. The code runs with Python 3.9 or later.

### Learning goal
This notebook demonstrates scalar filtering with a BitMap index in watsonx.data Milvus, both combined with similarity search over dense embeddings and on its own, introducing commands for:
- Connecting to Milvus
- Creating collections
- Creating indexes
- Generating embeddings
- Ingesting data
- Data retrieval

### About Milvus 

Milvus is an open-source vector database designed specifically for scalable similarity search and AI applications. It's a powerful platform that enables efficient storage, indexing, and retrieval of vector embeddings, which are crucial in modern machine learning and artificial intelligence tasks. [To learn more, visit the Milvus documentation](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=components-milvus).

### Milvus: Three Fundamental Steps

#### 1. Data Preparation
Collect and convert your data into high-dimensional vector embeddings. These vectors are typically generated using machine learning models like neural networks, which transform text, images, audio, or other data types into dense numerical representations that capture semantic meaning and relationships.

#### 2. Vector Insertion
Load the vector embeddings into Milvus collections or partitions within a database. Milvus builds indexes to optimize subsequent search operations and supports various indexing algorithms, such as IVF_FLAT and HNSW, depending on the index definition.

#### 3. Similarity Search
Perform vector similarity searches by providing a query vector and, optionally, a filter. Milvus returns the most similar vectors from the collection or partitions that satisfy the filter, ranked by the configured metric, such as cosine similarity, Euclidean distance, or inner product.
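To make the combination of a metric and a filter concrete, here is a minimal brute-force sketch in plain Python. It is not how Milvus executes a search internally; the helper names (`l2_distance`, `inner_product`, `cosine_similarity`, `filtered_search`) and the tiny two-dimensional vectors are assumptions made purely for illustration.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    """Inner (dot) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)

def filtered_search(query, rows, predicate, limit):
    """Keep the rows that pass the filter, then rank them by L2 distance."""
    candidates = [row for row in rows if predicate(row)]
    candidates.sort(key=lambda row: l2_distance(query, row["vector"]))
    return candidates[:limit]

# Hypothetical toy rows; real embeddings in this notebook have 384 dimensions
rows = [
    {"id": 1, "category": "Tech", "vector": [1.0, 0.0]},
    {"id": 2, "category": "News", "vector": [0.9, 0.1]},
    {"id": 3, "category": "Tech", "vector": [0.0, 1.0]},
]

top = filtered_search([1.0, 0.0], rows, lambda row: row["category"] == "Tech", limit=2)
print([row["id"] for row in top])  # [1, 3]
```

Row 2 is the second-closest vector to the query, but it never reaches the ranking step because the filter removes it first.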

### Why BitMap Index
A BitMap index is built on a scalar field. Milvus uses it to narrow down the candidate rows for a query, which reduces retrieval time. For more details, see the [Milvus BitMap index documentation](https://milvus.io/docs/bitmap.md).

It can be used alongside a vector search or on its own to filter the data.
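The idea behind a bitmap index can be sketched in a few lines of plain Python: each distinct value of a field gets one bitmap, and bit `i` is set when row `i` holds that value. Combining conditions then becomes a bitwise operation. This is a conceptual illustration only, not the Milvus implementation; the helpers `build_bitmap_index` and `matching_rows` are hypothetical names, and the sample values mirror the first five rows of the company dataset.

```python
from collections import defaultdict

def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i has that value."""
    index = defaultdict(int)
    for row_number, value in enumerate(values):
        index[value] |= 1 << row_number
    return dict(index)

def matching_rows(bitmap):
    """Return the row numbers whose bit is set in the bitmap."""
    rows = []
    position = 0
    while bitmap:
        if bitmap & 1:
            rows.append(position)
        bitmap >>= 1
        position += 1
    return rows

# Values of two scalar fields for five rows
categories = ["Tech", "News", "Tech", "Finance", "Finance"]
public = [True, False, False, True, False]

category_index = build_bitmap_index(categories)
public_index = build_bitmap_index(public)

# category == 'Tech' and public == True becomes a single bitwise AND
combined = category_index["Tech"] & public_index[True]
print(matching_rows(combined))  # [0]
```

Because the number of bitmaps equals the number of distinct values, this layout pays off mainly for fields with few distinct values, such as a category or a boolean flag.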

### Use case

The dataset describes 30 companies: the name, whether the company is public, a description of what the company does, and the category it falls into. We will first retrieve public Tech companies that work with AI, and then list all public Tech companies in the dataset.

### Key Workflow

1. **Definition** (once)
2. **Ingestion** (once)
3. **Retrieve relevant passage(s)** (for every user query)

## Contents

- Environment Setup
- Install packages
- Document data loading
- Create connection
- Ingest data
- Retrieve relevant data

## Environment Setup

Before using the sample code in this notebook, complete the following setup tasks:

- Create a watsonx.data instance (a free plan is offered)
  - Information about creating a watsonx.data instance can be found [here](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.0.x)


## Install required packages

In [1]:
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


### Install Pymilvus SDK

In [2]:
# %pip install pymilvus
# Restart the kernel after installing
%pip show pymilvus

Name: pymilvus
Version: 2.5.4
Summary: Python Sdk for Milvus
Home-page: 
Author: 
Author-email: Milvus Team <milvus-team@zilliz.com>
License: 
Location: /usr/local/lib/python3.9/site-packages
Requires: grpcio, milvus-lite, pandas, protobuf, python-dotenv, setuptools, ujson
Required-by: 
Note: you may need to restart the kernel to use updated packages.


## Data Preparation

In [3]:
import pandas as pd

# Define the data as a dictionary
data = {
    1: (1, "ABC", "Tech", True, "AI research and deployment organization"),
    2: (2, "DEF", "News", False, "Global news and information services"),
    3: (3, "GHI", "Tech", False, "Leading provider of search and cloud technologies"),
    4: (4, "JKL", "Finance", True, "Financial services and investment banking"),
    5: (5, "MNO", "Finance", False, "Simplified stock trading platform"),
    6: (6, "PQR", "Healthcare", True, "Global pharmaceutical and biotechnology corporation"),
    7: (7, "STU", "Education", False, "Free online courses and educational resources"),
    8: (8, "VWX", "Tech", True, "Electric vehicles and clean energy solutions"),
    9: (9, "YZA", "Entertainment", False, "Subscription-based streaming platform"),
    10: (10, "BCD", "Retail", True, "E-commerce and cloud computing giant"),
    11: (11, "EFG", "Tech", True, "Software development and cloud services"),
    12: (12, "HIJ", "News", True, "British broadcaster providing global news"),
    13: (13, "KLM", "Tech", True, "Consumer electronics and software development"),
    14: (14, "NOP", "Finance", True, "Investment banking and financial services"),
    15: (15, "QRS", "Aerospace", True, "Private space exploration company"),
    16: (16, "TUV", "News", False, "Cable news network offering global coverage"),
    17: (17, "WXY", "Education", True, "Online courses from top universities"),
    18: (18, "ZAB", "Entertainment", True, "Media and entertainment conglomerate"),
    19: (19, "CDE", "Retail", True, "Sports apparel and equipment brand"),
    20: (20, "FGH", "Tech", False, "Video conferencing and online communication tools"),
    21: (21, "IJK", "Automotive", True, "Automobile manufacturing company"),
    22: (22, "LMN", "Hospitality", False, "Short-term rentals and vacation homes platform"),
    23: (23, "OPQ", "Tech", True, "Social media and virtual reality technologies"),
    24: (24, "RST", "News", True, "Daily newspaper with global reach"),
    25: (25, "UVW", "Retail", True, "E-commerce and online retail platform"),
    26: (26, "XYZ", "Healthcare", True, "Healthcare products and pharmaceuticals"),
    27: (27, "ABC1", "Education", False, "Ivy League university offering various programs"),
    28: (28, "DEF1", "Entertainment", True, "Electronics and entertainment solutions"),
    29: (29, "GHI1", "Tech", True, "Professional networking platform"),
    30: (30, "JKL1", "Transportation", False, "Ride-hailing and delivery services"),
}

# Convert the dictionary into a DataFrame
df = pd.DataFrame.from_dict(
    data,
    orient="index",
    columns=["id", "name", "category", "public", "description"],
)

# Print the DataFrame
print(df.head(10))

    id name       category  public  \
1    1  ABC           Tech    True   
2    2  DEF           News   False   
3    3  GHI           Tech   False   
4    4  JKL        Finance    True   
5    5  MNO        Finance   False   
6    6  PQR     Healthcare    True   
7    7  STU      Education   False   
8    8  VWX           Tech    True   
9    9  YZA  Entertainment   False   
10  10  BCD         Retail    True   

                                          description  
1             AI research and deployment organization  
2                Global news and information services  
3   Leading provider of search and cloud technologies  
4           Financial services and investment banking  
5                   Simplified stock trading platform  
6   Global pharmaceutical and biotechnology corpor...  
7       Free online courses and educational resources  
8        Electric vehicles and clean energy solutions  
9               Subscription-based streaming platform  
10               E-co

## Connect to Milvus

In [4]:
from pymilvus import connections, utility

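# Replace the <> placeholder values with those of your provisioned Milvus service,
# then remove the surrounding triple quotes and keep only the call that matches your deployment.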
"""# On Prem
connections.connect(
            alias='default',
            host="<>",
            port=443,
            secure=True,
            server_pem_path="",
            server_name="<>",
            user="<>",
            password="<>")
# SaaS
connections.connect(
            alias='default',
            host="<>",
            port="<>",
            secure=True,
            user="<>",
            password="<>")"""
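To keep credentials out of the notebook, the placeholder values can be read from environment variables instead of being typed into the cell. The sketch below only builds the keyword arguments for the SaaS-style `connections.connect` call shown above. The variable names `MILVUS_HOST`, `MILVUS_PORT`, `MILVUS_USER`, and `MILVUS_PASSWORD` and the helper `milvus_connect_kwargs` are assumptions made for illustration, not names defined by watsonx.data or pymilvus.

```python
import os

# Hypothetical environment variable names; adapt them to your own setup
REQUIRED_VARS = ("MILVUS_HOST", "MILVUS_PORT", "MILVUS_USER", "MILVUS_PASSWORD")

def milvus_connect_kwargs(env=None):
    """Build keyword arguments for connections.connect from a mapping of settings."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "alias": "default",
        "host": env["MILVUS_HOST"],
        "port": env["MILVUS_PORT"],
        "secure": True,
        "user": env["MILVUS_USER"],
        "password": env["MILVUS_PASSWORD"],
    }

# Example with an explicit mapping so the sketch runs without real credentials
example_env = {
    "MILVUS_HOST": "milvus.example.com",
    "MILVUS_PORT": "443",
    "MILVUS_USER": "demo-user",
    "MILVUS_PASSWORD": "demo-password",
}
kwargs = milvus_connect_kwargs(example_env)
print({key: value for key, value in kwargs.items() if key != "password"})
```

With a reachable Milvus service, the result would be passed on as `connections.connect(**kwargs)`.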

In [5]:
COLLECTION_NAME = "Milvus_test_bitmap"
DIMENSION = 384

In [6]:
if utility.has_collection(collection_name=COLLECTION_NAME):
    utility.drop_collection(collection_name=COLLECTION_NAME)

In [7]:
utility.has_collection(collection_name=COLLECTION_NAME)

False

## Create Milvus Schema
[more about schema](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=milvus-connecting-service#taskconctmilvus__postreq__1)

In [8]:
from pymilvus import DataType, FieldSchema, CollectionSchema, Collection

# Create schema
# Define the collection schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="name", dtype=DataType.VARCHAR, max_length=20),  # Scalar field
    FieldSchema(name="category", dtype=DataType.VARCHAR, max_length=20),  # Scalar field
    FieldSchema(name="public", dtype=DataType.BOOL),  # Scalar field
    FieldSchema(name="description", dtype=DataType.VARCHAR, max_length=500),  # Scalar field
    FieldSchema(name="description_embedding", dtype=DataType.FLOAT_VECTOR, dim=DIMENSION),  # Vector field; max_length applies only to VARCHAR fields
]

schema = CollectionSchema(fields=fields)
collection = Collection(name=COLLECTION_NAME, schema=schema)

## Create Bitmap Index
[more on indexes](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.1.x?topic=milvus-connecting-service#taskconctmilvus__postreq__1)

In [9]:
# BITMAP index on the scalar field used for filtering
bitmap_index_params = {
    "index_type": "BITMAP",  # Type of index to be created
}
collection.create_index(field_name="category", index_params=bitmap_index_params)

# Vector index on the embedding field used for similarity search
vector_index_params = {
    "index_type": "IVF_SQ8",
    "metric_type": "L2",
    "params": {"nlist": 128},
}
collection.create_index(field_name="description_embedding", index_params=vector_index_params)

Status(code=0, message=)
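Bitmap indexes are generally most useful on fields with few distinct values, so it can help to check a field's cardinality before indexing it. The following is a minimal sketch of such a check in plain Python. The helper names are hypothetical, and the default threshold of 500 distinct values is an assumed rule of thumb; consult the Milvus BitMap documentation for the guidance that applies to your version.

```python
def distinct_count(values):
    """Number of distinct values in a column."""
    return len(set(values))

def is_bitmap_candidate(values, max_cardinality=500):
    """True when the column has few enough distinct values.

    max_cardinality is an assumed rule-of-thumb threshold, not a Milvus constant.
    """
    return distinct_count(values) <= max_cardinality

# Categories of the first ten companies in the dataset
sample_categories = [
    "Tech", "News", "Tech", "Finance", "Finance",
    "Healthcare", "Education", "Tech", "Entertainment", "Retail",
]

print(distinct_count(sample_categories))       # 7
print(is_bitmap_candidate(sample_categories))  # True
```

In the notebook itself the same check could be run on a DataFrame column, for example by passing `df["category"]` to `distinct_count`.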

## Generate Embedding for description field

In [10]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')



In [11]:
df['description_embedding'] = df['description'].apply(lambda x: model.encode(x).tolist())

# Show the DataFrame with embeddings
df

Unnamed: 0,id,name,category,public,description,description_embedding
1,1,ABC,Tech,True,AI research and deployment organization,"[-0.0011728338431566954, -0.06667085736989975,..."
2,2,DEF,News,False,Global news and information services,"[-0.036851268261671066, -0.04037269577383995, ..."
3,3,GHI,Tech,False,Leading provider of search and cloud technologies,"[-0.015076201409101486, -0.0003874574322253465..."
4,4,JKL,Finance,True,Financial services and investment banking,"[0.05151272192597389, -0.08495684713125229, -0..."
5,5,MNO,Finance,False,Simplified stock trading platform,"[-0.02130766399204731, 0.0012240050127729774, ..."
6,6,PQR,Healthcare,True,Global pharmaceutical and biotechnology corpor...,"[-0.031855326145887375, -0.04670903459191322, ..."
7,7,STU,Education,False,Free online courses and educational resources,"[0.013865390792489052, -0.037877537310123444, ..."
8,8,VWX,Tech,True,Electric vehicles and clean energy solutions,"[-0.04044750705361366, 0.10058614611625671, 0...."
9,9,YZA,Entertainment,False,Subscription-based streaming platform,"[-0.03334436193108559, -0.0880628451704979, -0..."
10,10,BCD,Retail,True,E-commerce and cloud computing giant,"[-0.0035822049248963594, 0.016643140465021133,..."


## Ingestion

In [12]:
data_to_insert = [
    {
        "id": int(row["id"]),
        "name": str(row["name"]),
        "category": str(row["category"]),
        "public": bool(row["public"]),
        "description": str(row["description"]),
        "description_embedding": row["description_embedding"],
    }
    for _, row in df.iterrows()
]

# Insert data into the Milvus collection
collection.insert(data_to_insert)

collection.flush()

# Check the number of entities in the collection
num_entities = collection.num_entities
print(f'Total number of entities in the collection: {num_entities}')

Total number of entities in the collection: 30
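An insert fails if a row does not fit the schema, so a quick client-side check before calling `collection.insert` can make problems easier to spot. The sketch below validates string lengths and the embedding dimension against the limits used in this notebook's schema. The helper `validate_row` is hypothetical, and it assumes that the VARCHAR `max_length` limit is counted in UTF-8 bytes.

```python
# Limits copied from the schema defined in this notebook
MAX_LENGTHS = {"name": 20, "category": 20, "description": 500}
EXPECTED_DIM = 384

def validate_row(row, max_lengths=MAX_LENGTHS, dim=EXPECTED_DIM):
    """Return a list of problems; the list is empty when the row fits the schema."""
    problems = []
    for field, limit in max_lengths.items():
        # Assumption: the limit is measured in UTF-8 bytes, not characters
        size = len(str(row[field]).encode("utf-8"))
        if size > limit:
            problems.append(f"{field}: {size} bytes exceeds max_length {limit}")
    vector_length = len(row["description_embedding"])
    if vector_length != dim:
        problems.append(f"description_embedding: length {vector_length}, expected {dim}")
    return problems

good_row = {
    "name": "ABC",
    "category": "Tech",
    "description": "AI research and deployment organization",
    "description_embedding": [0.0] * 384,
}
bad_row = {
    "name": "A name that is far too long for the schema",
    "category": "Tech",
    "description": "Example description",
    "description_embedding": [0.0] * 3,
}

for row in (good_row, bad_row):
    print(validate_row(row))
```

The first call prints an empty list, and the second reports both the oversized `name` and the wrong embedding length.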


## Query
We will use the BitMap index in two different ways:
1. Querying and filtering
2. Filtering

### Query and filter

In [13]:
filter_expression = "category == 'Tech' and public == True"
question_text = "AI company"
question_vector = model.encode(question_text).tolist()

print("Embeddings:", question_vector)

Embeddings: [-0.08169576525688171, -0.05700130760669708, -0.008233768865466118, -0.00738370930776, 0.006671771872788668, -0.040133073925971985, 0.09387392550706863, 0.011836863122880459, 0.0327550433576107, -0.015086967498064041, -0.007696609944105148, -0.008191272616386414, 0.01079613622277975, -0.02455325610935688, -0.051503557711839676, 0.034926339983940125, -0.007094291970133781, -0.033282890915870667, -0.06825562566518784, -0.1306525021791458, -0.05625976622104645, -0.02847185917198658, -0.011119755916297436, -0.026763031259179115, -0.00579789187759161, 0.08918706327676773, 0.013626961968839169, -0.030455872416496277, 0.006723112892359495, -0.08422578126192093, -0.010452569462358952, -0.006856689229607582, 0.10541794449090958, -0.009761515073478222, -0.0034349907655268908, 0.055846184492111206, -0.08756192028522491, -0.027241693809628487, 0.074781134724617, -0.005701984744518995, -0.03196034952998161, -0.04344426468014717, 0.006464636884629726, -0.07108573615550995, 0.086317643523

In [14]:
collection.load()

res = collection.search(
    anns_field="description_embedding",
    data=[question_vector],  # Query vector(s) to search with
    param={
        "metric_type": "L2",  # Same metric as the vector index
        "params": {},
    },
    output_fields=["id","name", "category", "public","description"],
    expr=filter_expression,
    limit=10,
)

print("\nQuery results:")
for hits in res:
    for hit in hits:
        print(hit)


Query results:
id: 1, distance: 0.7781509160995483, entity: {'name': 'ABC', 'category': 'Tech', 'public': True, 'description': 'AI research and deployment organization', 'id': 1}
id: 29, distance: 1.3731958866119385, entity: {'name': 'GHI1', 'category': 'Tech', 'public': True, 'description': 'Professional networking platform', 'id': 29}
id: 11, distance: 1.3806469440460205, entity: {'name': 'EFG', 'category': 'Tech', 'public': True, 'description': 'Software development and cloud services', 'id': 11}
id: 13, distance: 1.4113643169403076, entity: {'name': 'KLM', 'category': 'Tech', 'public': True, 'description': 'Consumer electronics and software development', 'id': 13}
id: 23, distance: 1.5913578271865845, entity: {'name': 'OPQ', 'category': 'Tech', 'public': True, 'description': 'Social media and virtual reality technologies', 'id': 23}
id: 8, distance: 1.6900019645690918, entity: {'name': 'VWX', 'category': 'Tech', 'public': True, 'description': 'Electric vehicles and clean energy so

### Filtering

In [15]:
collection.load()

query_results = collection.query(
    expr=filter_expression,
    output_fields=["id","name", "category", "public","description"],
)

print("\nQuery results:")
for result in query_results:
    print(result)



Query results:
{'description': 'AI research and deployment organization', 'id': 1, 'name': 'ABC', 'category': 'Tech', 'public': True}
{'description': 'Electric vehicles and clean energy solutions', 'id': 8, 'name': 'VWX', 'category': 'Tech', 'public': True}
{'description': 'Software development and cloud services', 'id': 11, 'name': 'EFG', 'category': 'Tech', 'public': True}
{'description': 'Consumer electronics and software development', 'id': 13, 'name': 'KLM', 'category': 'Tech', 'public': True}
{'description': 'Social media and virtual reality technologies', 'id': 23, 'name': 'OPQ', 'category': 'Tech', 'public': True}
{'description': 'Professional networking platform', 'id': 29, 'name': 'GHI1', 'category': 'Tech', 'public': True}
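Filter expressions such as the one used above are plain strings, so they can be assembled from Python values when the conditions are not known in advance. The following is a minimal sketch of a hypothetical helper, `build_filter`, that joins equality conditions with `and`. It assumes that double-quoted string literals and lowercase `true`/`false` are accepted in filter expressions, and it covers only equality on strings, numbers, and booleans.

```python
import json

def build_filter(conditions):
    """Turn {"field": value} pairs into an 'a == x and b == y' expression string."""
    clauses = []
    for field, value in conditions.items():
        if isinstance(value, bool):
            # bool must be checked before int, because bool is a subclass of int
            literal = "true" if value else "false"
        elif isinstance(value, (int, float)):
            literal = str(value)
        elif isinstance(value, str):
            # json.dumps adds double quotes and escapes embedded quotes
            literal = json.dumps(value, ensure_ascii=False)
        else:
            raise TypeError(f"Unsupported value type for {field}: {type(value).__name__}")
        clauses.append(f"{field} == {literal}")
    return " and ".join(clauses)

expression = build_filter({"category": "Tech", "public": True})
print(expression)  # category == "Tech" and public == true
```

The resulting string could then be passed as the `expr` argument of `collection.query` or `collection.search`.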


## Dropping the collection

In [16]:
collection.drop()

## Conclusion
This notebook illustrates two ways to use a BitMap index: filtering combined with a similarity search (query and filter), and filtering on its own.

### Benefits of using BitMap Index
1. Fast pre-filtering - rows are filtered before the expensive similarity search, which reduces the number of distance calculations needed to find the most similar results.
2. Memory efficiency - for low-cardinality fields, bitmap indexes can be more space-efficient than traditional B-tree indexes.
3. Parallel processing - bitmap operations are easy to parallelize and can take advantage of modern CPU SIMD instructions, which makes them performant for large-scale filtering operations.
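The memory point depends strongly on cardinality, which a back-of-the-envelope calculation makes visible. The sketch below estimates the size of an uncompressed bitmap index as one bit per row for every distinct value. This is an assumed simplification for illustration: real implementations, Milvus included, may compress bitmaps or use a different layout, so the numbers are not measurements.

```python
import math

def uncompressed_bitmap_bytes(row_count, distinct_values):
    """One bit per row for every distinct value, rounded up to whole bytes per bitmap."""
    return distinct_values * math.ceil(row_count / 8)

# 30 rows and 11 distinct categories, as in this notebook's dataset
print(uncompressed_bitmap_bytes(30, 11))  # 44

# The same 11 categories over one million rows
print(uncompressed_bitmap_bytes(1_000_000, 11))  # 1375000

# A high-cardinality field over one million rows
print(uncompressed_bitmap_bytes(1_000_000, 100_000))  # 12500000000
```

Under this estimate the index grows linearly with the number of distinct values, which is why the benefit applies to low-cardinality fields.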