# Text Similarity Search with Milvus in watsonx.data

## Disclaimers
- Use only Projects and Spaces that are available in watsonx context.

## Notebook Overview
This notebook demonstrates text similarity search support in watsonx.data, introducing commands for:
- Connecting to Milvus
- Creating collections
- Creating indexes
- Ingesting data
- Data retrieval

**Note**: Some familiarity with Python is helpful. This notebook uses Python 3.11.

## About Milvus

Milvus is an open-source vector database designed specifically for scalable similarity search and AI applications. It's a powerful platform that enables efficient storage, indexing, and retrieval of vector embeddings, which are crucial in modern machine learning and artificial intelligence tasks.

### Milvus: Three Fundamental Steps

#### 1. Data Preparation
Collect and convert your data into high-dimensional vector embeddings. These vectors are typically generated using machine learning models like neural networks, which transform text, images, audio, or other data types into dense numerical representations that capture semantic meaning and relationships.

#### 2. Vector Insertion
Load the vector embeddings into Milvus collections or partitions within a database. Milvus creates indexes to optimize subsequent search operations, supporting various indexing algorithms like IVF-FLAT, HNSW, etc., based on the definition.

#### 3. Similarity Search
Perform vector similarity searches by providing a query vector. Milvus will rapidly return the most similar vectors from the collection or partitions based on the defined metrics like cosine similarity, Euclidean distance, or inner product.

## Key Workflow

1. **Definition** (once)
2. **Ingestion** (once)
3. **Retrieve relevant passage(s)** (for every user query)

## Notebook Contents

- Environment Setup
- Install packages
- Document data loading
- Create connection
- Ingest data
- Retrieve relevant data

## Environment Setup

Before using the sample code in this notebook, complete the following setup tasks:

- Create a Watsonx.data instance (a free plan is offered)
  - Information about creating a watsonx.data instance can be found [here](https://www.ibm.com/docs/en/watsonx/watsonxdata/2.0.x)



## Install required packages

### Install Pymilvus SDK

In [1]:
!pip install -U pymilvus

zsh:1: command not found: pip


Next, we install sentence transformer to generate vector embeddings for text data. We could alternatively use watsonx.ai embedding models instead if you have API Key from watsonx.ai and a Watson Machine Learning instance integration. 

In [2]:
!pip install sentence-transformers

zsh:1: command not found: pip


In [3]:

import pandas as pd

try:
    from sentence_transformers import SentenceTransformer
    transformer = SentenceTransformer('all-MiniLM-L6-v2')
except ImportError:
    raise ImportError("Could not import sentence_transformers: Please install sentence-transformers package.")


  from .autonotebook import tqdm as notebook_tqdm


In [4]:
from sentence_transformers import SentenceTransformer

transformer = SentenceTransformer('all-MiniLM-L6-v2')

## Create Milvus connection

Replace placeholder values <> with their respective provisioned Milvus values.

In [5]:
from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType,connections,utility

In [6]:
# On Prem
connections.connect(
            alias='default',
            host="<Milvus GRPC host On CPD>",
            port=443,
            secure=True,
            server_pem_path="<GRPC certificate path>",
            server_name="<Milvus GRPC host On CPD>",
            user="<user>",
            password="<password>")

In [7]:
# SaaS

connections.connect(
            alias='default',
            host="<Milvus GRPC host On SaaS",
            port="<port>",
            secure=True,
            user="<user>",
            password="<password>")

In [8]:
# Alternate syntex for the same

uri="https://<username>:<password>@<hostname>:<port>"
connections.connect(uri=uri)

From Milvus 2.4.0, 'MilvusClient' is introduced as a wrapper on existing methods. MilvusClient represents a client that connects to a specific Milvus instance. It serves as an easy-to-use alternative for handling Create, Read, Update, and Delete (CRUD) operations in Milvus. Code showing MilvusClient usage is part of a different notebook in the same repo.

## Load data

NOTE: The dataset we are using is already split into self-contained passages that can be ingested by Milvus. This preprocessing step is usually part of AI pipeline, outside of the scope of this demo.
You can either create your own dataset tailored to your specific needs or source one from reputable online repositories. Make sure it has the following columns.

In [9]:
test_data = pd.read_csv("../data/product_description_docs.csv")
test_data = test_data.head(100)
test_qna = pd.read_csv("../data/product_description_qna.csv")
test_data

Unnamed: 0,id,title,product_type,text
0,2765088,PRIKNIK Horn Red Electric Air Horn Compressor ...,7537,PRIKNIK Horn Red Electric Air Horn Compressor ...
1,1594019,ALISHAH Women's Cotton Ankle Length Leggings C...,2996,ALISHAH Women's Cotton Ankle Length Leggings C...
2,2152929,HINS Metal Bucket Shape Plant Pot for Indoor &...,5725,HINS Metal Bucket Shape Plant Pot for Indoor &...
3,2026580,Delavala Self Adhesive Kitchen Backsplash Wall...,6030,Delavala Self Adhesive Kitchen Backsplash Wall...
4,2998633,Hexwell Essential oil for Home Fragrance Oil A...,8201,Hexwell Essential oil for Home Fragrance Oil A...
...,...,...,...,...
95,2787933,"Whey Protein Isolate 24g Protein Per Serve,990...",11672,"Whey Protein Isolate 24g Protein Per Serve,990..."
96,1706246,Sleepwish Pink Caticorn Warm Sherpa Throw Blan...,1639,Sleepwish Pink Caticorn Warm Sherpa Throw Blan...
97,1891826,CityPostersPlus Kaptanganj Mouse pad,578,CityPostersPlus Kaptanganj Mouse pad[Mouse pad...
98,1535518,Wild Bobby Straight Outta Cleveland CLE Fan | ...,2879,Wild Bobby Straight Outta Cleveland CLE Fan | ...


In [10]:
# Define static properties

COLLECTION_NAME = "Milvus_collection"
DIMENSION = 384
BATCH_SIZE = 2
TOPK = 1
fmt = "=== {:30} ==="
search_latency_fmt = "search latency = {:.4f}s"

### Create Milvus Collection 

#### Check if collection exists, drop if exists already and create new.

In [11]:
if utility.has_collection(COLLECTION_NAME):
    utility.drop_collection(COLLECTION_NAME)

In [12]:
utility.has_collection(COLLECTION_NAME)

False

In [13]:

# object should be inserted in the format of (title, date, location, speech embedding)
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=65535),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=DIMENSION)
]
schema = CollectionSchema(fields=fields)
collection = Collection(name=COLLECTION_NAME, schema=schema)

In [14]:
# define user defined method for ingestion. You can define your own ingestion methods based on your data and schema.

def embed_insert(data: list):
    embeddings = transformer.encode(data[2])
    ins = [
        data[0],
        data[1],
        data[2],
        [x for x in embeddings]
        
    ]
    collection.insert(ins)

In [15]:
# Define parameters like index type, similarity metrics and nlist params
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128},
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()

## Insert Data

In [16]:
import sys
def batch_insert_data(data):
    data_batch = [[], [], []]

    batch_iteration = 0
    for index, row in data.iterrows():
        batch_iteration = batch_iteration + 1
        print(batch_iteration)
        data_batch[0].append(row["id"])
        data_batch[1].append(row["title"])
        data_batch[2].append(row["text"])
        if len(data_batch[0]) % BATCH_SIZE == 0:
            #print(len(data_batch[0]))
            embed_insert(data_batch)
            data_batch = [[], [], []]

    # Embed and insert the remainder
    if len(data_batch[0]) != 0:
        embed_insert(data_batch)

    # Call a flush to index any unsealed segments.
    collection.flush()

In [17]:
batch_insert_data(test_data.head(100))

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100


## Semantic Search

In [18]:
question_texts = [q.strip("?") + "?" for q in test_qna.iloc[5:10]['question'].tolist()]
print("\n".join(question_texts))

I want to buy special gift for my husband.?
Do you have movie stickers for kids?
I am looking for kitchen essentials?
Sports outfits for men?
Pink coloured dress for women?


In [19]:
def embed_search(data):
    embeds = transformer.encode(data) 
    return [x for x in embeds]

search_data = embed_search("I want to buy special gift for my husband.?")


In [20]:
import time

search_terms = question_texts

# Search the database based on input text
def embed_search(data):
    embeds = transformer.encode(data) 
    return [x for x in embeds]

search_data = embed_search(search_terms)

start = time.time()

res = collection.search(
    data=search_data,  # Embeded search value
    anns_field="embedding",  # Search across embeddings
    param={"metric_type": "L2",
            "params": {"nprobe": 10}},
    limit = 5,
    #limit = 1,# Limit to top_k results per search
    output_fields=["id","text","title"]  # Include title field in result
)
end = time.time()

result_ids = {}
for hits_i, hits in enumerate(res):
    hit_ids = []
    print("Question:", search_terms[hits_i])
    print("Search Time:", end-start)
    print("Results:")
    for hit in hits:
        hit_ids.append(hit.entity.get("id"))
        
        print( "Result for question", hits_i+1, "----",hit.entity.get("text"), "----", hit.distance)
    print("\n")
    result_ids[hits_i] = hit_ids

Question: I want to buy special gift for my husband.?
Search Time: 0.9227221012115479
Results:
Result for question 1 ---- arythe Romantic LED Light Valentine's Day Sign with Suction Cup Wedding Decor Marry Me Warm White ---- 1.3304294347763062
Result for question 1 ---- Suitcase Music Box, Mini Music Box Clockwork Music Box for Children ---- 1.3519076108932495
Result for question 1 ---- MJ Metals Jewelry White Ceramic Piano Keyboard 6mm Band Flat Pipe Cut High Polished Ring Size 12 ---- 1.473937749862671
Result for question 1 ---- Aqualens Moon Blue A-8065 Contact Lens Designer Case ---- 1.4865772724151611
Result for question 1 ---- K-Swiss 201 Classic Tennis Shoe (Infant/Toddler),Black/Black,5 M US Toddler ---- 1.57752525806427


Question: Do you have movie stickers for kids?
Search Time: 0.9227221012115479
Results:
Result for question 2 ---- Sandylion Marvel Heroes Foldover Stickers ---- 1.0077579021453857
Result for question 2 ---- One Piece Group SD Sticker Set Anime ---- 1.1542098

## Drop the collection

In [21]:
# Drop a specific collection
utility.drop_collection(
    collection_name="Milvus_collection",
)