MIT License - Non LLM

Copyright (c) [2025] [Daksh Gupta (aka Deepak Kumar Gupta)]

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

### Milvus Lite: Open Source Vector Database

<br/><br/>

#### Problem statement
<br/>

___Using an excerpt from a book, find the book's title and author___

<br/><br/>
<br/>

In [None]:
import os
# Disable Hugging Face tokenizers parallelism to avoid fork-related warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"


import pandas as pd

In [None]:
book_data = [
    {
        "title": "Moby Dick",
        "author": "Herman Melville",
        "excerpt": "Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation."
    },
    {
        "title": "Dune",
        "author": "Frank Herbert",
        "excerpt": "A beginning is the time for taking the most delicate care that the balances are correct. To begin your study of the life of Muad'Dib, then, take care that you first place him in his time: born in the 57th year of the Padishah Emperor, Shaddam IV. And take the most special care that you locate him in his place: the planet Arrakis, known as Dune."
    },
    {
        "title": "Nineteen Eighty-Four",
        "author": "George Orwell",
        "excerpt": "It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him."
    },
    {
        "title": "The Hobbit",
        "author": "J.R.R. Tolkien",
        "excerpt": "In a hole in the ground there lived a hobbit. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat: it was a hobbit-hole, and that means comfort."
    },
    {
        "title": "Meditations",
        "author": "Marcus Aurelius",
        "excerpt": "You have power over your mind - not outside events. Realize this, and you will find strength. The happiness of your life depends upon the quality of your thoughts: therefore, guard accordingly, and take care that you entertain no notions unsuitable to virtue and reasonable nature."
    },
    {
        "title": "A Study in Scarlet",
        "author": "Arthur Conan Doyle",
        "excerpt": "From a drop of water, a logician could infer the possibility of an Atlantic or a Niagara without having seen or heard of one or the other. So all life is a great chain, the nature of which is known whenever we are shown a single link of it. This is the science of deduction and analysis."
    },
    {
        "title": "Pride and Prejudice",
        "author": "Jane Austen",
        "excerpt": "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters."
    }
]


In [None]:
df = pd.DataFrame(book_data)

In [None]:
df

In [None]:
from pymilvus import MilvusClient, DataType
from sentence_transformers import SentenceTransformer

<br/><br/>

In [None]:
# Text embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

<br/><br/>
___Validate the model encoding and get the vector dimension___
<br/><br/>

In [None]:
vector_sample = model.encode("This is a test message", show_progress_bar=True)
DIMENSION = len(vector_sample)
print(f"Vector Dimension: {DIMENSION}")

<br/><br/>
___Create a local vector database___
<br/><br/>

In [None]:
vec_db_client = MilvusClient(uri='books.db')


<br/><br/>
___Create the DB schema___
<br/><br/>

In [None]:
vec_db_schema = MilvusClient.create_schema(
    auto_id=False,
    enable_dynamic_field=False,
)

<br/><br/>
___Add the primary key field to the schema___
<br/>
_Each book 'title' is unique in this dataset, so it can serve as the primary key_
<br/><br/>

In [None]:
vec_db_schema.add_field(field_name="title", datatype=DataType.VARCHAR, max_length=255, is_primary=True, auto_id=False)

<br/><br/>
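Because the primary key must be unique, it can help to confirm that no title repeats before relying on it. The following is a minimal sketch in plain Python; `find_duplicates` is a hypothetical helper, not part of pymilvus, and the titles are copied by hand from `book_data`.

```python
from collections import Counter

# Titles copied from book_data above.
titles = [
    "Moby Dick", "Dune", "Nineteen Eighty-Four", "The Hobbit",
    "Meditations", "A Study in Scarlet", "Pride and Prejudice",
]

def find_duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value, count in counts.items() if count > 1]

duplicates = find_duplicates(titles)
print("Duplicate titles:", duplicates)
```

With the dataframe in hand, the same check could be run as `find_duplicates(df["title"])`. A larger catalogue would likely need a different key, since distinct books can share a title.
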
___Add other fields to the schema___
<br/><br/>

In [None]:
vec_db_schema.add_field(field_name="author", datatype=DataType.VARCHAR, max_length=255)
vec_db_schema.add_field(field_name="excerpt", datatype=DataType.VARCHAR, max_length=1024)

<br/><br/>
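The `max_length` values above cap how long each string field may be. The sketch below checks a record against those caps before insertion. It assumes the limit is counted in UTF-8 bytes rather than characters, which is worth confirming against the Milvus documentation for your version. `utf8_length` and `oversized_fields` are hypothetical helpers, not pymilvus functions.

```python
def utf8_length(text):
    """Length of text in UTF-8 bytes, which can exceed len(text)."""
    return len(text.encode("utf-8"))

def oversized_fields(record, limits):
    """Return {field: byte_length} for every field that exceeds its limit."""
    return {
        field: utf8_length(record[field])
        for field, limit in limits.items()
        if utf8_length(record[field]) > limit
    }

# Limits mirror the max_length values used in the schema.
LIMITS = {"title": 255, "author": 255, "excerpt": 1024}

sample = {
    "title": "Moby Dick",
    "author": "Herman Melville",
    # \u2014 is the long dash from the original excerpt; it takes 3 bytes.
    "excerpt": "Some years ago\u2014never mind how long precisely",
}
print(len(sample["excerpt"]), utf8_length(sample["excerpt"]))
print(oversized_fields(sample, LIMITS))
```

Non-ASCII punctuation such as the dash in the Moby Dick excerpt makes the byte length larger than the character count, so a character-based check could pass a string that the database would still reject under this assumption.
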
___Add the field that holds the vector embeddings___
<br/><br/>


In [None]:
vec_db_schema.add_field(field_name="vector_emb", datatype=DataType.FLOAT_VECTOR, dim=DIMENSION)

<br/><br/>
___Create the vector embeddings and store them in the dataframe___
<br/><br/>

In [None]:
df['vector_emb'] = df['excerpt'].apply(lambda x: model.encode(x, show_progress_bar=True))

<br/><br/>

In [None]:
df

<br/><br/>
___Create the index parameters and add the database field that needs to be indexed___
<br/><br/>

In [None]:
index_params = MilvusClient.prepare_index_params()

index_params.add_index(
    field_name="vector_emb",
    index_type="IVF_FLAT",
    metric_type="COSINE",
    params={"nlist": 16}
)

<br/><br/>
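To give a rough sense of what an IVF-style index does with `nlist`, here is a toy sketch in plain Python. It is not Milvus code: vectors are grouped under their nearest centroid, and a query scans only the `nprobe` closest groups instead of every vector. The centroids are hard-coded for illustration (a real index learns them from the data), squared Euclidean distance stands in for the cosine metric, and all function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroids(vector, centroids, count):
    """Indices of the `count` centroids closest to `vector`."""
    order = sorted(
        range(len(centroids)),
        key=lambda i: squared_distance(vector, centroids[i]),
    )
    return order[:count]

def build_inverted_lists(vectors, centroids):
    """Group each vector name under the index of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for name, vec in vectors.items():
        cluster = nearest_centroids(vec, centroids, 1)[0]
        lists[cluster].append(name)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe):
    """Scan only the nprobe nearest clusters and return the best match."""
    candidates = [
        name
        for cluster in nearest_centroids(query, centroids, nprobe)
        for name in lists[cluster]
    ]
    return min(
        candidates,
        key=lambda name: squared_distance(query, vectors[name]),
        default=None,
    )

# Two hard-coded centroids play the role of nlist=2.
centroids = [(0.0, 0.0), (10.0, 10.0)]
vectors = {"a": (1.0, 1.0), "b": (0.0, 2.0), "c": (9.0, 9.0), "d": (11.0, 10.0)}

lists = build_inverted_lists(vectors, centroids)
print(lists)
print(ivf_search((8.0, 8.0), vectors, centroids, lists, nprobe=1))
```

With only seven books, partitioning brings no practical speed-up; the idea matters once a collection holds many thousands of vectors.
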
___Create a collection in the vector database using the schema and index parameters___
<br/><br/>

In [None]:
vec_db_client.create_collection(collection_name="books", schema=vec_db_schema, index_params=index_params)

<br/><br/>
___Insert the data into the collection___



In [None]:
vec_db_client.insert(collection_name="books", data=df.to_dict('records'))

<br/><br/>

___Now that the vector DB contains the data, we can perform a similarity search___

<br/><br/>
<br/><br/>

In [None]:
search_which_book = "bright cold day in April, and the clocks were striking thirteen."

<br/><br/>
___Encode the query text to vector___
<br/><br/>

In [None]:
query_vector = model.encode(search_which_book, show_progress_bar=True)

<br/><br/>
___Search for the two nearest matches___
<br/><br/>

In [None]:
search_result = vec_db_client.search(
    collection_name="books",
    data=[query_vector],
    anns_field="vector_emb",
    limit=2,
    output_fields=["title", "author"]
)

<br/><br/>
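Conceptually, a cosine similarity search scores the query against stored vectors and keeps the highest scores. The sketch below does this by brute force in plain Python on three-dimensional toy vectors. The vectors and names are invented for illustration, and this is not how Milvus implements search internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k):
    """Return the k (name, score) pairs with the highest similarity."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy stand-ins for the stored embeddings.
toy_vectors = {
    "book_a": [1.0, 0.0, 0.0],
    "book_b": [0.0, 1.0, 0.0],
    "book_c": [1.0, 1.0, 0.0],
}

matches = top_k([1.0, 0.1, 0.0], toy_vectors, k=2)
for name, score in matches:
    print(name, round(score, 3))
```

A score close to 1 means the two vectors point in nearly the same direction, so with the COSINE metric a higher value indicates a closer match.
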
___See and verify the results___
<br/><br/>

In [None]:
# Number of results; should be 2 because limit=2
len(search_result[0])

<br/><br/>

In [None]:
# First result and its cosine similarity score
print(search_result[0][0]['entity'])
print(search_result[0][0]['distance'])

<br/><br/>

In [None]:
# Second result and its cosine similarity score
print(search_result[0][1]['entity'])
print(search_result[0][1]['distance'])

<br/><br/>
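Printing each hit by index becomes repetitive as `limit` grows. The sketch below shows one way to flatten the hits for a single query into rows, assuming each hit exposes the `entity` and `distance` keys used in the cells above. `summarize_hits` is a hypothetical helper, and `mock_hits` is a hand-written stand-in whose scores are placeholders, not real search output.

```python
def summarize_hits(hits):
    """Turn one query's hits into (title, author, score) tuples."""
    return [
        (hit["entity"]["title"], hit["entity"]["author"], round(hit["distance"], 4))
        for hit in hits
    ]

# Hand-written stand-in for search_result[0]; the scores are made up.
mock_hits = [
    {"entity": {"title": "Nineteen Eighty-Four", "author": "George Orwell"}, "distance": 0.81},
    {"entity": {"title": "Dune", "author": "Frank Herbert"}, "distance": 0.12},
]

for title, author, score in summarize_hits(mock_hits):
    print(f"{score:.4f}  {title} by {author}")
```

In the notebook, the same helper could be called as `summarize_hits(search_result[0])`.
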
___Close the vector database___
<br/><br/>

In [None]:
vec_db_client.close()

<br/><br/>
<br/><br/>
<br/><br/>

___Reusing the database___

<br/><br/>
<br/>

In [None]:
vec_db_client_resuse = MilvusClient(uri='books.db')

In [None]:
search_which_book_again = " sandy hole with nothing in it to sit down on or to eat: it was a hobbit-hole"

In [None]:
# encode the query text to vector

query_vector_again = model.encode(search_which_book_again, show_progress_bar=True)

In [None]:
# search the vector database

search_result_again = vec_db_client_resuse.search(
    collection_name="books",
    data=[query_vector_again],
    anns_field="vector_emb",
    limit=5,
    output_fields=["title", "author"]
)

In [None]:
# display the first result and its cosine similarity score

print(search_result_again[0][0]['entity'])
print(search_result_again[0][0]['distance'])

In [None]:
# display the second result and its cosine similarity score

print(search_result_again[0][1]['entity'])
print(search_result_again[0][1]['distance'])


<br/><br/>