<a href="https://colab.research.google.com/github/KarAnalytics/code_demos/blob/main/VectorEmbedding_ISResearch_MilvusLite.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Vectors, the output data format of neural network models, can effectively encode information and play a pivotal role in AI applications such as knowledge bases, semantic search, Retrieval Augmented Generation (RAG), and more.

Milvus is an open-source vector database that suits AI applications of every size, from running a demo chatbot in a Jupyter notebook to building web-scale search that serves billions of users. In this guide, we will walk you through how to set up Milvus locally within minutes and use the Python client library to generate, store, and search vectors.

## Install Milvus
In this guide we use Milvus Lite, a Python library included in `pymilvus` that can be embedded into the client application. Milvus also supports deployment on [Docker](https://milvus.io/docs/install_standalone-docker.md) and [Kubernetes](https://milvus.io/docs/install_cluster-milvusoperator.md) for production use cases.

Before starting, make sure you have Python 3.8+ available in the local environment. Install `pymilvus`, which contains both the Python client library and Milvus Lite:

In [1]:
# Alternative: !pip install -U pymilvus
# Install the client together with Milvus Lite (quotes keep the shell from interpreting the brackets)
!pip install "pymilvus[milvus_lite]"
# Ignore the error listed at the bottom of the output; the package is still downloaded

!pip install "pymilvus[model]"
# This step asks you to restart the session. Restart it from the Runtime tab,
# then go to the Runtime tab again and click "Run before" to re-run the cells above this one.

# Install sentence-transformers (skipped if it is already installed)
!pip install sentence-transformers


Collecting pymilvus[milvus_lite]
  Downloading pymilvus-2.6.8-py3-none-any.whl.metadata (6.8 kB)
Collecting milvus-lite>=2.4.0 (from pymilvus[milvus_lite])
  Downloading milvus_lite-2.5.1-py3-none-manylinux2014_x86_64.whl.metadata (10.0 kB)
Downloading milvus_lite-2.5.1-py3-none-manylinux2014_x86_64.whl (55.3 MB)
Downloading pymilvus-2.6.8-py3-none-any.whl (300 kB)
Installing collected packages: milvus-lite, pymilvus
Successfully installed milvus-lite-2.5.1 pymilvus-2.6.8
Collecting pymilvus.model>=0.3.0 (from pymilvus[model])
  Downloading pymilvus_model-0.3.2-py3-none-any.whl.metadata (2.0 kB)
Collecting onnxruntime (from pymilvus.model>=0.3.0->pymilvus[model])
  Downloading onnxruntime-1.23.2-cp312-cp312-manylinux_2_27_x86_64.manylinux

> If you are using Google Colab, to enable dependencies just installed, you may need to **restart the runtime** (click on the "Runtime" menu at the top of the screen, and select "Restart session" from the dropdown menu).

## Set Up Vector Database
To create a local Milvus vector database, simply instantiate a `MilvusClient` by specifying a file name to store all data, such as "bsan765.db" in the cell below.

In [2]:
from pymilvus import MilvusClient


In [3]:
client = MilvusClient("bsan765.db")

## Create a Collection
In Milvus, we need a collection to store vectors and their associated metadata. You can think of it as a table in a traditional SQL database. When creating a collection, you can define schema and index parameters to configure vector specs such as dimensionality, index types, and distance metrics. There are also more advanced concepts for optimizing the index for vector search performance. For now, let's focus on the basics and use defaults wherever possible. At minimum, you only need to set the collection name and the dimension of the collection's vector field.

In [4]:
if client.has_collection(collection_name="ISResearch"):
    client.drop_collection(collection_name="ISResearch")
client.create_collection(
    collection_name="ISResearch",
    dimension=384
)

In the above setup,
- The primary key and vector fields use their default names ("id" and "vector").
- The metric type (vector distance definition) is set to its default value ([COSINE](https://milvus.io/docs/metric.md#Cosine-Similarity)).
- The primary key field accepts integers and does not auto-increment (that is, it does not use the [auto-id feature](https://milvus.io/docs/schema.md)).

Alternatively, you can formally define the schema of the collection by following these [instructions](https://milvus.io/api-reference/pymilvus/v2.4.x/MilvusClient/Collections/create_schema.md).
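
To make the default metric more concrete, here is a minimal pure-Python sketch of cosine similarity. The function name `cosine_similarity` and the toy two-dimensional vectors are illustrative assumptions and not part of the Milvus API; Milvus computes the metric internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (illustrative helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score close to 1; perpendicular vectors score close to 0.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because the score depends only on direction, scaling a vector does not change its similarity to another vector.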

## Prepare Data
In this guide, we use vectors to perform semantic search on text. To generate vectors for text, we download an embedding model, which is easy to do with the utility functions from the `pymilvus[model]` library.

## Represent text with vectors
The model library (`pymilvus[model]`) was installed in the first cell. This package includes essential ML tools such as PyTorch, so the download may take some time if PyTorch has never been installed in your local environment.

Next, generate vector embeddings. The cell below uses the sentence-transformers model `all-MiniLM-L6-v2` and leaves the default model commented out. Milvus expects inserted data to be organized as a list of dictionaries, where each dictionary represents a data record, termed an entity.
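
As a rough illustration of that record format, the sketch below builds a small list of dictionaries by hand. The helper `build_entities`, the sample titles, and the 4-dimensional toy vectors are assumptions made for illustration only; the collection in this guide expects 384-dimensional vectors under the default "id" and "vector" field names.

```python
def build_entities(ids, titles, vectors, dim):
    """Pair each id and title with its vector, checking the vector length (hypothetical helper)."""
    entities = []
    for entity_id, title, vector in zip(ids, titles, vectors):
        if len(vector) != dim:
            raise ValueError(f"entity {entity_id}: expected {dim} values, got {len(vector)}")
        # "id" and "vector" match the default field names; "Title" is extra metadata.
        entities.append({"id": entity_id, "vector": list(vector), "Title": title})
    return entities

toy_vectors = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
entities = build_entities([1, 2], ["First paper", "Second paper"], toy_vectors, dim=4)
print(entities[0])
```

Checking the vector length before inserting is a cheap way to catch a mismatch between the embedding model and the collection's `dimension` early.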

In [5]:
from pymilvus import model
import pandas as pd

# If the connection to https://huggingface.co/ fails, uncomment the following lines
# import os
# os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

# Alternative: the default embedding function downloads a small model, "paraphrase-albert-small-v2" (~50MB).
#embedding_fn = model.DefaultEmbeddingFunction()

# Use the sentence-transformers model "all-MiniLM-L6-v2", which produces 384-dimensional vectors
embedding_fn = model.dense.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [6]:
df = pd.read_csv("ISResearch.csv")

In [7]:
df.head()

|   | id | Year | Title | Abstract | URL | JournalFN |
|---|---|---|---|---|---|---|
| 0 | 1 | 2024 | Digital Approaches to Societal Grand Challenge... | Information systems (IS) scholars have pursued... | https://pubsonline.informs.org/doi/abs/10.1287... | Information Systems Research |
| 1 | 2 | 2024 | Mr. Right or Mr. Best: The Role of Information... | This paper examines the role of information in... | https://pubsonline.informs.org/doi/abs/10.1287... | Information Systems Research |
| 2 | 3 | 2024 | How Information Technology Overcomes Deficienc... | Innovation is vital for the growth of small an... | https://pubsonline.informs.org/doi/abs/10.1287... | Information Systems Research |
| 3 | 4 | 2024 | Strategic Expectation Setting of Delivery Time... | Delivery speed is an essential component of th... | https://pubsonline.informs.org/doi/abs/10.1287... | Information Systems Research |
| 4 | 5 | 2024 | User-Generated Content Shapes Judicial Reasoni... | Legal professionals have access to many differ... | https://pubsonline.informs.org/doi/abs/10.1287... | Information Systems Research |


In [8]:
df.shape

(227, 6)


In [10]:
# Embed every abstract and store each vector alongside its row in a new "vector" column
abstract = df['Abstract'].tolist()
df['vector'] = embedding_fn.encode_documents(abstract)

In [11]:
df.head()

|   | id | Year | Title | Abstract | URL | JournalFN | vector |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2024 | Digital Approaches to Societal Grand Challenge... | Information systems (IS) scholars have pursued... | https://pubsonline.informs.org/doi/abs/10.1287... | Information Systems Research | [-0.001679635, 0.014301069, -0.043375883, -0.0... |
| 1 | 2 | 2024 | Mr. Right or Mr. Best: The Role of Information... | This paper examines the role of information in... | https://pubsonline.informs.org/doi/abs/10.1287... | Information Systems Research | [-0.008911192, 0.031858947, 0.06066314, 0.0195... |
| 2 | 3 | 2024 | How Information Technology Overcomes Deficienc... | Innovation is vital for the growth of small an... | https://pubsonline.informs.org/doi/abs/10.1287... | Information Systems Research | [0.018558864, -0.039843395, 0.0014439853, -0.0... |
| 3 | 4 | 2024 | Strategic Expectation Setting of Delivery Time... | Delivery speed is an essential component of th... | https://pubsonline.informs.org/doi/abs/10.1287... | Information Systems Research | [0.024678467, -0.04312548, 0.08284558, 0.01446... |
| 4 | 5 | 2024 | User-Generated Content Shapes Judicial Reasoni... | Legal professionals have access to many differ... | https://pubsonline.informs.org/doi/abs/10.1287... | Information Systems Research | [-0.027353816, 0.00544577, -0.032940563, -0.07... |


In [12]:
## Milvus expects a list of dictionaries (one per row), so convert the DataFrame to records:
df_dict = df.to_dict(orient='records')

In [13]:
res = client.insert(collection_name="ISResearch", data=df_dict)

print(res)

{'insert_count': 227, 'ids': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216

## Semantic Search
Now we can run semantic searches by representing the search query text as a vector and conducting a vector similarity search on Milvus.

### Vector search
Milvus accepts one or multiple vector search requests at the same time. The value of the `query_vectors` variable is a list of vectors, where each vector is an array of floating-point numbers.
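
Conceptually, a vector search ranks the stored vectors by similarity to each query vector and returns the top matches. The sketch below does this by brute force in plain Python. It only illustrates the idea and is not how Milvus implements search; the function `top_k_by_cosine` and the toy two-dimensional data are assumptions.

```python
import math

def top_k_by_cosine(query, stored, k):
    """Return the k (id, score) pairs with the highest cosine similarity to query."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    scored = [(entity_id, cosine(query, vector)) for entity_id, vector in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
    return scored[:k]

# Toy "collection": id -> vector
stored = {1: [1.0, 0.0], 2: [0.7, 0.7], 3: [0.0, 1.0]}

# Two queries handled in one pass, mirroring a list of query vectors
queries = [[1.0, 0.1], [0.0, 1.0]]
results = [top_k_by_cosine(q, stored, k=2) for q in queries]
print(results)
```

Each query yields its own ranked list, which is the same one-list-per-query shape that a batch of search requests produces.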

In [14]:
query_vectors = embedding_fn.encode_queries(["Which papers mention blockchain?"])

res = client.search(
    collection_name="ISResearch",  # target collection
    data=query_vectors,  # query vectors
    limit=5,  # number of returned entities
    output_fields=["Title", "URL"],  # specifies fields to be returned
)

print(res)

data: [[{'id': 99, 'distance': 0.40516090393066406, 'entity': {'Title': 'Foundations of Decentralized Metaverse Economies: Converging Physical and Virtual Realities', 'URL': 'https://www.jmis-web.org/articles/1702'}}, {'id': 54, 'distance': 0.3404100835323334, 'entity': {'Title': 'Digitization of Transaction Terms within TCE: Strong Smart Contract as a New Mode of Transaction Governance', 'URL': 'https://misq.umn.edu/digitization-of-transaction-terms-shift-parameter-within-tce-strong-smart-contract-as-a-new-mode-of-transaction-governance.html'}}, {'id': 157, 'distance': 0.3348878026008606, 'entity': {'Title': 'Digital Transformation of Academic Publishing:  A Call for the Decentralization and Democratization of Academic Journals', 'URL': 'https://aisel.aisnet.org/jais/vol25/iss1/1'}}, {'id': 186, 'distance': 0.32962146401405334, 'entity': {'Title': 'Uncovering the Structural Assurance Mechanisms  in Blockchain Technology-Enabled Online Healthcare Mutual Aid Platforms', 'URL': 'https://

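The output is a list with one entry per vector search query. Each entry is a list of hits, where each hit contains the entity's primary key, the distance to the query vector, and the entity details for the specified `output_fields`. With the COSINE metric, the reported distance is a similarity score, so larger values indicate closer matches; this is why the hits above appear in descending order.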
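
A hedged sketch of walking that nested structure follows. The `sample_results` literal mimics the shape of the printed output using its first two hits, and `flatten_hits` is a hypothetical helper name. The same loop should work on `res`, assuming it behaves like a list of lists of dictionaries, as the printed output suggests.

```python
def flatten_hits(results):
    """Turn nested search results into flat (query_index, id, distance, title) rows."""
    rows = []
    for query_index, hits in enumerate(results):  # one list of hits per query
        for hit in hits:
            title = hit["entity"].get("Title")  # None if the field was not requested
            rows.append((query_index, hit["id"], hit["distance"], title))
    return rows

# Hand-written sample shaped like the search output shown above (first two hits only)
sample_results = [[
    {"id": 99, "distance": 0.40516090393066406,
     "entity": {"Title": "Foundations of Decentralized Metaverse Economies: Converging Physical and Virtual Realities"}},
    {"id": 54, "distance": 0.3404100835323334,
     "entity": {"Title": "Digitization of Transaction Terms within TCE: Strong Smart Contract as a New Mode of Transaction Governance"}},
]]

rows = flatten_hits(sample_results)
for row in rows:
    print(row)
```

Flat rows like these are convenient to load into a DataFrame for inspection.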

## Load Existing Data
Since all Milvus Lite data is stored in a local file, you can load it back into memory even after the program terminates by creating a `MilvusClient` with the existing file. For example, the following recovers the collections from the "bsan765.db" file and lets you continue writing data into it.

In [15]:
from pymilvus import MilvusClient

client = MilvusClient("bsan765.db")

## Drop the collection
If you would like to delete all the data in a collection, you can drop the collection as follows:

In [16]:
# Drop collection
client.drop_collection(collection_name="ISResearch")