# Milvus Tutorial

### Imports

In [1]:
from pymilvus import MilvusClient # Import Milvus client

### Connect to database

In [2]:
client = MilvusClient(uri="http://localhost:19530", token="root:Milvus") # Instantiate the Milvus client (connects to the server at localhost:19530)

### Create collection

Then, create a collection. A collection stores vectors and their associated metadata. A schema and index parameters can be defined to configure things like dimensionality, index type, and distance metric.

In [3]:
if client.has_collection(collection_name="demo_collection"): # If the collection already exists,
    client.drop_collection(collection_name="demo_collection") # drop it so the demo starts clean
client.create_collection( # Create the collection in the database
    collection_name="demo_collection", # Specify the collection name
    dimension=768,  # The vectors used in this demo have 768 dimensions
) # Many other options can be set, but for now let's use the defaults

In the above setup, the primary key and vector fields use their default names ("id" and "vector"). The distance metric is COSINE. The primary key accepts integers and is not auto-incremented.
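To make the COSINE metric concrete, here is a minimal pure-Python sketch of cosine similarity. The function name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of pymilvus; Milvus computes the metric internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 2-dimensional vectors (the collection above uses 768 dimensions).
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal
```

Because the value depends only on direction, scaling a vector does not change its similarity to another vector.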

### Generate embeddings

In this tutorial, we are going to use text embeddings. Thus, we need a text embedding model.

In [4]:
from sentence_transformers import SentenceTransformer

embedding_fn = SentenceTransformer('sentence-transformers/paraphrase-albert-small-v2')

# Text strings to search from.
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]

vectors = embedding_fn.encode(docs) # We vectorize each of the sentences

# The output vector has 768 dimensions, matching the collection that we just created.
print("Dim:", vectors[0].shape)


Dim: (768,)


### Insert embeddings in database

In [5]:
# Each entity has id, vector representation, raw text, and a subject label that we use
# to demo metadata filtering later.
data = [
    {"id": i, "vector": vectors[i], "text": docs[i], "subject": "history"} # We include a field called subject (which is actually the same for every entry)
    for i in range(len(vectors)) # Note that the actual text is inserted as a field as well.
]

print("Data has", len(data), "entities, each with fields: ", data[0].keys())
print("Vector dim:", len(data[0]["vector"]))

Data has 3 entities, each with fields:  dict_keys(['id', 'vector', 'text', 'subject'])
Vector dim: 768
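Before inserting, it can help to confirm that every row has the expected fields and the right vector length. The sketch below is one possible check; `find_row_problems` is a hypothetical helper, not a pymilvus function, and the toy rows use 4 dimensions instead of 768 for brevity.

```python
def find_row_problems(rows, dim, required=("id", "vector")):
    """Return a list of human-readable problems found in the rows (empty if none)."""
    problems = []
    for pos, row in enumerate(rows):
        missing = [name for name in required if name not in row]
        if missing:
            problems.append(f"row {pos}: missing fields {missing}")
        elif len(row["vector"]) != dim:
            problems.append(
                f"row {pos}: vector has {len(row['vector'])} dimensions, expected {dim}"
            )
    return problems


# Toy rows: the second one has a vector that is too short.
rows = [
    {"id": 0, "vector": [0.1, 0.2, 0.3, 0.4], "text": "ok"},
    {"id": 1, "vector": [0.1, 0.2], "text": "too short"},
]
print(find_row_problems(rows, dim=4))
```

For the data built above, the equivalent call would be `find_row_problems(data, dim=768)`.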


In [6]:
data[2]

{'id': 2,
 'vector': array([ 2.64879107e-01, -3.88542831e-01, -1.05445337e+00, -2.46015340e-01,
         1.62731543e-01,  7.47957230e-01,  5.21558225e-01,  7.92254508e-01,
         2.69304484e-01,  2.56194808e-02, -7.88767397e-01,  3.01118493e-01,
        -8.35242391e-01, -5.09805441e-01, -9.15115654e-01, -1.92508772e-01,
        -1.48965582e-01,  2.85092771e-01, -7.27171972e-02,  6.94321513e-01,
        -4.88077998e-01, -8.53572488e-01, -3.96034569e-02,  1.02947168e-01,
         5.25055647e-01,  1.24397211e-01, -3.74307573e-01, -2.12436602e-01,
        -3.40664126e-02,  8.25250447e-02, -2.26377010e-01, -3.31184030e-01,
         1.26333237e-01,  2.76342332e-01,  2.98249990e-01, -2.28520662e-01,
        -3.99868429e-01, -9.13797766e-02,  3.00882995e-01, -5.82711548e-02,
        -5.50903082e-01, -1.47391811e-01, -6.20579958e-01, -1.75256819e-01,
        -2.28549302e-01, -3.18194360e-01,  1.91852018e-01,  1.51335642e-01,
         1.97326854e-01,  1.30389139e-01, -3.59012932e-01, -2.433136

Now, insert the data into the collection.

In [7]:
res = client.insert(collection_name="demo_collection", data=data)

print(res)

{'insert_count': 3, 'ids': [0, 1, 2], 'cost': 0}


### Perform a search

We can now search the collection: embed the query text and retrieve the entities whose vectors are closest to it.

In [9]:
query_vectors = embedding_fn.encode(["Who is Alan Turing?"]) # Query text to embedding

res = client.search(
    collection_name="demo_collection",  # target collection
    data=query_vectors, # query vectors (as list)
    limit=2, # number of returned entities
    output_fields=["text", "subject"],  # specifies fields to be returned
)

print(res)


data: ["[{'id': 2, 'distance': 0.5859944820404053, 'entity': {'text': 'Born in Maida Vale, London, Turing was raised in southern England.', 'subject': 'history'}}, {'id': 1, 'distance': 0.511825680732727, 'entity': {'text': 'Alan Turing was the first person to conduct substantial research in AI.', 'subject': 'history'}}]"] , extra_info: {'cost': 0}


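The output is a list with one entry per query vector, and each entry is a list of hits. Each hit carries the id, the distance, and the entity, which is a dictionary containing the requested output fields. With the COSINE metric, a larger distance value means a closer match.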
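The nesting of the search output can be walked as follows. This sketch uses a plain Python literal copied from the printed result rather than a live `client.search` call, and `flatten_hits` is a hypothetical helper introduced only for illustration.

```python
def flatten_hits(res):
    """Yield (query_index, id, distance, entity) for every hit in a search result."""
    for query_index, hits in enumerate(res):
        for hit in hits:
            yield query_index, hit["id"], hit["distance"], hit["entity"]


# One inner list of hits per query vector, mirroring the output printed above.
res = [
    [
        {
            "id": 2,
            "distance": 0.5859944820404053,
            "entity": {
                "text": "Born in Maida Vale, London, Turing was raised in southern England.",
                "subject": "history",
            },
        },
        {
            "id": 1,
            "distance": 0.511825680732727,
            "entity": {
                "text": "Alan Turing was the first person to conduct substantial research in AI.",
                "subject": "history",
            },
        },
    ]
]

for query_index, hit_id, distance, entity in flatten_hits(res):
    print(query_index, hit_id, round(distance, 3), entity["text"])
```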

### Perform filtering

We can perform vector search with metadata filtering.

In [11]:
# Insert more docs in another subject.
docs = [
    "Machine learning has been used for drug design.",
    "Computational synthesis with AI algorithms predicts molecular properties.",
    "DDR1 is involved in cancers and fibrosis.",
]
vectors = embedding_fn.encode(docs)
data = [
    {"id": 3 + i, "vector": vectors[i], "text": docs[i], "subject": "biology"} # Different subject
    for i in range(len(vectors))
]

client.insert(collection_name="demo_collection", data=data) # Insert

# This excludes any text in the "history" subject, even if it is close to the query vector.
res = client.search(
    collection_name="demo_collection",
    data=embedding_fn.encode(["tell me AI related information"]),
    filter="subject == 'biology'", # This is the filter
    limit=2,
    output_fields=["text", "subject"],
)

print(res)

data: ["[{'id': 4, 'distance': 0.2703056335449219, 'entity': {'text': 'Computational synthesis with AI algorithms predicts molecular properties.', 'subject': 'biology'}}, {'id': 3, 'distance': 0.16425906121730804, 'entity': {'text': 'Machine learning has been used for drug design.', 'subject': 'biology'}}]"] , extra_info: {'cost': 0}
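Conceptually, a filtered search restricts the candidates with the metadata predicate and ranks only the survivors by similarity. The brute-force sketch below illustrates that semantics on toy 2-dimensional vectors; it is not how Milvus executes the request internally, and `filtered_search` and `cosine` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def filtered_search(rows, query, predicate, limit):
    """Keep rows that satisfy the predicate, then return the `limit` most similar."""
    candidates = [row for row in rows if predicate(row)]
    ranked = sorted(candidates, key=lambda row: cosine(row["vector"], query), reverse=True)
    return ranked[:limit]


rows = [
    {"id": 0, "vector": [1.0, 0.0], "subject": "history"},  # closest to the query
    {"id": 1, "vector": [0.9, 0.1], "subject": "biology"},
    {"id": 2, "vector": [0.0, 1.0], "subject": "biology"},
]

hits = filtered_search(rows, [1.0, 0.0], lambda row: row["subject"] == "biology", limit=2)
print([hit["id"] for hit in hits])
```

Row 0 is the nearest neighbour of the query, yet it never appears in the result because the predicate removes it before ranking.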


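A query is different from a search: it retrieves entities by a metadata filter or by primary key, without ranking them by vector similarity.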

In [77]:
res = client.query(
    collection_name="demo_collection",
    filter="subject == 'history'",
    output_fields=["text", "subject"],
)

In [78]:
res

data: ["{'text': 'Artificial intelligence was founded as an academic discipline in 1956.', 'subject': 'history', 'id': 0}", "{'text': 'Alan Turing was the first person to conduct substantial research in AI.', 'subject': 'history', 'id': 1}", "{'text': 'Born in Maida Vale, London, Turing was raised in southern England.', 'subject': 'history', 'id': 2}"] , extra_info: {'cost': 0}

In [79]:
res = client.query(
    collection_name="demo_collection",
    ids=[0, 2],
    output_fields=["vector", "text", "subject"],
)

In [80]:
res

data: ["{'vector': [0.010727894, -0.035895176, 0.018749746, 0.016348809, 0.03651688, 0.0035882127, -0.0004004981, 0.028529339, 0.0022745496, 0.0018362562, 0.004225859, 0.02717393, -0.0036843517, 0.03079148, 0.0045053945, 0.0442281, 0.010503842, -0.029494502, -0.06707345, -0.02052648, 0.015322829, -0.0060049067, -0.0622855, -0.039614785, 0.014206286, 0.032707665, -0.020834554, -0.044174287, -0.028339839, 0.029424427, -0.028087234, -0.020809026, 0.017159794, 0.0021116603, 0.021823755, -0.001577665, -0.037696734, 0.041460764, -0.02505642, 0.083336696, -0.015979158, 0.009813834, -0.026605438, 0.00061898713, 0.0037358247, -0.034155212, 0.058708236, -0.023721864, 0.0067459764, -0.03584181, -0.017560031, 0.022803081, 0.0026646194, 0.02511928, 0.046945184, 0.012622784, 0.018337477, -0.0071655777, 0.042811263, 0.0050310283, 0.05705552, -0.014866053, 0.10045472, 0.0064573605, -0.06832718, -0.016902734, 0.011977218, -0.0015766147, -0.022061639, 0.021156568, 0.04071084, -0.03600993, 0.019723987, -

### Delete data

Entities can be deleted either by primary key or with a filter expression.

In [81]:
res = client.delete(collection_name="demo_collection", ids=[0, 2])

print(res)

res = client.delete(
    collection_name="demo_collection",
    filter="subject == 'biology'",
)

print(res)


{'delete_count': 2, 'cost': 0}
{'delete_count': 3, 'cost': 0}


In [82]:
res = client.query(
    collection_name="demo_collection",
    filter="subject == 'history' || subject == 'biology'",
    output_fields=["text", "subject"],
)

print(res)

data: ["{'text': 'Alan Turing was the first person to conduct substantial research in AI.', 'subject': 'history', 'id': 1}"] , extra_info: {'cost': 0}


### Reconnect

All the operations performed above have been persisted by the Milvus server. So, after reconnecting, the information is still there.

In [83]:
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530", token="root:Milvus")


In [84]:
res = client.query(
    collection_name="demo_collection",
    filter="subject == 'history' || subject == 'biology'",
    output_fields=["text", "subject"],
)

print(res)

data: ["{'text': 'Alan Turing was the first person to conduct substantial research in AI.', 'subject': 'history', 'id': 1}"] , extra_info: {'cost': 0}


### Drop collection

Finally, if we want to drop the whole collection:

In [85]:
# Drop collection
client.drop_collection(collection_name="demo_collection")


In [86]:
res = client.query(
    collection_name="demo_collection",
    filter="subject == 'history' || subject == 'biology'",
    output_fields=["text", "subject"],
)

print(res)

RPC error: [describe_collection], <DescribeCollectionException: (code=100, message=can't find collection[database=default][collection=demo_collection])>, <Time:{'RPC start': '2024-06-25 12:38:53.224521', 'RPC error': '2024-06-25 12:38:53.226556'}>
Failed to describe collection: demo_collection


DescribeCollectionException: <DescribeCollectionException: (code=100, message=can't find collection[database=default][collection=demo_collection])>
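The error above is expected, since the collection no longer exists. One way to avoid it is to check `has_collection` (used at the start of this tutorial) before querying. The sketch below assumes a client object with `has_collection` and `query` methods; `query_if_exists` is a hypothetical helper, and `StubClient` is an offline stand-in so the example runs without a server.

```python
def query_if_exists(client, collection_name, **query_kwargs):
    """Run client.query only when the collection exists; otherwise return None."""
    if not client.has_collection(collection_name=collection_name):
        return None
    return client.query(collection_name=collection_name, **query_kwargs)


class StubClient:
    """Offline stand-in exposing the two methods used above (not part of pymilvus)."""

    def __init__(self, collections):
        self.collections = collections  # maps collection name -> list of rows

    def has_collection(self, collection_name):
        return collection_name in self.collections

    def query(self, collection_name, **kwargs):
        return self.collections[collection_name]


stub = StubClient({"other_collection": [{"id": 1}]})
print(query_if_exists(stub, "demo_collection"))   # collection is missing
print(query_if_exists(stub, "other_collection"))  # collection exists
```

With the real client, the same call would look like `query_if_exists(client, "demo_collection", filter="subject == 'history'", output_fields=["text", "subject"])`.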