Welcome to this interactive session on Milvus, the open-source vector database! In this notebook, you'll learn the core operations of Milvus by interacting with **Milvus Lite**, a lightweight version that runs directly in this Colab environment.

### Prerequisites

* **Google Colab:** No local setup required, just your web browser!
* **Basic Python knowledge:** Familiarity with Python syntax and data structures.

### Step 0: Install Libraries 📦

First, we need to install the necessary Python libraries.
* `pymilvus`: The official Python SDK for Milvus. It includes Milvus Lite.
* `sentence-transformers`: A library to easily generate vector embeddings from text.

In [None]:
# Install pymilvus with Milvus Lite
!pip install pymilvus==2.4.10

# Install sentence-transformers for generating embeddings
!pip install sentence-transformers

print("Libraries installed successfully! 🎉")
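
If you want to confirm which versions were actually installed (useful when results differ between runs), a minimal sketch using only the standard library might look like this. The helper name `installed_version` is hypothetical and not part of either library.

```python
from importlib import metadata

def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

# Report the two packages this notebook depends on.
for name in ("pymilvus", "sentence-transformers"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If either line reports `not installed`, re-run the install cell before continuing.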

### Step 1: Initialize Milvus Lite & Connect 🤝

Milvus Lite runs as an embedded process. When you create a `MilvusClient` instance and specify a local file path as the URI, Milvus Lite will automatically initialize and store all its data in that file.

In [None]:
from pymilvus import MilvusClient, DataType, FieldSchema, CollectionSchema, MilvusException
from sentence_transformers import SentenceTransformer
import os

# Define a file path for Milvus Lite to store its data.
# This file will be created in your Colab environment.
DB_FILE_PATH = "./milvus_demo.db"
# Remove the database file if it exists from a previous run
if os.path.exists(DB_FILE_PATH):
    os.remove(DB_FILE_PATH)
    print(f"Removed existing database file: {DB_FILE_PATH}")

# Initialize MilvusClient. This starts Milvus Lite.
client = MilvusClient(uri=DB_FILE_PATH)

print(f"Connected to Milvus Lite, data will be stored in: {DB_FILE_PATH} ✅")
print(f"Milvus client connected: {client is not None}")

DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: 285ee9c7dc184fb0b2becda788824507


Connected to Milvus Lite, data will be stored in: ./milvus_demo.db ✅
Milvus client connected: True


### Step 2: Define Collection Schema & Create Collection 🧩

In Milvus, data is organized into **Collections**, which are similar to tables in a relational database. Each collection has a **schema** defining its fields, including at least one primary key and one vector field.

In [None]:
COLLECTION_NAME = "milvus_training_collection"
DIMENSION = 384 # This dimension is specific to the 'all-MiniLM-L6-v2' model

# Define the schema for our collection
# 1. 'id': A unique identifier for each document (primary key).
# 2. 'raw_text': The original text content (metadata).
# 3. 'raw_text_vector': The vector representation of the text.
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="raw_text", dtype=DataType.VARCHAR, max_length=2048),
    FieldSchema(name="category", dtype=DataType.VARCHAR, max_length=100),
    FieldSchema(name="raw_text_vector", dtype=DataType.FLOAT_VECTOR, dim=DIMENSION)
]

# Create the CollectionSchema object
schema = CollectionSchema(
    fields,
    description="A collection to store text documents and their embeddings for semantic search."
)

# Check if the collection already exists and drop it if so, to ensure a clean start
if client.has_collection(collection_name=COLLECTION_NAME):
    client.drop_collection(collection_name=COLLECTION_NAME)
    print(f"Dropped existing collection: {COLLECTION_NAME}")

# Create the collection in Milvus
client.create_collection(
    collection_name=COLLECTION_NAME,
    schema=schema,
    auto_id=True
)

print(f"Collection '{COLLECTION_NAME}' created successfully! 📝")

# Verify collection creation
print("\nCollections in Milvus:")
print(client.list_collections())

DEBUG:pymilvus.milvus_client.milvus_client:Successfully created collection: milvus_training_collection


Dropped existing collection: milvus_training_collection
Collection 'milvus_training_collection' created successfully! 📝

Collections in Milvus:
['milvus_training_collection']
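
To make the role of the schema concrete, here is a rough, pure-Python sketch of the kind of checks a schema implies for each row. It does not call Milvus; `SCHEMA_SKETCH` and `validate_row` are hypothetical names, the `id` field is left out because it is auto-generated, and measuring `max_length` in UTF-8 bytes is an assumption about how the limit is counted.

```python
# Hypothetical, simplified mirror of the schema defined above (not a pymilvus object).
SCHEMA_SKETCH = {
    "raw_text": {"type": str, "max_length": 2048},
    "category": {"type": str, "max_length": 100},
    "raw_text_vector": {"type": list, "dim": 384},
}

def validate_row(row, schema=SCHEMA_SKETCH):
    """Return a list of human-readable problems; an empty list means the row looks valid."""
    problems = []
    for name, rule in schema.items():
        if name not in row:
            problems.append(f"missing field: {name}")
            continue
        value = row[name]
        if not isinstance(value, rule["type"]):
            problems.append(f"{name}: expected {rule['type'].__name__}")
            continue
        # Assumption: the VARCHAR limit is counted in UTF-8 bytes.
        if "max_length" in rule and len(value.encode("utf-8")) > rule["max_length"]:
            problems.append(f"{name}: longer than {rule['max_length']} bytes")
        if "dim" in rule and len(value) != rule["dim"]:
            problems.append(f"{name}: expected {rule['dim']} dimensions, got {len(value)}")
    return problems

good_row = {"raw_text": "A quiet inn.", "category": "budget", "raw_text_vector": [0.0] * 384}
bad_row = {"raw_text": "A quiet inn.", "raw_text_vector": [0.0] * 3}

print(validate_row(good_row))  # no problems expected
print(validate_row(bad_row))   # missing category, wrong vector dimension
```

Running a check like this before inserting can give clearer error messages than a failed insert, at the cost of duplicating the schema in two places.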


### Step 3: Generate Embeddings and Insert Data ➕

Before inserting text into Milvus, we need to convert it into **vector embeddings**. We'll use a pre-trained `SentenceTransformer` model for this.

In [None]:
# Load a pre-trained sentence embedding model
# 'all-MiniLM-L6-v2' is a good balance of performance and size for demos.
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Sentence embedding model loaded. 🧠")

hotel_descriptions = [
    {
        "text": "Experience unparalleled luxury at The Grand Monarch Hotel, where every detail is crafted for your supreme comfort. Our spacious suites offer panoramic city views, plush bedding, and state-of-the-art amenities. Indulge in gourmet dining, unwind at the rooftop infinity pool, or rejuvenate at our exclusive spa. Located in the heart of the city's vibrant district, perfect for both business and leisure.",
        "category": "luxury"
    },
    {
        "text": "Welcome to Serenity Suites, a boutique hotel designed for ultimate relaxation. Nestled amidst lush gardens, it features elegantly appointed rooms with private balconies. Enjoy farm-to-table dining, a serene yoga studio, and personalized service. A tranquil escape just minutes from the bustling city center.",
        "category": "luxury"
    },
    {
        "text": "The Business Elite Hotel offers premium accommodations with a focus on productivity and comfort. Each room includes a dedicated workspace, high-speed internet, and ergonomic chairs. Our facilities boast multiple meeting rooms, a 24/7 fitness center, and express check-in/check-out. Ideal for the discerning corporate traveler.",
        "category": "mid-range"
    },
    {
        "text": "Paradise Beach Resort is your gateway to an unforgettable beachfront vacation. Enjoy direct access to pristine sands, crystal-clear waters, and a variety of water sports. Our family-friendly resort features multiple pools, kids' clubs, and diverse dining options. Spacious ocean-view bungalows await.",
        "category": "luxury"
    },
    {
        "text": "Artisan Hotel, a modern marvel blending contemporary design with exceptional service. Located in the cultural quarter, steps away from galleries and theaters. Rooms feature unique art pieces, smart home technology, and rain showers. Enjoy our trendy lounge bar and a curated breakfast.",
        "category": "mid-range"
    },
    {
        "text": "City Stay Inn provides comfortable and affordable lodging in a convenient downtown location. Rooms are clean and include basic amenities like Wi-Fi and a flat-screen TV. We offer a continental breakfast and easy access to public transportation and local attractions.",
        "category": "budget"
    },
    {
        "text": "The Travelers Rest offers a no-frills, functional stay for budget-conscious visitors. Our rooms are modest but clean, with essential comforts. Free parking and a small diner on-site. Good for an overnight stop.",
        "category": "budget"
    },
    {
        "text": "Riverside Hotel offers standard rooms with river views, located slightly outside the main city hustle. Features include a modest restaurant and a small gym. A practical choice for those seeking quietude.",
        "category": "mid-range"
    },
    {
        "text": "Budget Lodge is a very basic accommodation suitable for short stays. Rooms are sparse and show some wear. Limited amenities, and the area can be noisy at night. Parking is a challenge.",
        "category": "poor"
    },
    {
        "text": "The Old Inn offers outdated rooms with minimal comfort. Expect worn furnishings and a strong musty smell. Maintenance is clearly an issue. Not recommended for extended stays.",
        "category": "poor"
    }
]

# Generate embeddings for the documents
documents = [doc["text"] for doc in hotel_descriptions]
document_embeddings = model.encode(documents).tolist()
print(f"Generated {len(document_embeddings)} embeddings. 📊")

# Prepare data for insertion: a list of dictionaries, each representing an entity (row)
# The keys in the dictionary must match the field names in our collection schema.
data_to_insert = []
for i, (hotel_desc, embedding) in enumerate(zip(hotel_descriptions, document_embeddings)):
    data_to_insert.append({
        # "id": i + 1, # Manual ID, only if you disable auto_id and control ID creation yourself
        "raw_text": hotel_desc['text'],
        "category": hotel_desc['category'],
        "raw_text_vector": embedding
    })

# Insert data into the collection
insert_result = client.insert(
    collection_name=COLLECTION_NAME,
    data=data_to_insert
)

print(f"\nInserted {insert_result['insert_count']} entities. IDs: {insert_result['ids']} ✨")

# Get collection statistics to verify insertions
stats = client.get_collection_stats(collection_name=COLLECTION_NAME)
print(f"\nCollection '{COLLECTION_NAME}' has {stats['row_count']} entities. 📈")

Sentence embedding model loaded. 🧠
Generated 10 embeddings. 📊

Inserted 10 entities. IDs: [459496067010396200, 459496067010396201, 459496067010396202, 459496067010396203, 459496067010396204, 459496067010396205, 459496067010396206, 459496067010396207, 459496067010396208, 459496067010396209] ✨

Collection 'milvus_training_collection' has 10 entities. 📈
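
Embeddings are useful because texts with similar meaning tend to map to vectors that point in similar directions. As an illustration only, the sketch below computes cosine similarity on three made-up 3-dimensional vectors; the real embeddings in this notebook have 384 dimensions, and the toy values here are invented, not produced by the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings", invented for illustration only.
luxury_spa = [0.9, 0.1, 0.0]
boutique_retreat = [0.8, 0.2, 0.1]
budget_motel = [0.0, 0.2, 0.9]

similar = cosine_similarity(luxury_spa, boutique_retreat)
dissimilar = cosine_similarity(luxury_spa, budget_motel)
print(f"luxury_spa vs boutique_retreat: {similar:.3f}")
print(f"luxury_spa vs budget_motel:     {dissimilar:.3f}")
```

The first pair scores much higher than the second, which is the property the search steps below rely on.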


### Step 4: Create an Index & Load Collection 🚀

To enable fast similarity search, Milvus needs an **index** on the vector field. After creating the index, the collection must be **loaded** into memory for queries.

In [None]:
# Define index parameters
# We request an IVF_FLAT index, which groups vectors into clusters.
# 'nlist' is the number of clusters for IVF_FLAT.
# 'COSINE' (cosine similarity) is the metric type used throughout this notebook.
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="raw_text_vector",
    index_type="IVF_FLAT",
    metric_type="COSINE", # Other options: "L2" (Euclidean distance), "IP" (Inner Product)
    params={"nlist": 128} # Number of clusters for IVF_FLAT
)

# Create the index
client.create_index(
    collection_name=COLLECTION_NAME,
    index_params=index_params
)

print(f"Index created on collection '{COLLECTION_NAME}' for 'raw_text_vector' field. 🔑")

# Load the collection into memory for searching
client.load_collection(collection_name=COLLECTION_NAME)
print(f"Collection '{COLLECTION_NAME}' loaded into memory. 💾")

DEBUG:pymilvus.milvus_client.milvus_client:Successfully created an index on collection: milvus_training_collection


Index created on collection 'milvus_training_collection' for 'raw_text_vector' field. 🔑
Collection 'milvus_training_collection' loaded into memory. 💾
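
As a rough picture of what `nlist` and `nprobe` control in an IVF-style index, the sketch below buckets vectors by their nearest centroid and then scans only the closest buckets at query time. This is a simplified illustration, not Milvus's implementation: the centroids are hand-picked rather than learned, squared Euclidean distance is used instead of COSINE to keep the arithmetic short, and all function names are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign each vector index to the bucket of its nearest centroid."""
    buckets = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: squared_l2(vec, centroids[c]))
        buckets[nearest].append(idx)
    return buckets

def ivf_search(query, vectors, centroids, buckets, nprobe, limit):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: squared_l2(query, centroids[c]))
    candidates = [idx for c in order[:nprobe] for idx in buckets[c]]
    candidates.sort(key=lambda idx: squared_l2(query, vectors[idx]))
    return candidates[:limit]

# Toy 2-D data with two hand-picked centroids (so nlist = 2 in this sketch).
vectors = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
centroids = [[0.1, 0.1], [5.0, 5.0]]
buckets = build_ivf(vectors, centroids)

print(buckets)
print(ivf_search([0.25, 0.0], vectors, centroids, buckets, nprobe=1, limit=2))
```

A larger `nprobe` scans more buckets, which trades speed for a lower chance of missing a true nearest neighbor that landed in a different cluster.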


### Step 5: Perform Similarity Search 🔍

Now that our data is inserted and indexed, we can perform **similarity searches**. We'll take a query text, convert it into an embedding, and then ask Milvus to find the most similar documents.

In [None]:
# Define a query text
query_text = "Help me identify poor hotels with low quality amenities"
# I'm looking for a luxury hotel with excellent service and top-notch amenities.
# Looking for the cheapest hotels, just a place to sleep.
# Show me budget-friendly accommodations for a short, no-frills stay
# Suggest a comfortable and premium hotel ideal for business travelers
# Help me identify poor hotels with low quality amenities

# Generate the embedding for the query text
query_embedding = model.encode([query_text]).tolist()
print(f"Generated embedding for query: '{query_text}' 💬")

# Perform the search
# 'data': The query vector(s)
# 'limit': Number of top-K results to return
# 'output_fields': Which scalar fields to retrieve along with the results
# 'search_params': Parameters specific to the index type (e.g., 'nprobe' for IVF_FLAT)
search_results = client.search(
    collection_name=COLLECTION_NAME,
    data=query_embedding,
    limit=3, # Retrieve top 3 most similar results
    output_fields=["raw_text"], # Get the original text back
    search_params={"metric_type": "COSINE", "params": {"nprobe": 10}}
)

print("\nSearch Results: 🎯")
for hit in search_results[0]: # search_results[0] because we sent one query vector
    # print(f"  ID: {hit['id']}")
    # With the COSINE metric, a higher score means higher similarity
    print(f"  Text: {hit['entity']['raw_text']}")
    print(f"  Similarity (COSINE): {hit['distance']}")
    print("-" * 20)

Generated embedding for query: 'Help me identify poor hotels with low quality amenities' 💬

Search Results: 🎯
  Text: Budget Lodge is a very basic accommodation suitable for short stays. Rooms are sparse and show some wear. Limited amenities, and the area can be noisy at night. Parking is a challenge.
  Similarity (COSINE): 0.486139714717865
--------------------
  Text: The Travelers Rest offers a no-frills, functional stay for budget-conscious visitors. Our rooms are modest but clean, with essential comforts. Free parking and a small diner on-site. Good for an overnight stop.
  Similarity (COSINE): 0.46404412388801575
--------------------
  Text: The Business Elite Hotel offers premium accommodations with a focus on productivity and comfort. Each room includes a dedicated workspace, high-speed internet, and ergonomic chairs. Our facilities boast multiple meeting rooms, a 24/7 fitness center, and express check-in/check-out. Ideal for the discerning corporate traveler.
  Similarity (COSINE): 0.45085960626602173
-------------------
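
Conceptually, a top-K search scores every stored vector against the query and keeps the best `limit` results. The brute-force sketch below shows that idea on made-up 2-D vectors; it also illustrates why cosine similarity and inner product give the same ranking once vectors are scaled to unit length. The rows and the `top_k` helper are hypothetical and do not involve Milvus.

```python
import heapq
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query, rows, limit):
    """Brute-force search: score every row, keep the `limit` best (highest score first)."""
    q = normalize(query)
    scored = []
    for row in rows:
        v = normalize(row["vector"])
        # Inner product of unit vectors equals the cosine similarity of the originals.
        score = sum(a * b for a, b in zip(q, v))
        scored.append((score, row["text"]))
    return heapq.nlargest(limit, scored)

# Invented rows for illustration only.
rows = [
    {"text": "worn rooms, musty smell", "vector": [0.1, 0.9]},
    {"text": "rooftop pool and spa", "vector": [0.9, 0.1]},
    {"text": "basic but clean", "vector": [0.4, 0.6]},
]

results = top_k([0.0, 1.0], rows, limit=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

An index such as IVF_FLAT exists to avoid scoring every row this way once the collection grows large.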

### Step 6: (Optional) Hybrid Search with Metadata Filtering 🧩 + 🔍

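Milvus allows you to combine vector similarity search with **metadata filtering**. This lets you narrow down your search results based on specific attribute values, leading to more precise and relevant outcomes. (The Milvus documentation calls this a *filtered search* and generally reserves "hybrid search" for queries that combine several vector fields.)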

In [None]:

# Perform a search with metadata filter
query_text_filtered = "Find a very confortable hotel"
query_embedding_filtered = model.encode([query_text_filtered]).tolist()

# The 'filter' parameter uses a boolean expression on scalar fields
category = "luxury"
filtered_search_results = client.search(
    collection_name=COLLECTION_NAME,
    data=query_embedding_filtered,
    limit=3,
    output_fields=["raw_text", "category"],
    filter=f"category == '{category}'", # Filter results to only include documents where category matches
    search_params={"metric_type": "COSINE", "params": {"nprobe": 10}}
)

print(f"\nHybrid Search Results (query: '{query_text_filtered}', filtered by category='{category}'): 🎯")
for hit in filtered_search_results[0]:
    print(f"  ID: {hit['id']}")
    print(f"  Text: {hit['entity']['raw_text']}")
    print(f"  category: {hit['entity']['category']}")
    print(f"  Distance: {hit['distance']}")
    print("-" * 20)



Hybrid Search Results (query: 'Find a very confortable hotel', filtered by category='luxury'): 🎯
  ID: 459496067010396201
  Text: Welcome to Serenity Suites, a boutique hotel designed for ultimate relaxation. Nestled amidst lush gardens, it features elegantly appointed rooms with private balconies. Enjoy farm-to-table dining, a serene yoga studio, and personalized service. A tranquil escape just minutes from the bustling city center.
  category: luxury
  Distance: 0.5233446955680847
--------------------
  ID: 459496067010396200
  Text: Experience unparalleled luxury at The Grand Monarch Hotel, where every detail is crafted for your supreme comfort. Our spacious suites offer panoramic city views, plush bedding, and state-of-the-art amenities. Indulge in gourmet dining, unwind at the rooftop infinity pool, or rejuvenate at our exclusive spa. Located in the heart of the city's vibrant district, perfect for both business and leisure.
  category: luxury
  Distance: 0.4614187180995941
-----
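
Filter expressions are plain strings, so it can help to build them with small helpers instead of hand-written f-strings. The sketch below is one possible approach: the helper names are hypothetical, and using `json.dumps` for quoting assumes that its double-quoted output is accepted as a string literal in the filter, which holds for simple values like the categories in this notebook.

```python
import json

def eq_filter(field, value):
    """Build `field == "value"`, letting json.dumps handle quoting and escaping."""
    return f"{field} == {json.dumps(value, ensure_ascii=False)}"

def in_filter(field, values):
    """Build `field in ["a", "b"]` to match any of several values."""
    return f"{field} in {json.dumps(list(values), ensure_ascii=False)}"

def all_of(*clauses):
    """Combine clauses with `and`, parenthesizing each one."""
    return " and ".join(f"({clause})" for clause in clauses)

print(eq_filter("category", "luxury"))
print(in_filter("category", ["luxury", "mid-range"]))
print(all_of(in_filter("category", ["budget", "poor"]), 'raw_text like "The%"'))
```

Any of these strings could be passed as the `filter` argument in the search call above, for example to search across both `luxury` and `mid-range` hotels at once.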

### Step 7: Clean Up (Optional) 🧹

You can drop the collection and close the Milvus Lite client when you're done to free up resources. Closing the client does not delete the data file; the last lines of the cell below remove it explicitly.

In [None]:
# Drop the collection
client.drop_collection(collection_name=COLLECTION_NAME)
print(f"Collection '{COLLECTION_NAME}' dropped. 🗑️")

# Close the Milvus client (important for Milvus Lite to finalize the DB file)
client.close()
print("MilvusClient closed. 🚪")

# You can also manually delete the .db file if you wish
if os.path.exists(DB_FILE_PATH):
    os.remove(DB_FILE_PATH)
    print(f"Milvus Lite database file '{DB_FILE_PATH}' removed. 👋")
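
If a cell fails halfway through, the client may never reach `close()`. One way to guard against that is the standard library's `contextlib.closing`, which calls `close()` even when an exception is raised. The sketch below demonstrates the pattern with a stand-in class (`FakeClient` is hypothetical and only mimics a `close()` method), so it runs without Milvus.

```python
from contextlib import closing

class FakeClient:
    """Stand-in for a client object; only mimics the close() method."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

fake = FakeClient()
try:
    with closing(fake) as client:
        # Simulate an error in the middle of the work.
        raise RuntimeError("simulated failure mid-notebook")
except RuntimeError:
    pass

print(f"close() was called: {fake.closed}")
```

With the real client, the analogous (untested here) form would wrap `MilvusClient(uri=DB_FILE_PATH)` in `closing(...)` in the same way.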

### Next Steps & Further Exploration 💡

* **Experiment with different queries** and observe the search results.
* Try different `limit` values in the search function.
* Explore more advanced **index types** and their parameters (e.g., `HNSW` for even faster searches on very large datasets, though `IVF_FLAT` is often a good start).
* Learn about **partitioning** in Milvus for better data management and query performance.
* Consider **production deployments** of Milvus using Docker or Kubernetes for larger scale applications.
* Integrate Milvus into your own AI applications, like **Retrieval Augmented Generation (RAG)** systems with LLMs.

Thank you for participating! If you have any questions, feel free to ask.