# Milvus

- Author: [hellohotkey](https://github.com/hellohotkey)
- Design: 
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)


## Overview

`Milvus` is an open-source vector database designed to store and search embedding vectors derived from large-scale unstructured data such as text and images. Its core feature is similarity search using metrics such as Euclidean (L2) distance, cosine similarity, and inner product.
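
To make the three metrics concrete, here is a minimal pure-Python sketch. The function names are illustrative and not part of the Milvus API; Milvus computes these metrics internally.

```python
import math

def inner_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line (L2) distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Inner product of the two vectors after scaling each to unit length."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)

a = [1.0, 0.0]
b = [3.0, 4.0]
print(inner_product(a, b))       # 3.0
print(cosine_similarity(a, b))   # 0.6
print(euclidean_distance(a, b))  # sqrt(20), roughly 4.47
```

Note that for distances a smaller value means a closer match, while for cosine similarity and inner product a larger value does.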

This tutorial will guide you through setting up a Milvus environment, creating a collection, and performing basic operations such as inserting and querying vector data. We'll use Python and Jupyter Notebook for our implementation.

**Prerequisites**

- Basic understanding of Python
- Familiarity with machine learning concepts
- Python 3.8 or higher installed
- Basic knowledge of embeddings and vector representations

### Table of Contents
- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Initialization](#initialization)
- [Creating a Collection](#creating-a-collection)
- [Setting Up the Index](#setting-up-the-index)
- [Data Insertion](#data-insertion)
- [Querying Data](#querying-data)

### References

- [Milvus vector database documentation](https://milvus.io/docs/ko)
- [LangChain Milvus](https://python.langchain.com/docs/integrations/vectorstores/milvus/)

----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, functions, and utilities for the tutorials.
- See the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

In [1]:
%%capture --no-stderr
!pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain-anthropic",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_openai",
    ],
    verbose=False,
    upgrade=False,
)

In [None]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Milvus",  # Set this to match the tutorial title
    }
)

You can alternatively set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.

**[Note]** This is not necessary if you've already set the required API keys in previous steps.

In [None]:
# Load API keys from .env file
from dotenv import load_dotenv
load_dotenv(override=True)
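
Whichever way the keys are set, it can help to confirm they are actually present before continuing. The following is a small sketch; `missing_env_vars` is a hypothetical helper, not part of `python-dotenv` or `langchain-opentutorial`.

```python
import os

def missing_env_vars(required):
    """Return the names in `required` that are unset or empty in os.environ."""
    return [name for name in required if not os.environ.get(name)]

missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```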

## Initialization

First, install the necessary packages. This includes `milvus`, `pymilvus`, and `sentence-transformers`.

In [2]:
# Install required packages
!pip install milvus pymilvus sentence-transformers



We'll use Milvus Lite for this tutorial. It is a lightweight version of Milvus that runs locally from the `milvus` Python package, which makes it convenient for development and testing.

In [4]:
# Start Milvus server
from milvus import default_server
default_server.start()

# Establish connection
from pymilvus import connections
connections.connect(
    host="127.0.0.1",
    port=default_server.listen_port
)

### Creating a Collection
In Milvus, a collection is similar to a table in traditional databases. We'll create a collection to store our vector data.

In [5]:
from pymilvus import FieldSchema, CollectionSchema, DataType, Collection

# Define the dimension of our embedding vectors
# Using 384 as it matches the output dimension of the 'all-MiniLM-L12-v2' model
DIMENSION = 384

# Define the schema for our collection
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True), # Primary key; Milvus generates the IDs automatically
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=DIMENSION), # Vector field to store embeddings
    FieldSchema(name="sentence", dtype=DataType.VARCHAR, max_length=65535) # Original sentence text
]

# Create schema with dynamic field support
schema = CollectionSchema(fields=fields, enable_dynamic_field=True)

# Initialize collection
collection = Collection(name="example_name", schema=schema)
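
Rows that do not match the schema are rejected at insert time, so a quick local check can surface problems earlier. This is a minimal sketch: `validate_row` is a hypothetical helper, and it assumes the `VARCHAR` `max_length` limit is counted in bytes rather than characters.

```python
DIMENSION = 384             # same value as in the schema above
MAX_SENTENCE_BYTES = 65535  # same value as max_length above

def validate_row(row, dim=DIMENSION, max_bytes=MAX_SENTENCE_BYTES):
    """Return a list of problems found in one row dict (empty if none)."""
    problems = []
    embedding = row.get("embedding")
    if embedding is None or len(embedding) != dim:
        problems.append(f"embedding must have {dim} values")
    sentence = row.get("sentence")
    if not isinstance(sentence, str):
        problems.append("sentence must be a string")
    elif len(sentence.encode("utf-8")) > max_bytes:  # assumption: limit is in bytes
        problems.append(f"sentence exceeds {max_bytes} bytes")
    return problems

print(validate_row({"embedding": [0.0] * 384, "sentence": "hello"}))  # []
print(validate_row({"embedding": [0.0] * 3, "sentence": "hello"}))
```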

### Setting Up the Index
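An index lets Milvus avoid comparing the query against every stored vector. `IVF_FLAT` groups the vectors into `nlist` clusters, and a search scans only the clusters closest to the query.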

In [7]:
# Define index parameters
index_params = {
    "index_type": "IVF_FLAT",  # Index type for approximate nearest neighbor search
    "metric_type": "L2",       # L2 distance metric
    "params": {"nlist": 128},  # Number of cluster units
}

# Create index on the embedding field
collection.create_index(
    field_name="embedding",
    index_params=index_params
)

# Load the collection into memory; this is required before searching
collection.load()
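
To illustrate what `nlist` means in an IVF-style index, and what the `nprobe` search parameter controls, here is a toy pure-Python sketch. It is not how Milvus is implemented: the centroids are hand-picked for the example, whereas a real IVF index learns them (typically with k-means), and the function names are hypothetical.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector index to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        buckets[nearest].append(idx)
    return buckets

def ivf_search(query, vectors, centroids, buckets, nprobe, limit):
    """Scan only the `nprobe` buckets whose centroids are closest to the query."""
    probed = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [idx for c in probed for idx in buckets[c]]
    return sorted(candidates, key=lambda idx: l2(query, vectors[idx]))[:limit]

# Toy data: two obvious clusters in 2-D.
vectors = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
centroids = [[0.0, 0.0], [5.0, 5.0]]  # nlist = 2
buckets = build_ivf(vectors, centroids)

print(buckets)  # {0: [0, 1], 1: [2, 3]}
print(ivf_search([4.8, 5.1], vectors, centroids, buckets, nprobe=1, limit=2))  # [2, 3]
```

A larger `nprobe` scans more buckets, which improves recall at the cost of more distance computations; with `nprobe` equal to `nlist` the search degenerates to a full scan.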

### Data Insertion
Now we'll prepare and insert data into our collection.

In [8]:
from sentence_transformers import SentenceTransformer

# Initialize the sentence transformer model
transformer = SentenceTransformer('all-MiniLM-L12-v2')

In [9]:
# Read and process sample text
with open("data/the_little_prince.txt", "r", encoding="utf-8") as f:
    text = f.read()

In [10]:
# Split text into sentences
sentences = text.split(".")
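
Splitting on `"."` is simple, but it drops the period, ignores `?` and `!`, and leaves line breaks and chapter headings attached to neighbouring sentences. A slightly more careful alternative, offered as a sketch rather than the tutorial's method, uses a regular expression. `split_sentences` is a hypothetical helper, the sample text is made up for illustration, and abbreviations such as "Mr." would still be split incorrectly.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace; drop empty pieces."""
    pieces = re.split(r"(?<=[.!?])\s+", text)
    return [piece.strip() for piece in pieces if piece.strip()]

sample = "Where do you come from? I am looking for friends. Draw me a sheep!"
print(split_sentences(sample))
# ['Where do you come from?', 'I am looking for friends.', 'Draw me a sheep!']
```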

In [11]:
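# Prepare data for insertion: strip whitespace and skip empty sentences
cleaned = [s.strip() for s in sentences if s.strip()]

# Encode all sentences in one batched call instead of one call per sentence
embeddings = transformer.encode(cleaned)

milvus_input = [
    {"embedding": embedding.tolist(), "sentence": sentence}
    for embedding, sentence in zip(embeddings, cleaned)
]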

In [12]:
# Insert data into collection
collection.insert(milvus_input)
collection.flush()  # Ensure data is persisted
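
For larger datasets, sending every row in a single `insert` call may produce a very large request, so inserting in fixed-size batches is a common pattern. The sketch below shows only the batching logic; `batched` is a hypothetical helper, the batch size is arbitrary, and `rows` is a stand-in for `milvus_input`.

```python
def batched(rows, batch_size):
    """Yield consecutive slices of `rows`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

rows = [{"sentence": f"sentence {i}"} for i in range(5)]  # stand-in for milvus_input
for batch in batched(rows, batch_size=2):
    # In the notebook, this is where collection.insert(batch) would go.
    print(len(batch))
```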

### Querying Data
Now let's try a simple similarity search using our vector database. We'll search for sentences that are similar to our query.

In [13]:
# Prepare query
query = "Can you tell me about the Little Prince's travels?"
query_embedding = transformer.encode(query)

In [14]:
# Search for similar sentences
results = collection.search(
    data=[query_embedding],      # Our query embedding
    anns_field="embedding",      # Field to search
    param={"metric_type": "L2", "params": {"nprobe": 10}},  # Same metric as the index; nprobe = clusters to scan
    limit=3,                     # Return top 3 results
    output_fields=["sentence"]   # Return the sentence field
)

In [15]:
# Print results in a simple format
print(f"Search query: {query}\n")
for hits in results:
    print("Similar sentences found:")
    for i, hit in enumerate(hits, 1):
        print(f"{i}. {hit.entity.get('sentence')}")

Search query: Can you tell me about the Little Prince's travels?

Similar sentences found:
1. It is here that the little prince appeared on Earth, and disappeared
2. " 

[ Chapter 20 ]
- the little prince discovers a garden of roses
But it happened that after walking for a long time through sand, and rocks, and snow, the little prince at last came upon a road
3. "Are they pursuing the first travelers?" demanded the little prince
