# Full-Text Search Implementation with Milvus
# Introduction

This notebook demonstrates how to implement full-text search using Milvus, an open-source vector database.

**Full text search** simplifies the process of text-based searching by eliminating the need for manual embedding generation. This feature operates through the following workflow:

1. **Text input**: You insert raw text documents or provide query text without needing to manually embed them.
2. **Text analysis**: Milvus uses an analyzer to tokenize the input text into individual, searchable terms.
3. **Function processing**: The built-in function receives tokenized terms and converts them into sparse vector representations.
4. **Collection store**: Milvus stores these sparse embeddings in a collection for efficient retrieval.
5. **BM25 scoring**: During a search, Milvus applies the BM25 algorithm to calculate scores for the stored documents and ranks matched results based on their relevance to the query text.
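
To make the scoring step concrete, here is a minimal pure-Python sketch of the classic BM25 formula. It is an illustration only, not Milvus's internal implementation: the regex tokenizer stands in for a Milvus analyzer, and the `k1` and `b` defaults are common textbook values assumed for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score every document against the query with the classic BM25 formula."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: longer-than-average documents are penalized
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "Deep Learning approaches in NLP",
    "BM25 algorithm explained in detail",
    "Vector databases and their role in modern search systems",
]
scores = bm25_scores("deep learning", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
```

Only the first document shares terms with the query, so it is the only one with a non-zero score.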

Key components we'll cover:
1. Setting up Milvus connection and schema
2. Creating and inserting sample documents
3. Implementing search functionality
4. Demonstrating search with example queries



# Section 1: Import Dependencies

First, we'll import all necessary libraries:
- pymilvus: For interacting with the Milvus database
- sentence_transformers: For dense text embeddings (imported here but not used below, since BM25 full-text search needs no embedding model)
- logging: For proper error tracking
- pandas: For organizing and displaying results


In [1]:
from pymilvus import MilvusClient, DataType, Function, FunctionType
from sentence_transformers import SentenceTransformer
import logging
import pandas as pd

In [2]:
# Set up logging configuration
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


# Section 2: Initialize Connections

Here we initialize our connection to Milvus.
Note: Replace the placeholder connection parameters with your actual Milvus server details.


In [3]:
# Initialize Milvus client
client = MilvusClient(
    uri="http://<host>:<port>",  # Milvus endpoint built from host and port
    user="<user>",
    password="<password>",
    secure=True,  # enable TLS
    server_pem_path="<path_of_cert>",  # server certificate used to verify the connection
    server_name="<servername>",
)
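
Rather than hard-coding credentials in the notebook, the connection parameters can be read from the environment. The sketch below builds the same keyword arguments used above. The `MILVUS_*` variable names and the `milvus_connection_kwargs` helper are hypothetical and should be adapted to your deployment; 19530 is the default Milvus port.

```python
import os

def milvus_connection_kwargs(env=os.environ):
    """Build MilvusClient keyword arguments from environment variables.

    The MILVUS_* variable names are hypothetical; rename them to suit your deployment.
    """
    host = env.get("MILVUS_HOST", "localhost")
    port = env.get("MILVUS_PORT", "19530")
    kwargs = {
        "uri": f"http://{host}:{port}",
        "user": env.get("MILVUS_USER", ""),
        "password": env.get("MILVUS_PASSWORD", ""),
    }
    # Add TLS settings only when a certificate path is configured
    cert_path = env.get("MILVUS_SERVER_PEM")
    if cert_path:
        kwargs.update(
            secure=True,
            server_pem_path=cert_path,
            server_name=env.get("MILVUS_SERVER_NAME", host),
        )
    return kwargs

example = milvus_connection_kwargs(
    {"MILVUS_HOST": "milvus.example.com", "MILVUS_SERVER_PEM": "/certs/server.pem"}
)
print(example["uri"])
```

The result can then be unpacked into the client constructor, for example `client = MilvusClient(**milvus_connection_kwargs())`.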

# Section 3: Sample Data

For demonstration purposes, we'll create a set of sample documents related to
machine learning and NLP. In a real application, you would replace these with
your actual documents.


In [4]:

sample_documents = [
    "Document 1: Introduction to Natural Language Processing and its applications",
    "Document 2: Machine Learning algorithms for text classification",
    "Document 3: Deep Learning approaches in NLP",
    "Document 4: Understanding word embeddings and their importance",
    "Document 5: Text preprocessing techniques in NLP",
    "Document 6: Vector databases and their role in modern search systems",
    "Document 7: Semantic search implementations using deep learning",
    "Document 8: BM25 algorithm explained in detail",
    "Document 9: Comparing different text similarity metrics",
    "Document 10: Best practices for implementing full-text search"
]


# Section 4: Schema Definition

To enable full-text search, create a collection with a schema that includes three required fields:

- id: The primary field, which uniquely identifies each entity (document) in the collection.
- text: A VARCHAR field that stores the raw document text, with the enable_analyzer attribute set to True. This allows Milvus to tokenize the text into terms for function processing.
- sparse: A SPARSE_FLOAT_VECTOR field reserved for the sparse embeddings that Milvus automatically generates from the VARCHAR field. It holds the BM25 vector representation of the text.
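
Because the text field has a fixed maximum length, it can help to check documents before inserting them. The following sketch separates documents that fit from those that do not. The `split_by_length` helper is hypothetical, and it assumes the limit is counted in UTF-8 bytes, which you should verify for your Milvus version.

```python
MAX_TEXT_BYTES = 1000  # mirrors max_length=1000 on the "text" field

def split_by_length(documents, max_bytes=MAX_TEXT_BYTES):
    """Separate documents that fit the VARCHAR limit from those that do not.

    Assumes the limit is counted in UTF-8 bytes; verify this for your Milvus version.
    """
    accepted, rejected = [], []
    for doc in documents:
        size = len(doc.encode("utf-8"))
        (accepted if size <= max_bytes else rejected).append(doc)
    return accepted, rejected

accepted, rejected = split_by_length(["short document", "x" * 1001])
print(len(accepted), len(rejected))
```

Rejected documents could be truncated, split into chunks, or stored in a collection with a larger limit, depending on the application.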


In [5]:
def create_schema(collection_name):
    schema = client.create_schema()
    schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
    schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=1000, enable_analyzer=True)
    schema.add_field(field_name="sparse", datatype=DataType.SPARSE_FLOAT_VECTOR)
    
    # Add a BM25 function to the schema: it converts the "text" field
    # into sparse vector representations stored in the "sparse" field
    bm25_function = Function(
        name="text_bm25_emb",
        input_field_names=["text"],
        output_field_names=["sparse"],
        function_type=FunctionType.BM25,
    )
    schema.add_function(bm25_function)
    return schema

# Section 5: Index Configuration

The index parameters define how Milvus will index our data for efficient search.
Here we set up a sparse inverted index optimized for BM25 search.


In [6]:

def create_index_params():
    index_params = client.prepare_index_params()
    index_params.add_index(
        field_name="sparse",
        index_type="SPARSE_INVERTED_INDEX", 
        metric_type="BM25"
    )
    return index_params
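
To see why an inverted index suits keyword search, consider this small pure-Python sketch. It maps each term to the documents that contain it, so a query only has to visit documents sharing at least one term with it. This is a conceptual illustration with hypothetical helper names, not a description of how Milvus lays out its index internally.

```python
import re
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to {doc_id: term_frequency} postings."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents):
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def candidate_docs(index, query):
    """Return ids of documents sharing at least one term with the query."""
    ids = set()
    for term in re.findall(r"[a-z0-9]+", query.lower()):
        ids.update(index.get(term, {}))
    return sorted(ids)

docs = [
    "Deep Learning approaches in NLP",
    "Text preprocessing techniques in NLP",
    "BM25 algorithm explained in detail",
]
index = build_inverted_index(docs)
print(candidate_docs(index, "NLP preprocessing"))
```

Documents that share no terms with the query are never touched, which is what keeps lookups cheap as the collection grows.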


# Section 6: Collection Setup

This section creates the collection and inserts the documents.
If a collection with the same name already exists, it is dropped first, so do not point this at a collection you want to keep. Errors are printed and then re-raised.


In [7]:
def setup_collection(collection_name, documents):
   
    try:
        # Check if collection exists and drop it
        if collection_name in client.list_collections():
            print(f"Dropping existing collection: {collection_name}")
            client.drop_collection(collection_name)
        
        # Create new collection
        schema = create_schema(collection_name)
        index_params = create_index_params()
        
        client.create_collection(
            collection_name=collection_name,
            schema=schema,
            index_params=index_params
        )
        
        # Prepare and insert data
        documents_to_insert = [{'text': doc} for doc in documents]
        client.insert(collection_name, documents_to_insert)
        
    except Exception as e:
        print(f"Error setting up collection: {str(e)}")
        raise
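
The function above inserts all documents in a single call, which is fine for ten sample documents. For larger corpora, one option is to insert in batches. Below is a minimal sketch of a batching helper; the `to_batches` name and the default batch size are assumptions for illustration, not Milvus recommendations.

```python
def to_batches(documents, batch_size=1000):
    """Yield lists of {'text': ...} rows, at most batch_size rows each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(documents), batch_size):
        yield [{"text": doc} for doc in documents[start:start + batch_size]]

batches = list(to_batches([f"doc {i}" for i in range(5)], batch_size=2))
print([len(batch) for batch in batches])
```

Inside `setup_collection`, the single insert could then become a loop such as `for batch in to_batches(documents): client.insert(collection_name, batch)`.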



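# Section 7: Search Function

The search function passes the raw query text to Milvus, which tokenizes it and scores the stored documents with BM25. Results are deduplicated by text and returned as a pandas DataFrame.

In [8]: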
def perform_search(collection_name, query_text, top_k=3):
    try:
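        # drop_ratio_search: share of the smallest query-term weights ignored
        # at search time (nprobe applies to IVF indexes, so it is omitted here)
        search_params = {
            "params": {
                "drop_ratio_search": 0.2
            }
        }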
        
        results = client.search(
            collection_name=collection_name,
            data=[query_text],
            anns_field="sparse",
            limit=top_k,
            output_fields=["text"],
            search_params=search_params,
        )
        
        # Process and deduplicate results
        if results and len(results) > 0:
            seen_texts = set()
            deduplicated_results = []
            
            for hit in results[0]:
                text = hit.get("entity", {}).get("text", "")
                if text not in seen_texts:
                    seen_texts.add(text)
                    deduplicated_results.append({
                        'text': text,
                        'distance': hit.get('distance', 0.0)
                    })
            
            return pd.DataFrame(deduplicated_results)
        return pd.DataFrame(columns=['text', 'distance'])
    
    except Exception as e:
        print(f"Search error: {str(e)}")
        raise
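
The `drop_ratio_search` parameter tells Milvus to ignore a proportion of the smallest values in the query vector, trading a little accuracy for speed. The sketch below illustrates the idea on a hand-made query vector. It is a conceptual illustration, not Milvus's internal logic, and the term weights are invented for the example.

```python
def drop_smallest(query_weights, drop_ratio=0.2):
    """Return a copy of query_weights without the lowest-weighted share of terms."""
    if not 0.0 <= drop_ratio < 1.0:
        raise ValueError("drop_ratio must be in [0, 1)")
    n_drop = int(len(query_weights) * drop_ratio)
    # Sort terms by weight, smallest first, and skip the first n_drop of them
    ranked = sorted(query_weights.items(), key=lambda item: item[1])
    return dict(ranked[n_drop:])

weights = {"what": 0.1, "is": 0.2, "text": 1.1, "preprocessing": 2.3, "nlp": 1.7}
kept = drop_smallest(weights)
print(sorted(kept))
```

With five terms and a ratio of 0.2, the single lowest-weighted term ("what") is dropped, which mirrors how low-information query words contribute little to the final ranking.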


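## Example Queries

Now let's demonstrate the search functionality with example queries.
We'll create a collection, insert our sample documents, and run several searches. In the results, the `distance` column holds the BM25 relevance score, so higher values indicate better matches.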

In [10]:
# Create and populate collection
collection_name = "demo_search"
setup_collection(collection_name, sample_documents)

# Example queries to demonstrate different search scenarios
print("\nPerforming Example Searches")
print("=" * 50)

queries = [
    "What is natural language processing ?",
    "What are search algorithms?",
    "Tell me about deep learning.",
    "What is text preprocessing? "
]

# Display results for each query
for query in queries:
    print(f"\nQuery: {query}")
    print("-" * 50)
    
    results_df = perform_search(collection_name, query, top_k=3)
    
    if not results_df.empty:
        print(results_df.to_string(index=False))
    else:
        print("No results found.")

Dropping existing collection: demo_search

Performing Example Searches

Query: What is natural language processing ?
--------------------------------------------------
                                                                        text  distance
Document 1: Introduction to Natural Language Processing and its applications  5.484756

Query: What are search algorithms?
--------------------------------------------------
                                                           text  distance
Document 2: Machine Learning algorithms for text classification  2.012511
Document 7: Semantic search implementations using deep learning  1.156673
  Document 10: Best practices for implementing full-text search  1.101183

Query: Tell me about deep learning.
--------------------------------------------------
                                                           text  distance
                    Document 3: Deep Learning approaches in NLP  2.794006
Document 7: Semantic search implementat

## Conclusion
Full-text search in Milvus is well suited to keyword-based retrieval and simplifies the pipeline by automating text preprocessing. Unlike dense-vector search, it lets you pass raw text directly, with no manual tokenization or embedding generation: the analyzer and the built-in BM25 function handle both. This makes it a convenient option for applications that need keyword matching and relevance ranking.