# ReadtheDocs Retrieval Augmented Generation (RAG) using Milvus Client

In this notebook, we are going to use Milvus documentation pages to create a chatbot about our product.

The chatbot follows the RAG steps: it retrieves chunks of data using semantic vector search, then the Question + Context are fed as a prompt to an LLM to generate an answer.

<div>
<img src="../../../images/rag_image.png" width="80%"/>
</div>

Many RAG demos use OpenAI for the Embedding Model and ChatGPT for the Generative AI model.  In this notebook, we will demo a fully open source RAG stack - open source embedding model available on HuggingFace, Milvus, and an open source LLM.

Let's get started!

In [1]:
# For colab install these libraries in this order:
# !pip install milvus pymilvus langchain torch transformers sentence-transformers python-dotenv accelerate

# Import common libraries.
import time
import pandas as pd
import numpy as np

## Download Milvus documentation to a local directory.

In [2]:
# # Uncomment to download readthedocs page locally.

# DOCS_PAGE="https://pymilvus.readthedocs.io/en/latest/"
# !echo $DOCS_PAGE

# # Specify encoding to handle non-unicode characters in documentation.
# !wget -r -A.html -P rtdocs --header="Accept-Charset: UTF-8" $DOCS_PAGE

## Connect to a Milvus server.

Code in this notebook uses fully-managed Milvus on [Zilliz Cloud free trial](https://cloud.zilliz.com/login).  Choose the default "Starter" option when you provision > Create collection > Give it a name > Create cluster and collection.
- pip install pymilvus

💡 **For production purposes**, use a local Milvus docker, Milvus clusters, or fully-managed Milvus on Zilliz Cloud.
- [Local Milvus docker](https://milvus.io/docs/install_standalone-docker.md) requires local docker installed and running.
- [Milvus clusters](https://milvus.io/docs/install_cluster-milvusoperator.md) requires a K8s cluster up and running.
- [Milvus client](https://milvus.io/docs/using_milvusclient.md) with [Milvus lite](https://milvus.io/docs/milvus_lite.md), which runs a local server.  ⛔️ Milvus lite is only meant for demos and local testing.

💡 Note: To keep your tokens private, best practice is to use an environment variable.
In Jupyter, you need a .env file (in the same directory as the notebook) containing lines like this:
- VARIABLE_NAME=value
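
A minimal sketch of failing fast when a required variable is missing, rather than discovering it later as a connection error. The helper name `require_env` and the variable name `DEMO_TOKEN` are hypothetical and used only for illustration; the notebook itself reads `ZILLIZ_API_KEY`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add a line like {name}=value to your .env file."
        )
    return value


# Illustration only: DEMO_TOKEN is a placeholder, never hard-code a real token.
os.environ["DEMO_TOKEN"] = "placeholder-token"
print(require_env("DEMO_TOKEN"))
```

In the notebook, the same check could be applied to the value loaded by `load_dotenv()` before connecting.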


In [3]:
from pymilvus import connections, utility

import os
from dotenv import load_dotenv
load_dotenv()
TOKEN = os.getenv("ZILLIZ_API_KEY")

# Connect to Zilliz cloud.
CLUSTER_ENDPOINT="https://in03-e3348b7ab973336.api.gcp-us-west1.zillizcloud.com:443"
connections.connect(
  alias='default',
  #  Public endpoint obtained from Zilliz Cloud
  uri=CLUSTER_ENDPOINT,
  # API key or a colon-separated cluster username and password
  token=TOKEN,
)

# Check that the server is reachable by printing its version.
print(f"Type of server: {utility.get_server_version()}")

Type of server: zilliz_cloud


## Load the Embedding Model checkpoint and use it to create vector embeddings
**Embedding model:**  We will use the open-source [sentence transformers](https://www.sbert.net/docs/pretrained_models.html) available on HuggingFace to encode the documentation text.  We will download the model from HuggingFace and run it locally.  We'll save the model's generated embeddings to a pandas dataframe and then into the Milvus database.

Two model parameters of note below:
1. EMBEDDING_LENGTH refers to the dimensionality or length of the embedding vector. In this case, each input text is encoded into a single vector of length 768, regardless of how long the text is. This size of embedding is typical of BERT-base models, whose embeddings are used for downstream tasks such as classification, question answering, or text generation. <br><br>
2. MAX_SEQ_LENGTH is the maximum length the encoder model can handle for input sequences. In this case, if sequences longer than 512 tokens are given to the model, everything longer will be (silently!) chopped off.  This is the reason why a chunking strategy is needed to segment input texts into chunks with lengths that will fit in the model's input.
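
To make the silent truncation concrete, here is a small sketch. It assumes a whitespace split as a stand-in for the model's real tokenizer (which would produce sub-word tokens), and the helper name `truncate_tokens` is hypothetical.

```python
def truncate_tokens(text: str, max_seq_length: int) -> tuple:
    """Split text on whitespace (a stand-in for a real tokenizer) and
    return (kept_tokens, dropped_tokens)."""
    tokens = text.split()
    return tokens[:max_seq_length], tokens[max_seq_length:]


text = "Milvus is a vector database built for similarity search at scale"
kept, dropped = truncate_tokens(text, max_seq_length=5)
print(kept)     # ['Milvus', 'is', 'a', 'vector', 'database']
print(dropped)  # ['built', 'for', 'similarity', 'search', 'at', 'scale']
```

Everything in `dropped` would never reach the encoder, which is why long texts are chunked before embedding.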

In [4]:
# Import torch.
import torch
from torch.nn import functional as F
from sentence_transformers import SentenceTransformer

# Initialize torch settings
torch.backends.cudnn.deterministic = True
# Use the default GPU if one is available, otherwise fall back to CPU.
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"device: {DEVICE}")

# Load the model from huggingface model hub.
model_name = "BAAI/bge-base-en-v1.5"
encoder = SentenceTransformer(model_name, device=DEVICE)
print(type(encoder))
print(encoder)

# Get the model parameters and save for later.
MAX_SEQ_LENGTH = encoder.get_max_seq_length() 
HF_EOS_TOKEN_LENGTH = 1
EMBEDDING_LENGTH = encoder.get_sentence_embedding_dimension()

# Inspect model parameters.
print(f"model_name: {model_name}")
print(f"EMBEDDING_LENGTH: {EMBEDDING_LENGTH}")
print(f"MAX_SEQ_LENGTH: {MAX_SEQ_LENGTH}")

device: cpu
<class 'sentence_transformers.SentenceTransformer.SentenceTransformer'>
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
model_name: BAAI/bge-base-en-v1.5
EMBEDDING_LENGTH: 768
MAX_SEQ_LENGTH: 512


## Create a Milvus collection

You can think of a collection in Milvus like a "table" in SQL databases.  The **collection** will contain the following:
- **Schema** (or no-schema Milvus Client).  
💡 You'll need the vector `EMBEDDING_LENGTH` parameter from your embedding model.
- **Vector index** for efficient vector search
- **Vector distance metric** for measuring nearest neighbor vectors
- **Consistency level**
In Milvus, transactional consistency is possible; however, according to the [CAP theorem](https://en.wikipedia.org/wiki/CAP_theorem), some latency must be sacrificed. 💡 Searching documentation pages is not mission-critical, so [`eventually`](https://milvus.io/docs/consistency.md) consistent is fine here.

Some supported [data types](https://milvus.io/docs/schema.md) for Milvus schemas are:
- INT64 - primary key
- VARCHAR - raw texts
- FLOAT_VECTOR - embeddings = list of `numpy.ndarray` of `numpy.float32` numbers

In [5]:
from pymilvus import (
    FieldSchema, DataType, 
    CollectionSchema, Collection)

# 1. Name your collection.
COLLECTION_NAME = "MIlvusDocs"

# 2. Use embedding length from the embedding model.
print(f"Embedding length: {EMBEDDING_LENGTH}")

# 3. Define minimum required fields.
fields = [
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=EMBEDDING_LENGTH),
]

# 4. Create schema with dynamic field enabled.
schema = CollectionSchema(
    fields,
    description="The schema for docs pages",
    enable_dynamic_field=True
)
mc = Collection(COLLECTION_NAME, schema, consistency_level="Eventually")

print(f"Created collection: {COLLECTION_NAME}")
print(f"Schema: {mc.schema}")

Embedding length: 768
Created collection: MIlvusDocs
Schema: {'auto_id': True, 'description': 'The schema for docs pages', 'fields': [{'name': 'pk', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': True}, {'name': 'vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}], 'enable_dynamic_field': True}


## Add a Vector Index

The vector index determines the vector **search algorithm** used to find the closest vectors in your data to the query a user submits.  Most vector indexes use different sets of parameters depending on whether the database is:
- **inserting vectors** (creation mode) - vs - 
- **searching vectors** (search mode) 

Scroll down the [docs page](https://milvus.io/docs/index.md) to see a table listing different vector indexes available on Milvus.  For example:
- FLAT - deterministic exhaustive search
- IVF_FLAT or IVF_SQ8 - Inverted file (cluster-based) index (stochastic approximate search)
- HNSW - Graph index (stochastic approximate search)
- AUTOINDEX - Automatically determined by Milvus based on local vs cloud, type of GPU, size of data.

Besides a search algorithm, we also need to specify a **distance metric**, that is, a definition of what is considered "close" in vector space.  In the cell below, the `AUTOINDEX` search index is chosen.  As with [`HNSW`](https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md), its possible distance metrics are one of:
- L2 - L2-norm
- IP - Dot-product
- COSINE - Cosine similarity (angle between vectors)

💡 Most use cases work better with normalized embeddings, in which case IP and COSINE give identical scores and L2 produces the same ranking (for unit vectors, squared L2 distance equals 2 - 2 × IP).  L2 only gives a different ordering when embeddings are left unnormalized.
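
The relationship between the three metrics on normalized vectors can be checked numerically. This is a plain-Python sketch with two arbitrary example vectors; the helper names are hypothetical and none of this calls Milvus.

```python
import math


def dot(a, b):
    """Inner product (IP) of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def l2_squared(a, b):
    """Squared Euclidean (L2) distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = normalize([3.0, 4.0])
b = normalize([1.0, 2.0])
ip = dot(a, b)

# On unit vectors, IP equals COSINE.
print(math.isclose(ip, cosine(a, b)))
# On unit vectors, squared L2 distance equals 2 - 2 * IP.
print(math.isclose(l2_squared(a, b), 2 - 2 * ip))
```

Both checks print `True`, so on normalized embeddings all three metrics order neighbors the same way.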

In [6]:
# Add a default search index to the collection.

# Drop the index, in case it already exists.
mc.drop_index()

index_params = {
    "index_type": "AUTOINDEX",
    "metric_type": "COSINE", 
    # No params for AUTOINDEX
    # "params": {}
    }

# Specify column name which contains the vector.
mc.create_index(
    field_name="vector", 
    index_params=index_params)

Status(code=0, message=)

In [7]:
## Read docs into LangChain
#!pip install langchain 
from langchain.document_loaders import ReadTheDocsLoader

loader = ReadTheDocsLoader("rtdocs/pymilvus.readthedocs.io/en/latest/", features="html.parser")
docs = loader.load()

num_documents = len(docs)
print(f"loaded {num_documents} documents")

loaded 15 documents


## Chunking

Before embedding, it is necessary to decide your chunk strategy, chunk size, and chunk overlap.  In this demo, I will use:
- **Strategy** = Use HTML header hierarchies.  Split sections further if too long.
- **Chunk size** = Use the embedding model's parameter `MAX_SEQ_LENGTH`.  Note that the splitter below measures length in characters (`length_function = len`), so in practice each chunk stays comfortably under the model's token limit.
- **Overlap** = Rule-of-thumb 10-15%
- **Function** = 
  - Langchain's `HTMLHeaderTextSplitter` to split on HTML headers.
  - Langchain's `RecursiveCharacterTextSplitter` to split up long sections recursively.
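
To illustrate how chunk size and overlap interact, here is a simplified fixed-window chunker. It is a sketch only: unlike `RecursiveCharacterTextSplitter`, it does not look for separators, and the function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, overlap_ratio: float = 0.10) -> list:
    """Split text into windows of chunk_size characters, where each window
    repeats the last `overlap` characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    overlap = int(round(chunk_size * overlap_ratio))
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap_ratio=0.2))
# ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk shares two characters with its neighbor, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.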


In [8]:
from langchain.text_splitter import HTMLHeaderTextSplitter, RecursiveCharacterTextSplitter

# Define the headers to split on for the HTMLHeaderTextSplitter
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]
# Create an instance of the HTMLHeaderTextSplitter
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# Use the embedding model parameters.
chunk_size = MAX_SEQ_LENGTH - HF_EOS_TOKEN_LENGTH
# Cast to int: the splitter expects an integer number of characters.
chunk_overlap = int(np.round(chunk_size * 0.10, 0))

# Create an instance of the RecursiveCharacterTextSplitter
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size = chunk_size,
    chunk_overlap = chunk_overlap,
    length_function = len,
)

# Split the HTML text using the HTMLHeaderTextSplitter.
start_time = time.time()
html_header_splits = []
for doc in docs:
    splits = html_splitter.split_text(doc.page_content)
    for split in splits:
        # Add the source URL and header values to the metadata
        metadata = {}
        new_text = split.page_content
        for header_name, metadata_header_name in headers_to_split_on:
            header_value = new_text.split("¶ ")[0].strip()
            metadata[header_name] = header_value
            try:
                new_text = new_text.split("¶ ")[1].strip()
            except IndexError:
                # No further header marker in this split.
                break
        split.metadata = {
            **metadata,
            "source": doc.metadata["source"]
        }
        # The page content already begins with the header text, so it is kept as-is.
    html_header_splits.extend(splits)

# Split the documents further into smaller, recursive chunks.
chunks = child_splitter.split_documents(html_header_splits)

end_time = time.time()
print(f"chunking time: {end_time - start_time}")
print(f"docs: {len(docs)}, split into: {len(html_header_splits)}")
print(f"split into chunks: {len(chunks)}, type: list of {type(chunks[0])}") 

# Inspect chunks.
print()
print("Looking at a sample chunk...")
print(chunks[1].metadata)
print(chunks[1].page_content[:100])

# TODO - remove this before saving in github.
# # Print the child splits with their associated header metadata
# print()
# for child in chunks:
#     print(f"Content: {child.page_content}")
#     print(f"Metadata: {child.metadata}")
#     print()

chunking time: 0.01832103729248047
docs: 15, split into: 15
split into chunks: 159, type: list of <class 'langchain.schema.document.Document'>

Looking at a sample chunk...
{'h1': 'Installation', 'h2': 'Installing via pip', 'source': 'rtdocs/pymilvus.readthedocs.io/en/latest/install.html'}
demonstrate how to install and using PyMilvus in a virtual environment. See virtualenv for more info


In [9]:
# Clean up the metadata urls
for doc in chunks:
    new_url = doc.metadata["source"]
    new_url = new_url.replace("rtdocs", "https:/")
    doc.metadata.update({"source": new_url})

print(chunks[0].metadata)
print(chunks[0].page_content[:100])

{'h1': 'Installation', 'h2': 'Installing via pip', 'source': 'https://pymilvus.readthedocs.io/en/latest/install.html'}
Installation¶ Installing via pip¶ PyMilvus is in the Python Package Index. PyMilvus only support pyt


## Insert data into Milvus

Milvus and Milvus Lite support loading pandas dataframes directly.

Milvus Client, however, requires converting the pandas dataframe into a list of dictionaries first.
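
As a sketch of the row format, each dictionary holds the vector plus the dynamic fields to store alongside it. The `Chunk` class below is a hypothetical stand-in for a LangChain `Document`, `to_row` is an illustrative helper, and the 3-element vector is a placeholder for a real 768-dimensional embedding.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def to_row(chunk, vector, max_header_len=50):
    """Build one insertable row; missing headers fall back to an empty string."""
    return {
        "vector": vector,
        "chunk": chunk.page_content,
        "source": chunk.metadata.get("source", ""),
        "h1": chunk.metadata.get("h1", "")[:max_header_len],
        "h2": chunk.metadata.get("h2", "")[:max_header_len],
    }


chunk = Chunk(
    "PyMilvus is in the Python Package Index.",
    {"h1": "Installation",
     "source": "https://pymilvus.readthedocs.io/en/latest/install.html"},
)
row = to_row(chunk, vector=[0.1, 0.2, 0.3])
print(row["h1"], repr(row["h2"]))  # Installation ''
```

A list of such rows is the shape passed to the insert call in the cells that follow.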


In [10]:
# Convert chunks and embeddings to a list of dictionaries.
chunk_list = []
for chunk in chunks:
    embeddings = torch.tensor(encoder.encode([chunk.page_content]))
    embeddings = F.normalize(embeddings, p=2, dim=1)
    converted_values = list(map(np.float32, embeddings))[0]
    
    # Only use h1, h2. Truncate the metadata in case too long.
    # Use .get() so a chunk without a given header falls back to "".
    chunk_dict = {
        'vector': converted_values,
        'chunk': chunk.page_content,
        'source': chunk.metadata['source'],
        'h1': chunk.metadata.get('h1', "")[:50],
        'h2': chunk.metadata.get('h2', "")[:50],
    }
    chunk_list.append(chunk_dict)

# # TODO - remove this before saving in github.
# for chunk in chunk_list[:1]:
#     print(chunk)

In [11]:
# Insert a batch of data into the Milvus collection.

print("Start inserting entities")
start_time = time.time()
insert_result = mc.insert(chunk_list)

end_time = time.time()
print(f"Milvus insert time for {len(chunk_list)} vectors: {end_time - start_time} seconds")

# After the final entity is inserted, call flush to seal any growing segments still in memory.
mc.flush() 

# Inspect results.
print(insert_result)
print(mc.partitions) # list[Partition] objects


Start inserting entities
Milvus insert time for 159 vectors: 0.9112908840179443 seconds
(insert count: 159, delete count: 0, upsert count: 0, timestamp: 445785021399957506, success count: 159, err count: 0)
[{"name":"_default","collection_name":"MIlvusDocs","description":""}]


## Run a Semantic Search

Now we can search all the documentation embeddings to find the `TOP_K` documentation chunks with the closest embeddings to a user's query.
- In this example, we'll ask about AUTOINDEX.

💡 Always embed queries with the same model used to embed the documents, so that both sit in the same vector space.
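
Conceptually, a `TOP_K` search scores every stored vector against the query and keeps the best `k`. The brute-force sketch below shows that idea in plain Python with made-up 2-D vectors; it is not how Milvus implements search (an index such as AUTOINDEX avoids scanning everything), and the helper names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the k (score, index) pairs with the highest cosine similarity."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors for illustration only.
doc_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
query_vec = [0.9, 0.1]

for score, idx in top_k(query_vec, doc_vecs, k=2):
    print(idx, round(score, 3))
```

The two vectors pointing most nearly in the query's direction (indexes 0 and 1) come back first.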

## Ask a question about your data

So far in this demo notebook: 
1. Your custom data has been mapped into a vector embedding space
2. Those vector embeddings have been saved into a vector database

Next, you can ask a question about your custom data!

💡 In LLM lingo:
> **Query** is the generic term for user questions.  
A query is a list of multiple individual questions, up to maybe 1000 different questions!

> **Question** usually refers to a single user question.  
In our example below, the user question is "what is the default distance metric used in AUTOINDEX?"

In [12]:
# Define a sample question about your data.
question = "what is the default distance metric used in AUTOINDEX?"
query = [question]

# Inspect the length of the query (in characters).
QUERY_LENGTH = len(query[0])
print(f"query length: {QUERY_LENGTH}")

query length: 54


## Execute a vector search

Search Milvus using [PyMilvus API](https://milvus.io/docs/search.md).

💡 By their nature, vector searches are "semantic" searches.  For example, if you were to search for "leaky faucet": 
> **Traditional keyword search** - either or both words "leaky", "faucet" would have to match some text in order to return a web page or link text to the document.

> **Semantic search** - results containing the words "drippy" and "taps" would be returned as well, because these words mean the same thing even though they are different words.
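
The difference can be sketched in a few lines. The 2-D "embeddings" below are hand-assigned toy values chosen purely to illustrate the idea, not output from any real model, and the function names are hypothetical.

```python
import math


def keyword_match(query: str, document: str) -> bool:
    """True if the query and document share at least one lowercase word."""
    return bool(set(query.lower().split()) & set(document.lower().split()))


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))


# Toy vectors, hand-assigned for illustration only (not real model output).
toy_embeddings = {
    "leaky faucet": [0.9, 0.8],
    "drippy taps": [0.8, 0.9],
    "sunny weather": [-0.7, 0.1],
}

query = "leaky faucet"
print(keyword_match(query, "drippy taps"))  # False: no words in common
for text in ("drippy taps", "sunny weather"):
    print(text, round(cosine(toy_embeddings[query], toy_embeddings[text]), 3))
```

Keyword matching finds nothing for "drippy taps", while the vector comparison ranks it far above the unrelated phrase.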

In [13]:
# RETRIEVAL USING MILVUS.

# Before conducting a search based on a query, you need to load the data into memory.
mc.load()
print("Loaded milvus collection into memory.")

# Embed the question using the same embedding model.
embedded_question = torch.tensor(encoder.encode([question]))
# Normalize embeddings to unit length.
embedded_question = F.normalize(embedded_question, p=2, dim=1)
# Convert the embeddings to list of list of np.float32.
embedded_question = list(map(np.float32, embedded_question))

# Return top k results with AUTOINDEX.
TOP_K = 5

# Run semantic vector search using your query and the vector database.
start_time = time.time()
results = mc.search(
    data=embedded_question, 
    anns_field="vector", 
    # No params for AUTOINDEX
    param={},
    # Access dynamic fields in the boolean expression.
    # expr="",
    output_fields=["h1", "h2", "chunk", "source"], 
    limit=TOP_K,
    consistency_level="Eventually"
    )

elapsed_time = time.time() - start_time
print(f"Milvus search time: {elapsed_time} sec")

# Inspect search result.
print(f"type: {type(results)}, count: {len(results[0])}")


Loaded milvus collection into memory.
Milvus search time: 0.22196269035339355 sec
type: <class 'pymilvus.client.abstract.SearchResult'>, count: 5


## Assemble and inspect the search result

The search result is in the variable `results`, of type `pymilvus.client.abstract.SearchResult`; `results[0]` holds the hits for the first (and here, only) query.

In [14]:
# # TODO - remove this before saving in github.
# for n, hits in enumerate(results):
#     print(f"{n}th query result")
#     for hit in hits:
#         print(hit)

# Assemble the context as a stuffed string.
context = ""
for r in results[0]:
    text = r.entity.chunk
    context += f"{text} "
print(len(context))

2267
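
The cell above concatenates every retrieved chunk. If the downstream model has a limited input size, one option is to cap the context length. The sketch below assumes a character budget (`max_chars`), which is not part of the original notebook, and uses placeholder strings in place of real retrieved chunks.

```python
def build_context(chunks, max_chars):
    """Join retrieved chunks in ranked order, stopping before max_chars is exceeded."""
    parts = []
    used = 0
    for text in chunks:
        # Account for the single space that joins this chunk to the previous one.
        extra = len(text) + (1 if parts else 0)
        if used + extra > max_chars:
            break
        parts.append(text)
        used += extra
    return " ".join(parts)


# Placeholder strings standing in for retrieved chunks, best match first.
retrieved = ["chunk one text", "chunk two text", "x" * 100]
limited_context = build_context(retrieved, max_chars=40)
print(limited_context)       # chunk one text chunk two text
print(len(limited_context))  # 29
```

Because chunks arrive ranked by similarity, stopping early drops the least relevant text first.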


## Use an LLM to Generate a chat response to the user's question using the Retrieved Context.

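Below, we're using an open, very tiny question-answering model in place of a large generative LLM; it extracts its answer as a span of text from the context it is given.  Many demos use OpenAI as the LLM choice instead.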

In [15]:
# BASELINING THE LLM: ASK A QUESTION WITHOUT ANY RETRIEVED CONTEXT.

from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

# Load the Hugging Face extractive question-answering model checkpoint.
llm = "deepset/tinyroberta-squad2"
tokenizer = AutoTokenizer.from_pretrained(llm)

# context cannot be empty so just put random text in it.
QA_input = {
    'question': question,
    'context': 'The quick brown fox jumped over the lazy dog'
}

nlp = pipeline('question-answering', 
               model=llm, 
               tokenizer=tokenizer)

result = nlp(QA_input)
print(f"Question: {question}")
print(f"Answer: {result['answer']}")

# The baseline LLM chat is not very helpful.

Question: what is the default distance metric used in AUTOINDEX?
Answer: lazy dog


In [16]:
# NOW ASK THE SAME LLM THE SAME QUESTION USING THE RETRIEVED CONTEXT.
QA_input = {
    'question': question,
    'context': context,
}

nlp = pipeline('question-answering', 
               model=llm, 
               tokenizer=tokenizer)

result = nlp(QA_input)
print(f"Question: {question}")
print(f"Answer: {result['answer']}")

# That answer looks a little better!

Question: what is the default distance metric used in AUTOINDEX?
Answer: MetricType.L2


In [17]:
# Drop the collection to clean up.
utility.drop_collection(COLLECTION_NAME)

In [18]:
# Props to Sebastian Raschka for this handy watermark.
# !pip install watermark

%load_ext watermark
%watermark -a 'Christy Bergman' -v -p torch,transformers,milvus,pymilvus,langchain --conda

Author: Christy Bergman

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 8.15.0

torch       : 2.0.1
transformers: 4.34.1
milvus      : 2.3.3
pymilvus    : 2.3.3
langchain   : 0.0.322

conda environment: py310

