# ReadtheDocs Retrieval Augmented Generation (RAG) using Milvus Client

In this notebook, we use the Milvus documentation pages to build a chatbot that answers questions about the product.

The chatbot follows the RAG steps: it retrieves relevant chunks of data using semantic vector search, then feeds the question plus the retrieved context as a prompt to an LLM to generate an answer.

<div>
<img src="../../../images/rag_image.png" width="80%"/>
</div>
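
The retrieve-then-prompt flow described above can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and `retrieve` and `build_prompt` are made-up helper names, not part of Milvus or LangChain.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, chunks, top_k=2):
    """Step 1: rank the chunks by similarity to the question and keep the top_k."""
    q = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, context_chunks):
    """Step 2: combine the retrieved context and the question into one prompt."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

toy_chunks = [
    "milvus is a vector database",
    "pandas provides dataframes",
    "a vector index speeds up search in milvus",
]
toy_question = "what is milvus"
toy_prompt = build_prompt(toy_question, retrieve(toy_question, toy_chunks))
print(toy_prompt)  # Step 3 would send this prompt to an LLM.
```

In the rest of the notebook, the embedding model, the vector search, and the answer generation are handled by real libraries instead of these toy functions.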

Let's get started!

In [1]:
# For Colab, install these libraries in this order:
# !pip install milvus pymilvus langchain torch transformers sentence-transformers python-dotenv

# Import common libraries.
import time
import pandas as pd
import numpy as np

## Download Milvus documentation to a local directory.

In [2]:
# # Uncomment to download readthedocs page locally.

# DOCS_PAGE="https://pymilvus.readthedocs.io/en/latest/"
# !echo $DOCS_PAGE

# # Specify encoding to handle non-unicode characters in documentation.
# !wget -r -A.html -P rtdocs --header="Accept-Charset: UTF-8" $DOCS_PAGE

## Start up a local Milvus server.

Code in this notebook uses [Milvus client](https://milvus.io/docs/using_milvusclient.md) with [Milvus lite](https://milvus.io/docs/milvus_lite.md), which runs a local server.  ⛔️ Milvus lite is only meant for demos and local testing.
- pip install milvus pymilvus

💡 **For production purposes**, use a local Milvus docker, Milvus clusters, or fully-managed Milvus on Zilliz Cloud.
- [Local Milvus docker](https://milvus.io/docs/install_standalone-docker.md) requires local docker installed and running.
- [Milvus clusters](https://milvus.io/docs/install_cluster-milvusoperator.md) requires a K8s cluster up and running.
- [Zilliz Cloud free trial](https://cloud.zilliz.com/login): choose a "free" option when you provision.


In [3]:
from milvus import default_server
from pymilvus import (
    connections, utility, 
    MilvusClient,
)

# Cleanup previous data and stop server in case it is still running.
default_server.stop()
default_server.cleanup()

# Start a new milvus-lite local server.
start_time = time.time()
default_server.start()

end_time = time.time()
print(f"Milvus server startup time: {end_time - start_time} sec")
# startup time: 5.6739208698272705

# Add wait to avoid error message from trying to connect.
time.sleep(15)

# Now you can connect with localhost and the given port.
# Port is defined by default_server.listen_port.
connections.connect(host='127.0.0.1', 
                  port=default_server.listen_port,
                  show_startup_banner=True)

# Check if the server is ready.
print(utility.get_server_version())

Milvus server startup time: 8.679949045181274 sec
v2.2-testing-20230824-68-ga34a9d606-lite


## Load the Embedding Model checkpoint and use it to create vector embeddings
**Embedding model:**  We will use the open-source [sentence transformers](https://www.sbert.net/docs/pretrained_models.html) hosted on HuggingFace to encode the documentation text.  The resulting embeddings will then be stored in the Milvus database.

💡 Note:  To keep your tokens private, best practice is to use an environment variable.   <br>
In Jupyter, you need a `.env` file (in the same directory as the notebook) containing lines like this:
- VARIABLE_NAME=value
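
To make the `.env` format concrete, here is a simplified sketch of how `KEY=value` lines map to variables. The notebook itself relies on `python-dotenv` for this; `parse_env_lines` is a hypothetical helper written only for illustration, and the token value is a placeholder.

```python
def parse_env_lines(lines):
    """Parse KEY=value lines, skipping blank lines and # comments (simplified)."""
    env = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env

example_lines = [
    "# Tokens stay out of the notebook itself.",
    "HUGGINGFACEHUB_API_TOKEN=placeholder-token-value",
]
example_env = parse_env_lines(example_lines)
print(sorted(example_env))
```

Keeping the `.env` file out of version control is what actually keeps the token private.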

In [4]:
# Import torch.
import torch
from torch.nn import functional as F
from sentence_transformers import SentenceTransformer

# Initialize torch settings
torch.backends.cudnn.deterministic = True
RANDOM_SEED = 415
torch.manual_seed(RANDOM_SEED)
DEVICE = torch.device('cuda:3' if torch.cuda.is_available() else 'cpu')
print(f"device: {DEVICE}")

import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
from huggingface_hub import login

# Login to huggingface_hub
hub_token = os.getenv("HUGGINGFACEHUB_API_TOKEN")
login(token=hub_token)

# Load the model from huggingface model hub.
model_name = "BAAI/bge-base-en-v1.5"
retriever = SentenceTransformer(model_name, device=DEVICE)
print(type(retriever))
print(retriever)

# Get the model parameters and save for later.
MAX_SEQ_LENGTH = retriever.get_max_seq_length() 
HF_EOS_TOKEN_LENGTH = 1
EMBEDDING_LENGTH = retriever.get_sentence_embedding_dimension()

# Inspect model parameters.
print(f"model_name: {model_name}")
print(f"EMBEDDING_LENGTH: {EMBEDDING_LENGTH}")
print(f"MAX_SEQ_LENGTH: {MAX_SEQ_LENGTH}")

device: cpu
Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /Users/christybergman/.cache/huggingface/token
Login successful
<class 'sentence_transformers.SentenceTransformer.SentenceTransformer'>
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
model_name: BAAI/bge-base-en-v1.5
EMBEDDING_LENGTH: 768
MAX_SEQ_LENGTH: 512


In [5]:
# Wrap the HuggingFace embedding model as LangChain embeddings.
from langchain.embeddings import HuggingFaceEmbeddings

model_kwargs = {"device": DEVICE}
encode_kwargs = {'normalize_embeddings': True}
lc_retriever = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
type(lc_retriever)

langchain.embeddings.huggingface.HuggingFaceEmbeddings

In [6]:
## Read docs into LangChain
#!pip install langchain 
from langchain.document_loaders import ReadTheDocsLoader

loader = ReadTheDocsLoader("rtdocs/pymilvus.readthedocs.io/en/latest/", features="html.parser")
docs = loader.load()

num_documents = len(docs)
print(f"loaded {num_documents} documents")
# print(f"type: {type(docs)}, len: {len(docs)}, type: {type(docs[0])}")
# docs[0]

loaded 15 documents


## Chunking

Before embedding, you need to decide on a chunking strategy, chunk size, and chunk overlap.  In this demo, I will use:
- **Strategy** = Naive for now.  TODO use markdown header hierarchies.
- **Chunk size** = Use the embedding model's parameter `MAX_SEQ_LENGTH`.  Note that the splitter below measures length with `len`, which counts characters rather than tokens.
- **Overlap** = Rule-of-thumb 10-15%
- **Function** = Langchain's convenient `RecursiveCharacterTextSplitter` to split up long documents recursively.
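
To illustrate how chunk size and overlap interact, here is a minimal fixed-size sliding-window splitter. It is a simplified sketch, not what `RecursiveCharacterTextSplitter` does internally (that splitter also tries to break on separators such as newlines); `fixed_size_chunks` is a hypothetical helper name.

```python
def fixed_size_chunks(text, chunk_size, overlap_fraction=0.10):
    """Split text into character windows that overlap by a fraction of chunk_size."""
    overlap = int(round(chunk_size * overlap_fraction))
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    # Stop early enough that the last window is not just a repeat of the overlap.
    starts = range(0, max(len(text) - overlap, 1), step)
    return [text[i:i + chunk_size] for i in starts if text[i:i + chunk_size]]

sample_text = "abcdefghijklmnopqrstuvwxyz"
pieces = fixed_size_chunks(sample_text, chunk_size=10)
print(pieces)
```

With a chunk size of 10 and 10% overlap, each window starts 9 characters after the previous one, so neighbouring chunks share one character.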


In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def recursive_splitter_wrapper(text, chunk_size):

    # Default chunk overlap is 10% of chunk_size, cast to int as the splitter expects.
    chunk_overlap = int(np.round(chunk_size * 0.10))

    # Use langchain's convenient recursive chunking method.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
    )
    
    chunks = text_splitter.split_text(text)
    return [chunk for chunk in chunks if chunk]


In [8]:
# Use the embedding model parameters to calculate chunk_size and overlap.
chunk_size = MAX_SEQ_LENGTH - HF_EOS_TOKEN_LENGTH
# Default chunk overlap is 10% of chunk_size, cast to int as the splitter expects.
chunk_overlap = int(np.round(chunk_size * 0.10))

# Use recursive splitter to chunk text.
start_time = time.time()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = chunk_size,
    chunk_overlap = chunk_overlap,
    length_function = len,
)

chunks = text_splitter.create_documents(
    [doc.page_content for doc in docs], 
    metadatas=[doc.metadata for doc in docs])

end_time = time.time()
print(f"chunking time: {end_time - start_time}")
# print(f"type: {type(chunks)}, len: {len(chunks)}, type: {type(chunks[0])}")
print(f"type: list of {type(chunks[0])}, len: {len(chunks)}") 

print()
print("Looking at a sample chunk...")
print(chunks[0].metadata)
print(chunks[0].page_content[:100])


chunking time: 0.002679109573364258
type: list of <class 'langchain.schema.document.Document'>, len: 197

Looking at a sample chunk...
{'source': 'rtdocs/pymilvus.readthedocs.io/en/latest/install.html'}
Installation¶
Installing via pip¶
PyMilvus is in the Python Package Index.
PyMilvus only support pyt


In [9]:
# Clean up the metadata urls
for doc in chunks:
    new_url = doc.metadata["source"]
    new_url = new_url.replace("rtdocs", "https:/")
    doc.metadata.update({"source": new_url})

print(chunks[0].metadata)
print(chunks[0].page_content[:500])

{'source': 'https://pymilvus.readthedocs.io/en/latest/install.html'}
Installation¶
Installing via pip¶
PyMilvus is in the Python Package Index.
PyMilvus only support python3(>= 3.6), usually, it’s ok to install PyMilvus like below.
$ python3 -m pip install pymilvus
Installing in a virtual environment¶
It’s recommended to use PyMilvus in a virtual environment, using virtual environment allows you to avoid
installing Python packages globally which could break system tools or other projects.


## Insert data into Milvus

The code below uses the [Langchain Milvus](https://api.python.langchain.com/en/latest/_modules/langchain/vectorstores/milvus.html#Milvus) adapter.  
- Default index is AUTOINDEX. <br>
💡 AUTOINDEX works on both Milvus and Zilliz Cloud (where it is the fastest!)
- collection_name is "LangChainCollection".
- Schema is 
  - pk (str): Name of the primary key field.
  - text (str): Name of the text field.
  - vector (str): Name of the vector field. 


In [10]:
# Insert a batch of data into the Milvus collection.
from langchain.vectorstores import Milvus
MILVUS_PORT = 19530
MILVUS_HOST = "127.0.0.1"

print("Start inserting entities")
start_time = time.time()

vector_store = Milvus.from_documents(
    chunks,
    embedding=lc_retriever,
    connection_args={"host": MILVUS_HOST, 
                     "port": MILVUS_PORT},
)

end_time = time.time()
print(f"Langchain Milvus insert time for {len(chunks)} vectors: {end_time - start_time} seconds")
print(f"type: {type(vector_store)}")


Start inserting entities
Langchain Milvus insert time for 197 vectors: 12.399992942810059 seconds
type: <class 'langchain.vectorstores.milvus.Milvus'>


## Run a Semantic Search

Now we can search all the documentation embeddings to find the `TOP_K` documentation chunks with the closest embeddings to a user's query.
- In this example, we'll ask about AUTOINDEX.

💡 Always use the same embedding model for the documents and the queries, so that all vectors live in the same embedding space.

In [11]:
# .load() not needed when using no-schema Milvus client.

# # Before conducting a search based on a query, you need to load the data into memory.
# mc.load()
# print("Loaded milvus collection into memory.")

## Ask a question about your data

So far in this demo notebook: 
1. Your custom data has been mapped into a vector embedding space
2. Those vector embeddings have been saved into a vector database

Next, you can ask a question about your custom data!

💡 In LLM lingo:
> **Query** is the generic term for user questions.  
A query is a list of multiple individual questions, up to maybe 1000 different questions!

> **Question** usually refers to a single user question.  
In our example below, the user question is "What is the default AUTOINDEX in Milvus Client?"

In [54]:
# Define a sample question about your data.
question = 'What is the default AUTOINDEX in Milvus Client?'
query = [question]

# Inspect the length of the query.
QUERY_LENGTH = len(query[0])
print(f"query length: {QUERY_LENGTH}")

query length: 47


**Embed the question using the same embedding model you used earlier**

In order for vector search to work, the question itself should be embedded with the same model used to create the collection you want to search.
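
As a rough sketch of what the search does with those embeddings: once the question and the chunks are vectors of the same dimension, the closest chunks can be ranked by cosine similarity. The vectors below are hand-made three-dimensional placeholders rather than real model output, and `normalize` and `top_k` are hypothetical helper names.

```python
import math

def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, cosine score) for the k document vectors closest to the query."""
    if any(len(v) != len(query_vec) for v in doc_vecs):
        # Vectors from a different model usually have a different dimension.
        raise ValueError("query and document vectors must have the same dimension")
    q = normalize(query_vec)
    scores = [sum(a * b for a, b in zip(q, normalize(v))) for v in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]

toy_doc_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
toy_query_vec = [1.0, 0.0, 0.0]
hits = top_k(toy_query_vec, toy_doc_vecs)
print([index for index, _ in hits])
```

Even when two models share a dimension, their embedding spaces are unrelated, so the scores are only meaningful when one model produced every vector.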

## Execute a vector search

Search Milvus using [PyMilvus API](https://milvus.io/docs/search.md).

💡 By their nature, vector searches are "semantic" searches.  For example, if you were to search for "leaky faucet": 
> **Traditional keyword search** - either or both of the words "leaky" and "faucet" would have to match some text in order to return a web page or link text to the document.

> **Semantic search** - results containing the words "drippy" and "taps" would be returned as well, because these words mean the same thing even though they are different words.
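
The difference can be shown with a toy example. Here a hand-written concept lookup stands in for what an embedding model learns from data; `CONCEPTS`, `keyword_match`, and `concept_match` are hypothetical names for illustration only, and real semantic search compares vectors rather than looking words up in a table.

```python
# Hand-written stand-in for learned meaning: words that share a concept.
CONCEPTS = {
    "leaky": "leak", "drippy": "leak",
    "faucet": "tap", "taps": "tap", "tap": "tap",
}

def keyword_match(query, doc):
    """True only if the query and the document share at least one exact word."""
    return bool(set(query.lower().split()) & set(doc.lower().split()))

def concept_match(query, doc):
    """True if the query and the document share at least one underlying concept."""
    def to_concepts(text):
        return {CONCEPTS[w] for w in text.lower().split() if w in CONCEPTS}
    return bool(to_concepts(query) & to_concepts(doc))

toy_query = "leaky faucet"
toy_doc = "how to fix drippy taps"
print(keyword_match(toy_query, toy_doc), concept_match(toy_query, toy_doc))
```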

In [55]:
# RETRIEVAL USING MILVUS

start_time = time.time()
# Default search.
# docs = vector_store.similarity_search(question, k=7)
# MMR search.
# docs = vector_store.max_marginal_relevance_search(question, k=7, fetch_k=100)
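# Search with metadata.  TODO: Add better filtering query!
# NOTE: the output below lists several different sources, so the `filter`
# argument does not appear to restrict the results here; the next section
# filters on METADATA_URL manually.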
METADATA_URL = "https://pymilvus.readthedocs.io/en/latest/_modules/milvus/client/stub.html"
docs = vector_store.similarity_search(
    question,
    k=100,
    filter={"source": METADATA_URL},
    verbose=True,
)
end_time = time.time()
print(f"Milvus query time: {end_time - start_time}")

# View the retrieval result.
print(f"source: {docs[0].metadata}")
print([doc.page_content for doc in docs])

# default unique sources
# https://pymilvus.readthedocs.io/en/latest/_modules/milvus/client/stub.html
# https://pymilvus.readthedocs.io/en/latest/param.html
# https://pymilvus.readthedocs.io/en/latest/genindex.html
# https://pymilvus.readthedocs.io/en/latest/tutorial.html

# MMR Unique Sources
# https://pymilvus.readthedocs.io/en/latest/_modules/milvus/client/stub.html
# https://pymilvus.readthedocs.io/en/latest/param.html

for d in docs:
    print(d.metadata)


Milvus query time: 0.08918285369873047
source: {'source': 'https://pymilvus.readthedocs.io/en/latest/genindex.html'}
{'source': 'https://pymilvus.readthedocs.io/en/latest/genindex.html'}
{'source': 'https://pymilvus.readthedocs.io/en/latest/_modules/milvus/client/stub.html'}
{'source': 'https://pymilvus.readthedocs.io/en/latest/param.html'}
{'source': 'https://pymilvus.readthedocs.io/en/latest/genindex.html'}
{'source': 'https://pymilvus.readthedocs.io/en/latest/_modules/milvus/client/types.html'}
{'source': 'https://pymilvus.readthedocs.io/en/latest/tutorial.html'}
{'source': 'https://pymilvus.readthedocs.io/en/latest/_modules/milvus/client/stub.html'}
{'source': 'https://pymilvus.readthedocs.io/en/latest/tutorial.html'}
{'source': 'https://pymilvus.readthedocs.io/en/latest/tutorial.html'}
{'source': 'https://pymilvus.readthedocs.io/en/latest/tutorial.html'}
{'source': 'https://pymilvus.readthedocs.io/en/latest/genindex.html'}
{'source': 'https://pymilvus.readthedocs.io/en/latest/para

## Assemble and inspect the search result

The search result is in the list variable `docs`; each element is a LangChain `Document` with `page_content` and `metadata` attributes.

In [58]:
print(f"Count raw retrievals: {len(docs)}")

unique_sources = []
unique_texts = []
for doc in docs:
    if doc.metadata['source'] == METADATA_URL:
        if doc.page_content not in unique_texts:
            unique_texts.append(doc.page_content)
            unique_sources.append(doc.metadata['source'])
print(f"Count unique texts: {len(unique_texts)}")
# [ print(text) for text in unique_texts ]

# Assemble all the results in a zipped list.
formatted_context = list(zip(unique_sources, unique_texts))

# Assemble the context as a stuffed string.
context = ""
for source, text in formatted_context:
    context += f"{text} "
print(len(context))

Count raw retrievals: 100
Count unique texts: 37
17084


## Use an LLM to Generate a chat response to the user's question using the Retrieved Context.

In [39]:
# BASELINING THE LLM: ASK A QUESTION WITHOUT ANY RETRIEVED CONTEXT.

from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

# Load the Hugging Face extractive question-answering checkpoint.
llm = "deepset/tinyroberta-squad2"
tokenizer = AutoTokenizer.from_pretrained(llm)

# context cannot be empty so just put random text in it.
QA_input = {
    'question': question,
    'context': 'The quick brown fox jumped over the lazy dog'
}

nlp = pipeline('question-answering', 
               model=llm, 
               tokenizer=tokenizer)

result = nlp(QA_input)
print(f"Question: {question}")
print(f"Answer: {result['answer']}")

# The baseline LLM chat is not very helpful.

Question: What is the default AUTOINDEX index type in Milvus Client?
Answer: lazy dog


In [57]:
# NOW ASK THE SAME LLM THE SAME QUESTION USING THE RETRIEVED CONTEXT.
QA_input = {
    'question': question,
    'context': context,
}

nlp = pipeline('question-answering', 
               model=llm, 
               tokenizer=tokenizer)

result = nlp(QA_input)
print(f"Question: {question}")
print(f"Answer: {result['answer']}")

# That answer looks a little better!

Question: What is the default AUTOINDEX in Milvus Client?
Answer: MetricType.L2


In [None]:
# Shut down and clean up the Milvus server.
default_server.stop()
default_server.cleanup()

In [None]:
# Props to Sebastian Raschka for this handy watermark.
# !pip install watermark

%load_ext watermark
%watermark -a 'Christy Bergman' -v -p torch,transformers,milvus,pymilvus,langchain --conda