# ReadtheDocs Retrieval Augmented Generation (RAG) using Zilliz Free Tier

In this notebook, we are going to use Milvus documentation pages to create a chatbot about our product.  The chatbot is going to follow RAG steps to retrieve chunks of data using Semantic Vector Search, then the Question + Context will be fed as a Prompt to a LLM to generate an answer.

Many RAG demos use OpenAI for the Embedding Model and ChatGPT for the Generative AI model.  **In this notebook, we will demo a fully open source RAG stack.**

Using open-source Q&A with retrieval saves money since we make free calls to our own data almost all the time - retrieval, evaluation, and development iterations.  We only make a paid call to OpenAI once for the final chat generation step. 

<div>
<img src="../../images/rag_image.png" width="80%"/>
</div>

Let's get started!

In [1]:
# For colab install these libraries in this order:
# !pip install pymilvus, langchain, torch, transformers, python-dotenv

# Import common libraries.
import sys, os, time, pprint
import numpy as np

# Import custom functions for splitting and search.
sys.path.append("..")  # Adds higher directory to python modules path.
import milvus_utilities as _utils

## Download Milvus documentation to a local directory.

The data we’ll use is our own product documentation web pages.  ReadTheDocs is an open-source free software documentation hosting platform, where documentation is written with the Sphinx document generator.

The code block below downloads the web pages into a local directory called `rtdocs`.  

I've already uploaded the `rtdocs` data folder to github, so you should see it if you cloned my repo.

In [2]:
# # Uncomment to download readthedocs pages locally.

# DOCS_PAGE="https://pymilvus.readthedocs.io/en/latest/"
# !echo $DOCS_PAGE

# # Specify encoding to handle non-unicode characters in documentation.
# !wget -r -A.html -P rtdocs --header="Accept-Charset: UTF-8" $DOCS_PAGE

## Start up a Zilliz free tier cluster.

Code in this notebook uses fully-managed Milvus on [Ziliz Cloud free trial](https://cloud.zilliz.com/login).  
  1. Choose the default "Starter" option when you provision > Create collection > Give it a name > Create cluster and collection.  
  2. On the Cluster main page, copy your `API Key` and store it locally in a .env variable.  See note below how to do that.
  3. Also on the Cluster main page, copy the `Public Endpoint URI`.

💡 Note: To keep your tokens private, best practice is to use an **env variable**.  See [how to save api key in env variable](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety). <br>

In Jupyter, you also need a .env file (in same dir as notebooks) containing lines like this:
- VARIABLE_NAME=value


In [3]:
# STEP 1. CONNECT TO MILVUS

# !pip install pymilvus #python sdk for milvus
from pymilvus import connections, utility

# Jupyter notebooks:
# from dotenv import load_dotenv
# load_dotenv()
# TOKEN = os.getenv("ZILLIZ_API_KEY")

# Usual way:
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
TOKEN = os.environ["ZILLIZ_API_KEY"]

# Connect to Zilliz cloud using endpoint URI and API key TOKEN.
# TODO change this.
CLUSTER_ENDPOINT="https://in03-xxxx.api.gcp-us-west1.zillizcloud.com:443"
CLUSTER_ENDPOINT="https://in03-48a5b11fae525c9.api.gcp-us-west1.zillizcloud.com:443"
connections.connect(
  alias='default',
  #  Public endpoint obtained from Zilliz Cloud
  uri=CLUSTER_ENDPOINT,
  # API key or a colon-separated cluster username and password
  token=TOKEN,
)

# Check if the server is ready and get colleciton name.
print(f"Type of server: {utility.get_server_version()}")

Type of server: zilliz_cloud


## Load the Embedding Model checkpoint and use it to create vector embeddings
**Embedding model:**  We will use the open-source [sentence transformers](https://www.sbert.net/docs/pretrained_models.html) available on HuggingFace to encode the documentation text.  We will download the model from HuggingFace and run it locally. 

Two model parameters of note below:
1. EMBEDDING_LENGTH refers to the dimensionality or length of the embedding vector. In this case, the embeddings generated for EACH token in the input text will have the SAME length = 1024. This size of embedding is often associated with BERT-based models, where the embeddings are used for downstream tasks such as classification, question answering, or text generation. <br><br>
2. MAX_SEQ_LENGTH is the maximum length the encoder model can handle for input sequences. In this case, if sequences longer than 512 tokens are given to the model, everything longer will be (silently!) chopped off.  This is the reason why a chunking strategy is needed to segment input texts into chunks with lengths that will fit in the model's input.

In [4]:
# STEP 2. DOWNLOAD AN OPEN SOURCE EMBEDDING MODEL.

# Import torch.
import torch
from torch.nn import functional as F
from sentence_transformers import SentenceTransformer

# Initialize torch settings
torch.backends.cudnn.deterministic = True
DEVICE = torch.device('cuda:3' if torch.cuda.is_available() else 'cpu')
print(f"device: {DEVICE}")

# Load the model from huggingface model hub.
# python -m pip install -U angle-emb
model_name = "WhereIsAI/UAE-Large-V1"
encoder = SentenceTransformer(model_name, device=DEVICE)
print(type(encoder))
print(encoder)

# Get the model parameters and save for later.
EMBEDDING_LENGTH = encoder.get_sentence_embedding_dimension()
MAX_SEQ_LENGTH_IN_TOKENS = encoder.get_max_seq_length() 
# # Assume tokens are 3 characters long.
# MAX_SEQ_LENGTH = MAX_SEQ_LENGTH_IN_TOKENS * 3
# HF_EOS_TOKEN_LENGTH = 1 * 3
# Test with 512 sequence length.
MAX_SEQ_LENGTH = MAX_SEQ_LENGTH_IN_TOKENS
HF_EOS_TOKEN_LENGTH = 1

# Inspect model parameters.
print(f"model_name: {model_name}")
print(f"EMBEDDING_LENGTH: {EMBEDDING_LENGTH}")
print(f"MAX_SEQ_LENGTH: {MAX_SEQ_LENGTH}")

device: cpu


No sentence-transformers model found with name /Users/christybergman/.cache/torch/sentence_transformers/WhereIsAI_UAE-Large-V1. Creating a new one with MEAN pooling.


<class 'sentence_transformers.SentenceTransformer.SentenceTransformer'>
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
model_name: WhereIsAI/UAE-Large-V1
EMBEDDING_LENGTH: 1024
MAX_SEQ_LENGTH: 512


## Create a Milvus collection

You can think of a collection in Milvus like a "table" in SQL databases.  The **collection** will contain the 
- **Schema** (or [no-schema Milvus client](https://milvus.io/docs/using_milvusclient.md)).  
💡 You'll need the vector `EMBEDDING_LENGTH` parameter from your embedding model.
Typical values are:
   - 768 for sbert embedding models
   - 1536 for ada-002 OpenAI embedding models
- **Vector index** for efficient vector search
- **Vector distance metric** for measuring nearest neighbor vectors
- **Consistency level**
In Milvus, transactional consistency is possible; however, according to the [CAP theorem](https://en.wikipedia.org/wiki/CAP_theorem), some latency must be sacrificed. 💡 Searching movie reviews is not mission-critical, so [`eventually`](https://milvus.io/docs/consistency.md) consistent is fine here.

## Add a Vector Index

The vector index determines the vector **search algorithm** used to find the closest vectors in your data to the query a user submits.  

Most vector indexes use different sets of parameters depending on whether the database is:
- **inserting vectors** (creation mode) - vs - 
- **searching vectors** (search mode) 

Scroll down the [docs page](https://milvus.io/docs/index.md) to see a table listing different vector indexes available on Milvus.  For example:
- FLAT - deterministic exhaustive search
- IVF_FLAT or IVF_SQ8 - Hash index (stochastic approximate search)
- HNSW - Graph index (stochastic approximate search)
- AUTOINDEX - Automatically determined based on OSS vs [Zilliz cloud](https://docs.zilliz.com/docs/autoindex-explained), type of GPU, size of data.

Besides a search algorithm, we also need to specify a **distance metric**, that is, a definition of what is considered "close" in vector space.  In the cell below, the [`HNSW`](https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md) search index is chosen.  Its possible distance metrics are one of:
- L2 - L2-norm
- IP - Dot-product
- COSINE - Angular distance

💡 Most use cases work better with normalized embeddings, in which case L2 is useless (every vector has length=1) and IP and COSINE are the same.  Only choose L2 if you plan to keep your embeddings unnormalized.

In [5]:
# STEP 3. CREATE A NO-SCHEMA MILVUS COLLECTION AND DEFINE THE DATABASE INDEX.

from pymilvus import MilvusClient

# Set the Milvus collection name.
COLLECTION_NAME = "MilvusDocs"

# Add custom HNSW search index to the collection.
# M = max number graph connections per layer. Large M = denser graph.
# Choice of M: 4~64, larger M for larger data and larger embedding lengths.
M = 16
# efConstruction = num_candidate_nearest_neighbors per layer. 
# Use Rule of thumb: int. 8~512, efConstruction = M * 2.
efConstruction = M * 2
# Create the search index for local Milvus server.
INDEX_PARAMS = dict({
    'M': M,               
    "efConstruction": efConstruction })
index_params = {
    "index_type": "HNSW", 
    "metric_type": "COSINE", 
    "params": INDEX_PARAMS
    }

# Use no-schema Milvus client uses flexible json key:value format.
# https://milvus.io/docs/using_milvusclient.md
mc = MilvusClient(
    uri=CLUSTER_ENDPOINT,
    # API key or a colon-separated cluster username and password
    token=TOKEN)

# Check if collection already exists, if so drop it.
has = utility.has_collection(COLLECTION_NAME)
if has:
    drop_result = utility.drop_collection(COLLECTION_NAME)
    print(f"Successfully dropped collection: `{COLLECTION_NAME}`")

# Create the collection.
mc.create_collection(COLLECTION_NAME, 
                     EMBEDDING_LENGTH,
                     consistency_level="Eventually", 
                     auto_id=True,  
                     overwrite=True,
                     # skip setting params below, if using AUTOINDEX
                     params=index_params
                    )

print(f"Successfully created collection: `{COLLECTION_NAME}`")
print(mc.describe_collection(COLLECTION_NAME))

Successfully created collection: `MilvusDocs`
{'collection_name': 'MilvusDocs', 'auto_id': True, 'num_shards': 1, 'description': '', 'fields': [{'field_id': 100, 'name': 'id', 'description': '', 'type': 5, 'params': {}, 'element_type': 0, 'auto_id': True, 'is_primary': True}, {'field_id': 101, 'name': 'vector', 'description': '', 'type': 101, 'params': {'dim': 1024}, 'element_type': 0}], 'aliases': [], 'collection_id': 446268198625172175, 'consistency_level': 3, 'properties': {}, 'num_partitions': 1, 'enable_dynamic_field': True}


## Chunking

Before embedding, it is necessary to decide your chunk strategy, chunk size, and chunk overlap.  In this demo, I will use:
- **Strategy** = Use markdown header hierarchies.  Keep markdown sections together unless they are too long.
- **Chunk size** = Use the embedding model's parameter `MAX_SEQ_LENGTH`
- **Overlap** = Rule-of-thumb 10-15%
- **Function** = 
  - Langchain's `HTMLHeaderTextSplitter` to split markdown sections.
  - Langchain's `RecursiveCharacterTextSplitter` to split up long reviews recursively.


Notice below, each chunk is grounded with the document source page.  <br>
In addition, header titles are kept together with the chunk of markdown text.

In [6]:
# STEP 4. PREPARE DATA: CHUNK AND EMBED

## Read docs into LangChain using v 0.0.322
#!pip install langchain beautifulsoup4
from langchain.document_loaders import ReadTheDocsLoader

loader = ReadTheDocsLoader("../RAG/rtdocs/pymilvus.readthedocs.io/en/latest/"
                           , encoding="utf-8"
                           , features="html.parser")
docs = loader.load()

num_documents = len(docs)
print(f"loaded {num_documents} documents")

# Langchain v 0.0.354
# from langchain_community.document_loaders.readthedocs import ReadTheDocsLoader

# # Create an instance of ReadTheDocsLoader
# loader = ReadTheDocsLoader("../RAG/rtdocs/pymilvus.readthedocs.io/en/latest/", 
#                            encoding="utf-8")

# # Load the documents
# docs = loader.load()

# num_documents = len(docs)
# print(f"loaded {num_documents} documents")

loaded 8 documents


In [7]:
from langchain.text_splitter import HTMLHeaderTextSplitter, RecursiveCharacterTextSplitter
from bs4 import BeautifulSoup

# Define the headers to split on for the HTMLHeaderTextSplitter
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]
# Create an instance of the HTMLHeaderTextSplitter
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# Use the embedding model parameters.
chunk_size = MAX_SEQ_LENGTH - HF_EOS_TOKEN_LENGTH
chunk_overlap = np.round(chunk_size * 0.10, 0)
print(f"chunk_size: {chunk_size}, chunk_overlap: {chunk_overlap}")

# Create an instance of the RecursiveCharacterTextSplitter
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size = chunk_size,
    chunk_overlap = chunk_overlap,
    length_function = len,
)

# Split the HTML text using the HTMLHeaderTextSplitter.
start_time = time.time()
html_header_splits = []
for doc in docs:
    soup = BeautifulSoup(doc.page_content, 'html.parser')
    splits = html_splitter.split_text(str(soup))
    for split in splits:
        # Add the source URL and header values to the metadata
        metadata = {}
        new_text = split.page_content
        for header_name, metadata_header_name in headers_to_split_on:
            header_value = new_text.split("¶ ")[0].strip()
            metadata[header_name] = header_value
            try:
                new_text = new_text.split("¶ ")[1].strip()
            except:
                break
        split.metadata = {
            **metadata,
            "source": doc.metadata["source"]
        }
        # Add the header to the text
        split.page_content = split.page_content
    html_header_splits.extend(splits)

# Split the documents further into smaller, recursive chunks.
chunks = child_splitter.split_documents(html_header_splits)

end_time = time.time()
print(f"chunking time: {end_time - start_time}")
print(f"docs: {len(docs)}, split into: {len(html_header_splits)}")
print(f"split into chunks: {len(chunks)}, type: list of {type(chunks[0])}") 

# Inspect a chunk.
print()
print("Looking at a sample chunk...")
print(chunks[0].page_content[:100])
print(chunks[0].metadata)

# # TODO - Uncomment to print child splits with their associated header metadata.
# print()
# for child in chunks:
#     print(f"Content: {child.page_content}")
#     print(f"Metadata: {child.metadata}")
#     print()

chunk_size: 511, chunk_overlap: 51.0
chunking time: 0.014161109924316406
docs: 8, split into: 8
split into chunks: 156, type: list of <class 'langchain.schema.document.Document'>

Looking at a sample chunk...
Installation¶ Installing via pip¶ PyMilvus is in the Python Package Index. PyMilvus only support pyt
{'h1': 'Installation', 'h2': 'Installing via pip', 'source': '../RAG/rtdocs/pymilvus.readthedocs.io/en/latest/install.html'}


In [8]:
# Clean up the metadata urls
for doc in chunks:
    new_url = doc.metadata["source"]
    new_url = new_url.replace("../RAG/rtdocs", "https:/")
    doc.metadata.update({"source": new_url})

print(chunks[0].page_content[:100])
print(chunks[0].metadata)

Installation¶ Installing via pip¶ PyMilvus is in the Python Package Index. PyMilvus only support pyt
{'h1': 'Installation', 'h2': 'Installing via pip', 'source': 'https://pymilvus.readthedocs.io/en/latest/install.html'}


## Insert data into Milvus

For each original text chunk, we'll write the quadruplet (`vector, text, source, h1, h2`) into the database.

<div>
<img src="../../images/db_insert.png" width="80%"/>
</div>

**The Milvus Client wrapper can only handle loading data from a list of dictionaries.**

Otherwise, in general, Milvus supports loading data from:
- pandas dataframes 
- list of dictionaries

Below, we use the embedding model provided by HuggingFace, download its checkpoint, and run it locally as the encoder.  

In [9]:
# STEP 5. INSERT CHUNKS AND EMBEDDINGS IN ZILLIZ.

# Convert chunks to a list of dictionaries.
chunk_list = []
for chunk in chunks:

    # Generate embeddings using encoder from HuggingFace.
    embeddings = torch.tensor(encoder.encode([chunk.page_content]))
    embeddings = F.normalize(embeddings, p=2, dim=1)
    converted_values = list(map(np.float32, embeddings))[0]
    
    # Only use h1, h2. Truncate the metadata in case too long.
    try:
        h2 = chunk.metadata['h2'][:50]
    except:
        h2 = ""
    # Assemble embedding vector, original text chunk, metadata.
    chunk_dict = {
        'vector': converted_values,
        'chunk': chunk.page_content,
        'source': chunk.metadata['source'],
        'h1': chunk.metadata['h1'][:50],
        'h2': h2,
    }
    chunk_list.append(chunk_dict)

# Insert data into the Milvus collection.
print("Start inserting entities")
start_time = time.time()
insert_result = mc.insert(
    COLLECTION_NAME,
    data=chunk_list,
    progress_bar=True)
end_time = time.time()
print(f"Milvus Client insert time for {len(chunk_list)} vectors: {end_time - start_time} seconds")

# After final entity is inserted, call flush to stop growing segments left in memory.
mc.flush(COLLECTION_NAME)

# Milvus Client insert time for 156 vectors: 1.283660888671875 seconds

Start inserting entities


100%|██████████| 1/1 [00:01<00:00,  1.65s/it]


Milvus Client insert time for 156 vectors: 1.6547510623931885 seconds


## Ask a question about your data

So far in this demo notebook: 
1. Your custom data has been mapped into a vector embedding space
2. Those vector embeddings have been saved into a vector database

Next, you can ask a question about your custom data!

💡 In LLM vocabulary:
> **Query** is the generic term for user questions.  
A query is a list of multiple individual questions, up to maybe 1000 different questions!

> **Question** usually refers to a single user question.  
In our example below, the user question is "What is AUTOINDEX in Milvus Client?"

> **Semantic Search** = very fast search of the entire knowledge base to find the `TOP_K` documentation chunks with the closest embeddings to the user's query.

💡 The same model should always be used for consistency for all the embeddings data and the query.

In [10]:
# Read questions and ground truth answers into a pandas dataframe.
import pandas as pd

# Read ground truth answers from file.
eval_df = pd.read_csv("../../../christy_coding_scratch/data/milvus_ground_truth.csv", 
                      header=0, skip_blank_lines=True)
display(eval_df.head())

# Get all the questions.
query = eval_df.Question
print(len(query))
print(f"query = {query}")

# Get all the truth answers.
truth_answers = eval_df.ground_truth_answer
print(len(truth_answers))
print(f"truth_answers = {truth_answers}")

# Get all the truth uris.
truth_uris = eval_df.Uri
print(len(truth_uris))
print(f"truth_uris = {truth_uris}")

Unnamed: 0,Question,ground_truth_answer,Uri,retrieval_chunk_text,H1,H2,assistant_answer,Score,Reason
0,What do the parameters for HNSW mean?\n,- M: maximum degree of nodes in a layer of the...,https://pymilvus.readthedocs.io/en/latest/para...,"performance, HNSW limits the maximum degree of...",Index,Milvus support to create index to accelerate v...,,,
1,What are HNSW good default parameters when dat...,"M=16, efConstruction=32, ef=32",https://pymilvus.readthedocs.io/en/latest/para...,,,,,,
2,what is the default distance metric used in AU...,"Trick answer: IP inner product, not yet updat...",https://pymilvus.readthedocs.io/en/latest/tuto...,The attributes of collection can be extracted ...,,,,,
3,How did New York City get its name?,"In the 1600’s, the Dutch planted a trading pos...",https://en.wikipedia.org/wiki/New_York_City,Etymology\nSee also: Nicknames of New York Cit...,,,,,


4
query = 0              What do the parameters for HNSW mean?\n
1    What are HNSW good default parameters when dat...
2    what is the default distance metric used in AU...
3                  How did New York City get its name?
Name: Question, dtype: object
4
truth_answers = 0    - M: maximum degree of nodes in a layer of the...
1                       M=16, efConstruction=32, ef=32
2    Trick answer:  IP inner product, not yet updat...
3    In the 1600’s, the Dutch planted a trading pos...
Name: ground_truth_answer, dtype: object
4
truth_uris = 0    https://pymilvus.readthedocs.io/en/latest/para...
1    https://pymilvus.readthedocs.io/en/latest/para...
2    https://pymilvus.readthedocs.io/en/latest/tuto...
3          https://en.wikipedia.org/wiki/New_York_City
Name: Uri, dtype: object


In [11]:
# Choose a question, answer, uri, and chunk.
QUESTION_NUMBER = 0
SAMPLE_QUESTION = query[QUESTION_NUMBER]
print(f"question = {SAMPLE_QUESTION}")

truth_answer = truth_answers[QUESTION_NUMBER]
truth_uri = truth_uris[QUESTION_NUMBER]

question = What do the parameters for HNSW mean?



## Execute a vector search

Search Milvus using [PyMilvus API](https://milvus.io/docs/search.md).

💡 By their nature, vector searches are "semantic" searches.  For example, if you were to search for "leaky faucet": 
> **Traditional Key-word Search** - either or both words "leaky", "faucet" would have to match some text in order to return a web page or link text to the document.

> **Semantic search** - results containing words "drippy" "taps" would be returned as well because these words mean the same thing even though they are different words,

In [12]:
def search_milvus(mc, question, top_k):
    # Wrap the mc.search() call in a function

    # Embed the question using the same encoder.
    query_embeddings = _utils.embed_query(encoder, [question])

    # Return top k results with HNSW index.
    SEARCH_PARAMS = dict({
        # Re-use index param for num_candidate_nearest_neighbors.
        "ef": INDEX_PARAMS['efConstruction']
        })

    # Define output fields to return.
    OUTPUT_FIELDS = ["h1", "h2", "source", "chunk"]

    answers = mc.search(
        COLLECTION_NAME,
        data=query_embeddings, 
        search_params=SEARCH_PARAMS,
        output_fields=OUTPUT_FIELDS, 
        # Milvus can utilize metadata in boolean expressions to filter search.
        # filter="",
        limit=top_k,
        consistency_level="Eventually"
    )
    return answers

In [13]:
# RETRIEVAL USING MILVUS API.

# # Not needed with Milvus Client API.
# mc.load()

# Define output fields to return.
OUTPUT_FIELDS = ["h1", "h2", "source", "chunk"]

# Run semantic vector search using your query and the vector database.
TOP_K = 3
start_time = time.time()
result = search_milvus(mc, SAMPLE_QUESTION, TOP_K)

elapsed_time = time.time() - start_time
print(f"Milvus Client search time for {len(chunk_list)} vectors: {elapsed_time} seconds")

# Inspect search result.
print(f"type: {type(result[0])}, count: {len(result[0])}")

# Milvus Client search time for 156 vectors: 0.1264362335205078 seconds
# type: <class 'list'>, count: 3

# Extract the retrieval answer.
retrieval_answer = result[0][0]['entity']['chunk']
print(f"chunk_answer: {retrieval_answer[:150]}")


Milvus Client search time for 156 vectors: 0.20205092430114746 seconds
type: <class 'list'>, count: 3
chunk_answer: performance, HNSW limits the maximum degree of nodes on each layer of the graph to M. In addition, you can use efConstruction (when building index) or


In [14]:
# Repeat Retrieval step, but loop through list of questions.

# # Not needed with Milvus Client API.
# mc.load()

# Run similarity_search for all questions in the query list
TOP_K = 1
start_time = time.time()
retrieved_results = [search_milvus(mc, question, TOP_K)
           for question in query]
elapsed_time = time.time() - start_time
print(f"LangChain Zilliz search time for {len(chunks)} vectors: {elapsed_time} seconds")

# Extract list of 0th top_k chunks per question.
retrieval_answers = [result[0][0]['entity']['chunk'] for result in retrieved_results]
print(f"count retrieval answers: {len(retrieval_answers)}")

# TODO: Uncomment to print the results
for i, result_list in enumerate(retrieved_results):
    print(f"RESULTS FOR QUESTION #{i+1}:")
    for j, result in enumerate(result_list):
        for k, top_k_result in enumerate(result):
            print(f"top_k: {k+1}:")
            print(top_k_result)

LangChain Zilliz search time for 156 vectors: 0.7929909229278564 seconds
count retrieval answers: 4
RESULTS FOR QUESTION #1:
top_k: 1:
{'id': 446268198608633780, 'distance': 0.7123057842254639, 'entity': {'chunk': 'performance, HNSW limits the maximum degree of nodes on each layer of the graph to M. In addition, you can use efConstruction (when building index) or ef (when searching targets) to specify a search range. building parameters: M: Maximum degree of the node. efConstruction: Take the effect in stage of index construction. # HNSW client.create_index(collection_name, IndexType.HNSW, { "M": 16, # int. 4~64 "efConstruction": 40 # int. 8~512 } ) search parameters: ef: Take the effect in stage of search scope,', 'source': 'https://pymilvus.readthedocs.io/en/latest/param.html', 'h1': 'Index', 'h2': 'Milvus support to create index to accelerate vecto'}}
RESULTS FOR QUESTION #2:
top_k: 1:
{'id': 446268198608633766, 'distance': 0.7082682847976685, 'entity': {'chunk': 'Metrics. Vector In

## Assemble and inspect the search results from your docs.

The search result is in the variable `results[0]` of type `'pymilvus.orm.search.SearchResult'`.  

In [15]:
# Assemble `num_shot_answers` retrieved 1st context and context metadata.
METADATA_FIELDS = [f for f in OUTPUT_FIELDS if f != 'chunk']
all_formatted_results = []
all_context = []
all_context_metadata = []

# Iterate over the results for each question
for question_results in retrieved_results:
    # Assemble the context and context metadata for the current question
    formatted_results, context, context_metadata = _utils.client_assemble_retrieved_context(
        question_results, metadata_fields=METADATA_FIELDS, num_shot_answers=3)
    
    # Append the formatted results, context, and context metadata to the corresponding lists
    all_formatted_results.append(formatted_results)
    all_context.append(context)
    all_context_metadata.append(context_metadata)

# # TODO - Uncomment to loop through each context and metadata and print.
for i, (context, context_metadata) in enumerate(zip(all_context, all_context_metadata)):
    for j in range(len(context)):
        print(f"QUESTION #{i+1}, top_k = {j+1}")
        print(f"Context: {context[j][:150]}")
        print(f"Metadata: {context_metadata[j]}")
        print()

QUESTION #1, top_k = 1
Context: performance, HNSW limits the maximum degree of nodes on each layer of the graph to M. In addition, you can use efConstruction (when building index) or
Metadata: {'h1': 'Index', 'h2': 'Milvus support to create index to accelerate vecto', 'source': 'https://pymilvus.readthedocs.io/en/latest/param.html'}

QUESTION #2, top_k = 1
Context: Metrics. Vector Index¶ FLAT IVF_FLAT IVF_SQ8 IVF_SQ8_H IVF_PQ HNSW ANNOY RNSG FLAT¶ If FLAT index is used, the vectors are stored in an array of float
Metadata: {'h1': 'Index', 'h2': 'Milvus support to create index to accelerate vecto', 'source': 'https://pymilvus.readthedocs.io/en/latest/param.html'}

QUESTION #3, top_k = 1
Context: metric_type=) The attributes of collection can be extracted from info. >>> info.collection_name 'demo_film_tutorial' >>> info.dimension 8 >>> info.ind
Metadata: {'h1': 'Tutorial', 'h2': 'This is a basic introduction to Milvus by PyMilvus', 'source': 'https://pymilvus.readthedocs.io/en/latest/tut

## Evaluate using an open source LLM as a judge.


In [55]:
# Choose a single truth, retrieval text pair.
truth = truth_answer
retrieval = retrieval_answer

print(f"truth: {truth[:100]}\n")
print(f"retrieval: {retrieval[:100]}\n")
eval_df.head(2)

truth: - M: maximum degree of nodes in a layer of the graph. - efConstruction: number of nearest neighbors 

retrieval: performance, HNSW limits the maximum degree of nodes on each layer of the graph to M. In addition, y



Unnamed: 0,Question,ground_truth_answer,Uri,retrieval_chunk_text,H1,H2,assistant_answer,Score,Reason
0,What do the parameters for HNSW mean?\n,- M: maximum degree of nodes in a layer of the...,https://pymilvus.readthedocs.io/en/latest/para...,"performance, HNSW limits the maximum degree of...",Index,Milvus support to create index to accelerate v...,,,
1,What are HNSW good default parameters when dat...,"M=16, efConstruction=32, ef=32",https://pymilvus.readthedocs.io/en/latest/para...,,,,,,


In [51]:
# Try using simple LLM-as-judge with zero-shot prompt.

import json, pprint
import openai, tiktoken
from openai import OpenAI

# Define the generation llm model to use.
LLM_NAME = "gpt-3.5-turbo-1106"
TEMPERATURE = 0.0

# Reasonable values for the penalty coefficients are around 0.1 to 1 if the aim is to just reduce repition 
# somewhat. To strongly suppress repetition, set coefficients = 2.
FREQUENCY_PENALTY = 2

# See how to save api key in env variable.
# https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
openai_client = OpenAI(
    # This is the default and can be omitted
    # api_key=os.environ.get("OPENAI_API_KEY"),
    api_key=os.environ["OPENAI_API_KEY"],
)

In [64]:
# Function to call OpenAI LLM as judge on zero-shot task.
def get_openai_score(llm_name, user_prompt,
                     temperature=0.0, random_seed=415, frequency_penalty=2, max_tokens=500):

    SYSTEM_PROMPT = f"""
    You are a fair, impartial judge.
    """
        
    # Define the OpenAIEvaluator.
    responses = openai_client.chat.completions.create(
        response_format={
            "type": "json_object", 
            # "schema": Result.schema_json()
        },
        messages=[
            # {"role": "system", "content": "You are a helpful assistant."},  # background tone
            # {"role": "user", "content": "Who won the world series in 2020?"}, # question
            # Use assistant messages to provide what was previously said in multi-turn conversations.
            # {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."}, 

            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt}
        ],
        model=llm_name,
        temperature=temperature, # the degree of randomness of the model's output
        seed=random_seed,  # for reproducibility
        frequency_penalty=frequency_penalty, # allowed amount of repitition in the model's output
        max_tokens=max_tokens # maximum number of tokens the model can output
    )

    # Make sure total_tokens < 4096.
    token_dict = {
        'prompt_tokens':responses.usage.prompt_tokens,
        'completion_tokens':responses.usage.completion_tokens,
        'total_tokens':responses.usage.total_tokens,
    }

    # Print answer as a JSON object.
    openai_response = responses.choices[0].message.content
    json_response = json.loads(openai_response)
    json_response # single json object with 3 fields

    # Create a DataFrame from a list of dictionaries.
    response_df = pd.DataFrame([json_response])
    token_df = pd.DataFrame([token_dict])

    return response_df, token_df

In [70]:
text1 = truth
text2 = retrieval
question_number = QUESTION_NUMBER

ZERO_SHOT_PROMPT = f"""
For each question_number: {question_number}, 
calculate the similarity between these two texts, using semantic meaning, not word order. 
Text1: {text1}, Text2: {text2}.
Calculate llm_zero_shot_similarity_score as a number between 0 and 4, where 4 indicates identical content and 0 indicates completely different content.
Output JSON fields:
- question_number
- text1
- text2 
- llm_zero_shot_similarity_score
"""

'\nFor each question_number: 0, \ncalculate the similarity between these two texts, using semantic meaning, not word order. \nText1: - M: maximum degree of nodes in a layer of the graph.\u2028- efConstruction: number of nearest neighbors to consider when connecting nodes in the graph.\u2028- ef: number of nearest neighbors to consider when searching for similar vectors.  , Text2: performance, HNSW limits the maximum degree of nodes on each layer of the graph to M. In addition, you can use efConstruction (when building index) or ef (when searching targets) to specify a search range. building parameters: M: Maximum degree of the node. efConstruction: Take the effect in stage of index construction. # HNSW client.create_index(collection_name, IndexType.HNSW, { "M": 16, # int. 4~64 "efConstruction": 40 # int. 8~512 } ) search parameters: ef: Take the effect in stage of search scope,.\nCalculate llm_zero_shot_similarity_score as a number between 0 and 4, where 4 indicates identical content a

In [54]:
# Test zero-shot LLM as judge on a single question.
# Doc Openai function calling: https://platform.openai.com/docs/guides/function-calling

# # CAREFUL!! THIS COSTS MONEY!!
# start_time = time.time()
# result_df, token_df = get_openai_score(LLM_NAME, ZERO_SHOT_PROMPT, TEMPERATURE)
# elapsed_time = time.time() - start_time
# print(f"LLM as judge zero-shot took: {elapsed_time} seconds")

display(result_df.head())  # score = 2
token_df.head()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


LLM as judge zero-shot took: 18.57730984687805 seconds


Unnamed: 0,question_number,text1,text2,llm_zero_shot_similarity_score
0,0,"In the 1600’s, the Dutch planted a trading pos...",def name(self): return self._name @property de...,0


Unnamed: 0,prompt_tokens,completion_tokens,total_tokens
0,322,230,552


In [73]:
# Use zero-shot LLM as judge on all question/retrieval pairs.
# Drop token counting for now.

ZERO_SHOT_PROMPT_TEMPLATE = """For each question_number: {question_number}, 
calculate the similarity between these two texts, using semantic meaning, not word order. 
Text1: {text1}, Text2: {text2}.
Calculate llm_zero_shot_similarity_score as a number between 0 and 4, where 4 indicates identical content and 0 indicates completely different content.
Output JSON fields:
- question_number
- text1
- text2 
- llm_zero_shot_similarity_score
"""

# Loop through the truth texts and retrieval texts, evaluate each pair using the LLM.
results_list = []
tokens_list = []
i = 0

# CAREFUL!! THIS COSTS MONEY!!
start_time = time.time()
for truth, retrieval in zip(truth_answers, retrieval_answers):

    # Construct the zero-shot prompt.
    zero_shot_prompt = ZERO_SHOT_PROMPT_TEMPLATE.format(question_number=i, text1=truth, text2=retrieval)

    # Generate zero-shot llm as judge score.
    temp_df, tempt_df = get_openai_score(LLM_NAME, zero_shot_prompt, TEMPERATURE)
    results_list.append(temp_df)
    tokens_list.append(tempt_df)
    i += 1
elapsed_time = time.time() - start_time
print(f"LLM as judge zero-shot took: {elapsed_time} seconds")

# Create a DataFrame from the pandas dataframes.
results_df = pd.concat(results_list, ignore_index=True)
tokens_df = pd.concat(tokens_list, ignore_index=True)
display(results_df.head())
tokens_df.head()

LLM as judge zero-shot took: 52.86700487136841 seconds


Unnamed: 0,question_number,text1,text2,llm_zero_shot_similarity_score
0,0,- M: maximum degree of nodes in a layer of the...,"performance, HNSW limits the maximum degree of...",2
1,1,"M=16, efConstruction=32, ef=32",Metrics. Vector Index¶ FLAT IVF_FLAT IVF_SQ8 I...,0
2,2,"Trick answer: IP inner product, not yet update...",metric_type=) The attributes of collection can...,0
3,3,"In the 1600’s, the Dutch planted a trading pos...",def name(self): return self._name @property de...,0


This is an example prompt with placeholders for question_number: 4, text1: - M: maximum degree of nodes in a layer of the graph. - efConstruction: number of nearest neighbors to consider when connecting nodes in the graph. - ef: number of nearest neighbors to consider when searching for similar vectors.  , and text2: performance, HNSW limits the maximum degree of nodes on each layer of the graph to M. In addition, you can use efConstruction (when building index) or ef (when searching targets) to specify a search range. building parameters: M: Maximum degree of the node. efConstruction: Take the effect in stage of index construction. # HNSW client.create_index(collection_name, IndexType.HNSW, { "M": 16, # int. 4~64 "efConstruction": 40 # int. 8~512 } ) search parameters: ef: Take the effect in stage of search scope,. The curly braces here {} are escaped.
This is an example prompt with placeholders for question_number: 4, text1: M=16, efConstruction=32, ef=32, and text2: Metrics. Vector 

In [61]:
# Now write a Few Shots Learning Prompt.
# Ask for more scores and give explicit examples for each score.
text1 = truth
text2 = retrieval
question_number = QUESTION_NUMBER

FEW_SHOT_PROMPT = f"""
For each question_number: {question_number}, calculate the similarity between these two texts: 
Text1: {text1}, Text2: {text2}.

  You'll be given a function grading_function which you'll call for each text pair to submit your reasoning and score for the Correctness and Cmpleteness of the answer. 

  Below is your grading rubric: 

- Correctness: If Text2 contains the same key facts as Text1, below are the details for different scores:

  - Score = 0: Text2 is completely incorrect, doesn’t mention anything about Text1 or is completely contrary to Text1.

      - For example, when Text2 is empty string, or content that’s completely irrelevant, or sorry I don’t know the answer.

  - Score = 1: Text2 is hallucinating on any of the facts from Text1.

      - Example:

          - Text1: "L2 according to documentation, but in the code it is IP inner product."

          - Answer: "Jaccard"

  - Score = 2: If Text2 provides some facts from Text1.

      - Example:

          - Text1: "L2 according to documentation, but in the code it is IP inner product."

          - Text2: “L2"

  - Score = 3: If Text2 correctly answers the question not missing any major facts

      - Example:

          - Text1: "L2 according to documentation, but in the code it is IP inner product."

          - Text2:  "L2 or IP"

- Completeness: How complete is the answer, does it fully answer all aspects of the question and provide comprehensive explanation and other necessary information. Below are the details for different scores:

  - Score 0: If Text2 is completely incorrect, then the completeness is also zero score.

  - Score 1: if the answer is correct but too short to fully answer the question, then we can give score 1 for completeness.

      - Example:

          - Text1: "The parameters for HNSW are M and efConstruction during construction. During search param is ef."

          - Text2: "The parameters for HNSW are M."

  - Score 2: Text2 is missing description about details. Or is completely missing one minor fact.

      - Example:

          - Text1: "The parameters for HNSW are M and efConstruction during construction. During search param is ef."

          - Text2: "The parameters for HNSW are M, efConstruction, and ef."

      - Example:

          - Text1: "The parameters for HNSW are M and efConstruction during construction. During search param is ef."

          - Text2: "The parameters for HNSW are M and efConstruction."

  - Score 3: Text2 is correct, and covers all the main aspects of the question

      - Example:

          - Text1: "The parameters for HNSW are M and efConstruction during construction. During search param is ef."

          - Text2: "The parameters for HNSW are M and ef during construction and ef during search."

      - Example:

          - Text1: "The parameters for HNSW are M and efConstruction during construction. During search param is ef."

          - Text2: "The parameters for HNSW are ef during search.  M and ef during construction."

- Then final rating:

    - llm_few_shot_similarity_score: 60% correctness + 40% completeness

Output JSON fields:
- question_number
- llm_few_shot_similarity_score
"""

In [62]:
# Test few-shot LLM as judge on a single question.
# Doc Openai function calling: https://platform.openai.com/docs/guides/function-calling

# # CAREFUL!! THIS COSTS MONEY!!
# start_time = time.time()
# result2_df, token_df = get_openai_score(LLM_NAME, FEW_SHOT_PROMPT, TEMPERATURE)
# elapsed_time = time.time() - start_time
# print(f"LLM as judge few-shot took: {elapsed_time} seconds")

display(result2_df.head())  # score = 2.4
token_df.head()

# question_number	llm_few_shot_similarity_score
# 0	0	2.4
# prompt_tokens	completion_tokens	total_tokens
# 0	941	28	969

LLM as judge few-shot took: 3.3466129302978516 seconds


Unnamed: 0,question_number,llm_few_shot_similarity_score
0,0,2.4


Unnamed: 0,prompt_tokens,completion_tokens,total_tokens
0,941,28,969


In [75]:
FEW_SHOT_PROMPT_TEMPLATE = """For each question_number: {question_number}, calculate the similarity between these two texts: 
Text1: {text1}, Text2: {text2}.

  You'll be given a function grading_function which you'll call for each text pair to submit your reasoning and score for the Correctness and Cmpleteness of the answer. 

  Below is your grading rubric: 

- Correctness: If Text2 contains the same key facts as Text1, below are the details for different scores:

  - Score = 0: Text2 is completely incorrect, doesn’t mention anything about Text1 or is completely contrary to Text1.

      - For example, when Text2 is empty string, or content that’s completely irrelevant, or sorry I don’t know the answer.

  - Score = 1: Text2 is hallucinating on any of the facts from Text1.

      - Example:

          - Text1: "L2 according to documentation, but in the code it is IP inner product."

          - Answer: "Jaccard"

  - Score = 2: If Text2 provides some facts from Text1.

      - Example:

          - Text1: "L2 according to documentation, but in the code it is IP inner product."

          - Text2: “L2"

  - Score = 3: If Text2 correctly answers the question not missing any major facts

      - Example:

          - Text1: "L2 according to documentation, but in the code it is IP inner product."

          - Text2:  "L2 or IP"

- Completeness: How complete is the answer, does it fully answer all aspects of the question and provide comprehensive explanation and other necessary information. Below are the details for different scores:

  - Score 0: If Text2 is completely incorrect, then the completeness is also zero score.

  - Score 1: if the answer is correct but too short to fully answer the question, then we can give score 1 for completeness.

      - Example:

          - Text1: "The parameters for HNSW are M and efConstruction during construction. During search param is ef."

          - Text2: "The parameters for HNSW are M."

  - Score 2: Text2 is missing description about details. Or is completely missing one minor fact.

      - Example:

          - Text1: "The parameters for HNSW are M and efConstruction during construction. During search param is ef."

          - Text2: "The parameters for HNSW are M, efConstruction, and ef."

      - Example:

          - Text1: "The parameters for HNSW are M and efConstruction during construction. During search param is ef."

          - Text2: "The parameters for HNSW are M and efConstruction."

  - Score 3: Text2 is correct, and covers all the main aspects of the question

      - Example:

          - Text1: "The parameters for HNSW are M and efConstruction during construction. During search param is ef."

          - Text2: "The parameters for HNSW are M and ef during construction and ef during search."

      - Example:

          - Text1: "The parameters for HNSW are M and efConstruction during construction. During search param is ef."

          - Text2: "The parameters for HNSW are ef during search.  M and ef during construction."

- Then final rating:

    - llm_few_shot_similarity_score: 60% correctness + 40% completeness

Output JSON fields:
- question_number
- llm_few_shot_similarity_score
"""

In [79]:
# Use few-shot LLM as judge on all question/retrieval pairs.
# Drop token counting for now.

# Loop through the truth texts and retrieval texts, evaluate each pair using the LLM.
results2_list = []
tokens_list = []
i = 0

# CAREFUL!! THIS COSTS MONEY!!
start_time = time.time()
for truth, retrieval in zip(truth_answers, retrieval_answers):

    # Construct the few-shot prompt.
    few_shot_prompt = FEW_SHOT_PROMPT_TEMPLATE.format(question_number=i, text1=truth, text2=retrieval)
    # print(few_shot_prompt[:50])

    # Generate zero-shot llm as judge score.
    temp_df, tempt_df = get_openai_score(LLM_NAME, few_shot_prompt, TEMPERATURE)
    results2_list.append(temp_df)
    tokens_list.append(tempt_df)
    i += 1
elapsed_time = time.time() - start_time
print(f"LLM as judge few-shot took: {elapsed_time} seconds")

# Create a DataFrame from the pandas dataframes.
results2_df = pd.concat(results2_list, ignore_index=True)
tokens_df = pd.concat(tokens_list, ignore_index=True)
display(results2_df.head())
tokens_df.head()

LLM as judge few-shot took: 16.285736083984375 seconds


Unnamed: 0,question_number,llm_few_shot_similarity_score
0,0,2.4
1,1,2.4
2,2,1.6
3,3,0.0


Unnamed: 0,prompt_tokens,completion_tokens,total_tokens
0,940,28,968
1,895,28,923
2,890,28,918
3,972,28,1000


In [80]:
# Convenience function to get sources from retrieved result object.
def get_references(result):
    sources = []

    for r in result:
        sources.append(r[0][0]['entity']['source'])

    return sources

# Define a binary score whether or not the retrieval source matches ground truth source.
def get_source_binary_score(truth_uris, retrieved_uris):
    """
    Returns 1 if the 0th retrieved uri matches the truth URI, else 0.
    """
    retrieval_scores = []
    for tr, rr in zip(truth_uris, retrieved_uris):
        # https://en.wikipedia.org/wiki/New_York_City
        # Parse out the last part of the URI.
        retrieval_score = 1 if tr.split("/")[-1] == rr.split("/")[-1] else 0
        retrieval_scores.append(retrieval_score)

    return retrieval_scores

In [81]:
# Calculate a rough, binary score if retrieval source matches the ground truth source.

# Get sources from retrieved results.
retrieved_uris = get_references(retrieved_results)
print(f"uris: {len(truth_uris)}, sources: {len(retrieved_uris)}")

# Calculate a rough, binary score if 0th retrieval source matches the ground truth source.
binary_scores = get_source_binary_score(truth_uris, retrieved_uris)
print(f"Binary score for retrieval = {binary_scores}")

# Append the binary sources score to the eval results dataframe.
results_df['binary_source_score'] = binary_scores
results_df.head()

uris: 4, sources: 4
Binary score for retrieval = [1, 1, 1, 0]


Unnamed: 0,question_number,text1,text2,llm_zero_shot_similarity_score,binary_source_score
0,0,- M: maximum degree of nodes in a layer of the...,"performance, HNSW limits the maximum degree of...",2,1
1,1,"M=16, efConstruction=32, ef=32",Metrics. Vector Index¶ FLAT IVF_FLAT IVF_SQ8 I...,0,1
2,2,"Trick answer: IP inner product, not yet update...",metric_type=) The attributes of collection can...,0,1
3,3,"In the 1600’s, the Dutch planted a trading pos...",def name(self): return self._name @property de...,0,0


In [82]:
# Append the few-shot scores to the eval results dataframe.
results_df['llm_few_shot_similarity_score'] = results2_df['llm_few_shot_similarity_score']
results_df.head()


Unnamed: 0,question_number,text1,text2,llm_zero_shot_similarity_score,binary_source_score,llm_few_shot_similarity_score
0,0,- M: maximum degree of nodes in a layer of the...,"performance, HNSW limits the maximum degree of...",2,1,2.4
1,1,"M=16, efConstruction=32, ef=32",Metrics. Vector Index¶ FLAT IVF_FLAT IVF_SQ8 I...,0,1,2.4
2,2,"Trick answer: IP inner product, not yet update...",metric_type=) The attributes of collection can...,0,1,1.6
3,3,"In the 1600’s, the Dutch planted a trading pos...",def name(self): return self._name @property de...,0,0,0.0


In [83]:
# Drop the zero_shot score, use few_shot score instead.

# Calculate a final eval score as a weighted average of the binary source score and the few-shot score.
results_df['final_score'] = (results_df['binary_source_score'] + results_df['llm_few_shot_similarity_score']) / 2
results_df.head()

Unnamed: 0,question_number,text1,text2,llm_zero_shot_similarity_score,binary_source_score,llm_few_shot_similarity_score,final_score
0,0,- M: maximum degree of nodes in a layer of the...,"performance, HNSW limits the maximum degree of...",2,1,2.4,1.7
1,1,"M=16, efConstruction=32, ef=32",Metrics. Vector Index¶ FLAT IVF_FLAT IVF_SQ8 I...,0,1,2.4,1.7
2,2,"Trick answer: IP inner product, not yet update...",metric_type=) The attributes of collection can...,0,1,1.6,1.3
3,3,"In the 1600’s, the Dutch planted a trading pos...",def name(self): return self._name @property de...,0,0,0.0,0.0


## Use an LLM to Generate a chat response to the user's question using the Retrieved Context.

Below, we'll use an open, very tiny generative AI model, or LLM, available on HuggingFace.  Many demos use OpenAI as the LLM choice instead.

In [None]:
# USING A TINY OSS LLM: ASK THE SAME QUESTION WITH RETRIEVED CONTEXT.

# Define the question and context
context_slice = context[0][111:257]
# Short prompt for tiny LLM
short_prompt = f"""Explain more using the Context or say "I don't know".
Context: {context_slice}
"""

# Set the encoding parameters
encoding_parameters = {
    "return_tensors": "pt",  # Return PyTorch tensors
    "max_length": MAX_SEQ_LENGTH,  # Maximum length for the encoded tokens
    "truncation": True,  # Enable truncation to avoid sequences longer than max_length
}

# Encode the inputs for question-answering
inputs = tokenizer.encode_plus(
    SAMPLE_QUESTION,  # The question to be asked
    context_slice,  # The context in which the question is asked
    # Replace context with a short prompt
    # short_prompt,
    **encoding_parameters  # The encoding parameters
)

# Generate the answer using the model
output = model(**inputs)
start_index = torch.argmax(output.start_logits)
end_index = torch.argmax(output.end_logits) + 1
answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][start_index:end_index]))

# Print the generated answer
print("Generated Answer:", answer)

# Better answer but incomplete.

In [None]:
# Drop collection
utility.drop_collection(COLLECTION_NAME)

In [None]:
# Props to Sebastian Raschka for this handy watermark.
# !pip install watermark

%load_ext watermark
%watermark -a 'Christy Bergman' -v -p torch,transformers,sentence_transformers,pymilvus,langchain,openai --conda