# IMDB Vector Search using Milvus Client

First, import some common libraries and define the data reading functions.

In [1]:
# For colab install these libraries in this order:
# !pip install milvus pymilvus langchain torch transformers sentence-transformers python-dotenv

# Import common libraries.
from typing import List
import time
import pandas as pd
import numpy as np

# Import custom functions for splitting and search.
import imdb_utilities

## Start up a local Milvus server.

Code in this notebook uses [Milvus client](https://milvus.io/docs/using_milvusclient.md) with [Milvus lite](https://milvus.io/docs/milvus_lite.md), which runs a local server.  ⛔️ Milvus lite is only meant for demos and local testing.
- pip install milvus pymilvus

💡 **For production purposes**, use a local Milvus docker, Milvus clusters, or fully-managed Milvus on Zilliz Cloud.
- [Local Milvus docker](https://milvus.io/docs/install_standalone-docker.md) requires local docker installed and running.
- [Milvus clusters](https://milvus.io/docs/install_cluster-milvusoperator.md) requires a K8s cluster up and running.
- [Zilliz Cloud free trial](https://cloud.zilliz.com/login): choose a "free" option when you provision.


In [2]:
from milvus import default_server
from pymilvus import (
    connections, utility, 
    MilvusClient,
)

# Cleanup previous data and stop server in case it is still running.
default_server.stop()
default_server.cleanup()

# Start a new milvus-lite local server.
start_time = time.time()
default_server.start()

end_time = time.time()
print(f"Milvus server startup time: {end_time - start_time} sec")
# startup time: 5.6739208698272705

# Add wait to avoid error message from trying to connect.
time.sleep(15)

# Now you could connect with localhost and the given port.
# Port is defined by default_server.listen_port.
connections.connect(host='127.0.0.1', 
                  port=default_server.listen_port,
                  show_startup_banner=True)

# Check if the server is ready.
print(utility.get_server_version())

Milvus server startup time: 7.601609945297241 sec
v2.3.3-lite
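
The fixed `time.sleep(15)` in the cell above is a simple way to wait for the server. One possible alternative is to poll until a readiness check succeeds. The sketch below is only an illustration: `wait_until_ready` is a hypothetical helper, not part of `milvus` or `pymilvus`, and the Milvus calls appear only in comments.

```python
import time

def wait_until_ready(check, timeout=30.0, interval=0.5):
    """Call `check()` until it returns without raising, or re-raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return check()
        except Exception:
            # Give up once the deadline has passed, otherwise wait and retry.
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)

# Hypothetical usage with the objects from the cell above (not run here):
# def connect_and_get_version():
#     connections.connect(host="127.0.0.1", port=default_server.listen_port)
#     return utility.get_server_version()
# print(wait_until_ready(connect_and_get_version))
```

Polling returns as soon as the check passes, so the wait is no longer than it needs to be.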


## Load the Embedding Model checkpoint and use it to create vector embeddings
**Embedding model:**  We will use the open-source [sentence transformers](https://www.sbert.net/docs/pretrained_models.html) hosted on HuggingFace to encode the movie review text.  We will save the embeddings to a pandas dataframe and then into the milvus database.

Two model parameters of note below:
1. EMBEDDING_LENGTH is the dimensionality (length) of each embedding vector. Every input text, whatever its length, is encoded into a single vector of the SAME length, here 768. This size is typical of BERT-base models, whose embeddings are used for downstream tasks such as classification, question answering, or text generation. <br><br>
2. MAX_SEQ_LENGTH is the maximum length the encoder model can handle for input sequences. In this case, if sequences longer than 512 tokens are given to the model, everything longer will be (silently!) chopped off.  This is the reason why a chunking strategy is needed to segment input texts into chunks with lengths that will fit in the model's input.
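
To make the silent truncation concrete, here is a toy sketch. It assumes a whitespace "tokenizer" and a tiny limit of 8 tokens purely for illustration; the real model uses a subword tokenizer and a limit of 512.

```python
MAX_TOKENS = 8  # tiny limit for illustration; the model in this notebook reports 512

def toy_tokenize(text):
    # Whitespace split as a stand-in for the model's real subword tokenizer.
    return text.split()

def truncate(text, max_tokens=MAX_TOKENS):
    """Return (kept_tokens, dropped_tokens) for a given token limit."""
    tokens = toy_tokenize(text)
    return tokens[:max_tokens], tokens[max_tokens:]

review = "This film starts slowly but the final act is the best thing I have seen all year"
kept, dropped = truncate(review)
print("encoded:", " ".join(kept))
print("silently dropped:", " ".join(dropped))
```

In this toy case the reviewer's actual verdict ("the best thing I have seen all year") falls entirely in the dropped part, which is the kind of loss chunking is meant to avoid.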

In [3]:
# Import torch.
import torch
from torch.nn import functional as F
from sentence_transformers import SentenceTransformer

# Initialize torch settings
torch.backends.cudnn.deterministic = True
DEVICE = torch.device('cuda:3' if torch.cuda.is_available() else 'cpu')
print(f"device: {DEVICE}")

# Load the model from huggingface model hub.
model_name = "BAAI/bge-base-en-v1.5"
encoder = SentenceTransformer(model_name, device=DEVICE)
print(type(encoder))
print(encoder)

# Get the model parameters and save for later.
MAX_SEQ_LENGTH = encoder.get_max_seq_length() 
HF_EOS_TOKEN_LENGTH = 1
EMBEDDING_LENGTH = encoder.get_sentence_embedding_dimension()

# Inspect model parameters.
print(f"model_name: {model_name}")
print(f"EMBEDDING_LENGTH: {EMBEDDING_LENGTH}")
print(f"MAX_SEQ_LENGTH: {MAX_SEQ_LENGTH}")

device: cpu
<class 'sentence_transformers.SentenceTransformer.SentenceTransformer'>
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
model_name: BAAI/bge-base-en-v1.5
EMBEDDING_LENGTH: 768
MAX_SEQ_LENGTH: 512


## Create a Milvus collection

You can think of a collection in Milvus like a "table" in SQL databases.  The **collection** will contain the 
- **Schema** (or no-schema Milvus Client).  
💡 You'll need the vector `EMBEDDING_LENGTH` parameter from your embedding model.
- **Vector index** for efficient vector search
- **Vector distance metric** for measuring nearest neighbor vectors
- **Consistency level**
In Milvus, transactional consistency is possible; however, according to the [CAP theorem](https://en.wikipedia.org/wiki/CAP_theorem), some latency must be sacrificed. 💡 Searching movie reviews is not mission-critical, so [`eventually`](https://milvus.io/docs/consistency.md) consistent is fine here.

## Add a Vector Index

The vector index determines the vector **search algorithm** used to find the closest vectors in your data to the query a user submits.  Most vector indexes use different sets of parameters depending on whether the database is:
- **inserting vectors** (creation mode) - vs - 
- **searching vectors** (search mode) 

Scroll down the [docs page](https://milvus.io/docs/index.md) to see a table listing different vector indexes available on Milvus.  For example:
- FLAT - deterministic exhaustive search
- IVF_FLAT or IVF_SQ8 - Cluster-based (inverted file) index (stochastic approximate search)
- HNSW - Graph index (stochastic approximate search)
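
As a rough picture of what an exhaustive (FLAT-style) search does, the sketch below scores every stored vector against the query and keeps the best matches. It is a plain-Python illustration with made-up 3-dimensional vectors, not Milvus code; `flat_search` is a hypothetical name.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def flat_search(query, vectors, top_k=2):
    """Score every stored vector against the query; return the top_k (index, score) pairs."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Made-up vectors, for illustration only.
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
hits = flat_search([1.0, 0.05, 0.0], vectors, top_k=2)
print(hits)
```

The cost grows linearly with the number of stored vectors, which is why approximate indexes such as IVF or HNSW trade a little accuracy for speed on larger collections.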

Besides a search algorithm, we also need to specify a **distance metric**, that is, a definition of what is considered "close" in vector space.  In the cell below, the [`HNSW`](https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md) search index is chosen.  Its possible distance metrics are one of:
- L2 - L2-norm
- IP - Dot-product
- COSINE - Angular distance

💡 Most use cases work better with normalized embeddings. For unit-length vectors, IP and COSINE give identical scores, and L2 distance produces the same ranking (squared L2 distance = 2 - 2 × IP), so L2 adds nothing.  Only choose L2 if you plan to keep your embeddings unnormalized.
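
A quick numeric check of how the three metrics relate on unit-length vectors: IP equals cosine similarity, and squared L2 distance equals 2 - 2 * IP, so all three rank neighbors the same way. The vectors below are made up for illustration.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Made-up vectors, normalized to unit length.
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

ip = dot(a, b)
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
print(f"IP: {ip:.6f}  COSINE: {cosine:.6f}")
print(f"L2^2: {l2_squared(a, b):.6f}  2 - 2*IP: {2 - 2 * ip:.6f}")
```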

### Exercise #1 (2 min):
Create a collection named "movies".  Use the default AUTOINDEX.
> 💡 AUTOINDEX works on both Milvus and Zilliz Cloud (where it is the fastest!)

In [None]:
# Set the Milvus collection name.
COLLECTION_NAME = # TODO (exercise): code here

# Use no-schema Milvus client (uses flexible json key:value format).
# https://milvus.io/docs/using_milvusclient.md
mc = MilvusClient(uri="http://localhost")
mc.drop_collection(COLLECTION_NAME)
mc.create_collection(COLLECTION_NAME, 
                     EMBEDDING_LENGTH, 
                     #params=index_params # Omit params to use AUTOINDEX.
                    )

print(mc.describe_collection(COLLECTION_NAME))
print(f"Created collection: {COLLECTION_NAME}")

In [5]:
# Re-run create collection and add vector index specifying custom params.

# For vector length, use the embedding length from the embedding model.
print(f"Embedding length: {EMBEDDING_LENGTH}")

# Set the Milvus collection name.
COLLECTION_NAME = "movies"

# M = max number graph connections per layer. Large M = denser graph.
# Choice of M: 4~64, larger M for larger data and larger embedding lengths.
M = 16
# efConstruction = num_candidate_nearest_neighbors per layer. 
# Use Rule of thumb: int. 8~512, efConstruction = M * 2.
efConstruction = M * 2

# Show how to change the vector index algorithm parameters.
INDEX_PARAMS = dict({
    'M': M,               
    "efConstruction": efConstruction })
# Create the search index for local Milvus server.
index_params = {
    "index_type": "HNSW", 
    "metric_type": "COSINE", 
    "params": INDEX_PARAMS
    }

# Below example uses no-schema Milvus client (flexible json key:value format).
# https://milvus.io/docs/using_milvusclient.md
mc = MilvusClient(uri="http://localhost")
mc.drop_collection(COLLECTION_NAME)
mc.create_collection(
    COLLECTION_NAME, 
    EMBEDDING_LENGTH, 
    consistency_level="Eventually", 
    auto_id=True,
    overwrite=True,
    params=index_params # Use custom index params or omit.
    )

print(f"Created collection: {COLLECTION_NAME}")
print(mc.describe_collection(COLLECTION_NAME))

Embedding length: 768
Created collection: movies
{'collection_name': 'movies', 'auto_id': True, 'num_shards': 1, 'description': '', 'fields': [{'field_id': 100, 'name': 'id', 'description': '', 'type': 5, 'params': {}, 'element_type': 0, 'auto_id': True, 'is_primary': True}, {'field_id': 101, 'name': 'vector', 'description': '', 'type': 101, 'params': {'dim': 768}, 'element_type': 0}], 'aliases': [], 'collection_id': 445754962278875466, 'consistency_level': 3, 'properties': {}, 'num_partitions': 1, 'enable_dynamic_field': True}


## Read CSV data into a pandas dataframe

The data used in this notebook is the [IMDB large movie review dataset](https://ai.stanford.edu/~amaas/data/sentiment/) from the Stanford AI Lab. It is a conveniently preprocessed dataset of 50,000 reviews, sampled 50:50 between Positive and Negative. The data has these columns: movie_index, raw review text, and movie rating.

In [6]:
# 1. Download data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
# 2. Move .csv file to data/ folder.

# citation:  ACL 2011, @InProceedings{maas-EtAl:2011:ACL-HLT2011,
#   author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
#   title     = {Learning Word Vectors for Sentiment Analysis},
#   booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
#   month     = {June},
#   year      = {2011},
#   address   = {Portland, Oregon, USA},
#   publisher = {Association for Computational Linguistics},
#   pages     = {142--150},
#   url       = {http://www.aclweb.org/anthology/P11-1015}
# }

In [8]:
# Read locally stored data.
filepath = "data/movie_data.csv"

df = pd.read_csv(f"{filepath}")

# Drop duplicates
df.drop_duplicates(keep='first', inplace=True)

# Change label column names.
df.columns = ['text', 'label_int']

# Map numbers to text 'Positive' and 'Negative' for sentiment labels.
df["label"] = df["label_int"].apply(imdb_utilities.sentiment_score_to_name)

# Split data into train/valid/test.
df, df_train, df_val, df_test = imdb_utilities.partition_dataset(df, smoke_test=False)
print(f"original df shape: {df.shape}")
print(f"df_train shape: {df_train.shape}, df_val shape: {df_val.shape}, df_test shape: {df_test.shape}")
assert df_train.shape[0] + df_val.shape[0] + df_test.shape[0] == df.shape[0]

# Inspect data.
print(f"Example text length: {len(df.text[0])}")
print(f"Example text: {df.text[0]}")
display(df.head(2))


original df shape: (100, 4)
df_train shape: (100, 4), df_val shape: (0, 4), df_test shape: (0, 4)
Example text length: 1113
Example text: The whole town of Blackstone is afraid, because they lynched Bret Dixon's brother - and he is coming back for revenge! At least that's what they think.<br /><br />A great Johnny Hallyday and a very interesting, early Mario Adorf star in this Italo-Western, obviously filmed in the Alps.<br /><br />Bret Dixon is coming back to Blackstone to investigate why his brother was lynched. He is a loner and gunslinger par excellance, everybody is afraid of him - the Mexican bandits (fighting the Gringos that took their land!) as well as the "decent" citizens that lynched Bret's brother. They lynched him, because they thought he stole their money instead of bringing it to Dallas to the safety of the bank there. But this is is only half the truth, as we find out in the course of this psychologically interesting western.<br /><br />But beware, it's kind of a depre

| | movie_index | text | label_int | label |
|---|---|---|---|---|
| 0 | 80 | The whole town of Blackstone is afraid, becaus... | 1 | Positive |
| 1 | 84 | This Harold Lloyd short wasn't really much; no... | 0 | Negative |


In [9]:
# Check if approx. equal number training examples for each class.
class1 = df_train.loc[(df_train.label == "Positive"), :].copy()
class2 = df_train.loc[(df_train.label == "Negative"), :].copy()
print(f"Count samples positive: {class1.shape[0]}")
print(f"Count samples negative: {class2.shape[0]}")

Count samples positive: 50
Count samples negative: 50


In [10]:
# Uncomment this to create the small sample of data for github.
# df_small = df.head(100)[['text', 'label_int']].copy()
# display(df_small.head())
# df_small.to_csv("data/movie_data_small.csv", index=False)

## Chunking

Before embedding, it is necessary to decide your chunk strategy, chunk size, and chunk overlap.  In this demo, I will use:
- **Strategy** = Keep movie reviews as single chunks unless they are too long.
- **Chunk size** = Use the embedding model's parameter `MAX_SEQ_LENGTH`
- **Overlap** = Rule-of-thumb 10-15%
- **Function** = Langchain's convenient `RecursiveCharacterTextSplitter` to split up long reviews recursively.
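
To see how chunk size and overlap drive the number of chunks, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and spaces, so its counts can differ); `chunk_with_overlap` is a hypothetical helper for illustration only.

```python
def chunk_with_overlap(text, chunk_size, overlap_ratio=0.10):
    """Split text into windows of at most chunk_size characters, overlapping by ~overlap_ratio."""
    overlap = int(round(chunk_size * overlap_ratio))
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

text = "x" * 1113  # same length as the example review printed earlier
for size in (511, 256):
    print(size, len(chunk_with_overlap(text, size)))
```

Smaller chunks mean more vectors per review, which increases storage and search work but gives each vector a narrower focus.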


In [11]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def recursive_splitter_wrapper(text, chunk_size):

    # Default chunk overlap is 10% of chunk_size, as an integer number of characters.
    chunk_overlap = int(round(chunk_size * 0.10))

    # Use langchain's convenient recursive chunking method.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
    )
    chunks: List[str] = text_splitter.split_text(text)

    # Replace HTML line breaks with spaces.
    chunks = [chunk.replace("<br /><br />", " ") for chunk in chunks]

    return chunks

# Use recursive splitter to chunk text.
def imdb_chunk_text(batch_size, df, chunk_size):

    batch = df.head(batch_size).copy()
    print(f"chunk size: {chunk_size}")
    print(f"original shape: {batch.shape}")
    
    start_time = time.time()
    # 1. Change primary key type to string.
    batch["movie_index"] = batch["movie_index"].astype(str)

    # 2. Split reviews into chunks of at most chunk_size characters.
    batch['chunk'] = batch['text'].apply(recursive_splitter_wrapper, chunk_size=chunk_size)
    # Explode the 'chunk' column to create new rows for each chunk.
    batch = batch.explode('chunk', ignore_index=True)
    print(f"new shape: {batch.shape}")

    # 3. Add embeddings as new column in df.
    review_embeddings = torch.tensor(encoder.encode(batch['chunk']))
    # Normalize embeddings to unit length.
    review_embeddings = F.normalize(review_embeddings, p=2, dim=1)
    # Quick check if embeddings are normalized.
    norms = np.linalg.norm(review_embeddings, axis=1)
    assert np.allclose(norms, 1.0, atol=1e-5)

    # 4. Convert embeddings to list of `numpy.ndarray`, each containing `numpy.float32` numbers.
    converted_values = list(map(np.float32, review_embeddings))
    batch['vector'] = converted_values

    # 5. Reorder columns for convenience, so index first, labels at end.
    new_order = ["movie_index", "text", "chunk", "vector", "label_int", "label"]
    batch = batch[new_order]

    end_time = time.time()
    print(f"Chunking + embedding time for {batch_size} docs: {end_time - start_time} sec")

    # Inspect the batch of data.
    display(batch.head())
    assert len(batch.chunk[0]) <= MAX_SEQ_LENGTH-1
    assert len(batch.vector[0]) == EMBEDDING_LENGTH
    print(f"type embeddings: {type(batch.vector)} of {type(batch.vector[0])}")
    print(f"of numbers: {type(batch.vector[0][0])}")

    return batch

⚠️ **Demo batch size = 100 rows for demonstration purposes.**

With so little data, the search results are limited; expect better answers with more data!

In [12]:
## Prepare df for insertion into Milvus index.

# Use the embedding model parameters to calculate chunk_size and overlap.
chunk_size = MAX_SEQ_LENGTH - HF_EOS_TOKEN_LENGTH

# Chunk a batch of data from pandas DataFrame and inspect it.
BATCH_SIZE = 100
batch = imdb_chunk_text(BATCH_SIZE, df, chunk_size)

# Chunking looks good, drop the original text column.
batch.drop(columns=["text"], inplace=True)

chunk size: 511
original shape: (100, 4)
new shape: (290, 5)
Chunking + embedding time for 100 docs: 8.375767946243286 sec


| | movie_index | text | chunk | vector | label_int | label |
|---|---|---|---|---|---|---|
| 0 | 80 | The whole town of Blackstone is afraid, becaus... | The whole town of Blackstone is afraid, becaus... | [-0.075508565, -0.022925325, 0.022277957, 0.03... | 1 | Positive |
| 1 | 80 | The whole town of Blackstone is afraid, becaus... | Mexican bandits (fighting the Gringos that too... | [0.0059213955, 0.0042556957, -0.028471153, 0.0... | 1 | Positive |
| 2 | 80 | The whole town of Blackstone is afraid, becaus... | and definitely everybody is bad to the bone...... | [-0.004301766, -0.03188503, -0.0051136613, -0.... | 1 | Positive |
| 3 | 84 | This Harold Lloyd short wasn't really much; no... | This Harold Lloyd short wasn't really much; no... | [-0.007607854, -0.033714272, -0.0077492087, 0.... | 0 | Negative |
| 4 | 84 | This Harold Lloyd short wasn't really much; no... | part was the last four or five minutes when th... | [0.014139466, -0.04540589, 0.012334436, 0.0192... | 0 | Negative |


type embeddings: <class 'pandas.core.series.Series'> of <class 'numpy.ndarray'>
of numbers: <class 'numpy.float32'>


### Exercise #2 (2 min):
Change the chunk_size and see what happens.  The model default is 511.

- What do your observations imply about changing the chunk_size and the number of vectors?
- How many vectors are there with chunk_size=256?

In [None]:
###############
## EXERCISE #2: Change chunk_size to 256 below.  How many chunks (vectors) does this create?
## ANSWER:  542
## BONUS:   Can you explain why the number of vectors changed from 290 to 542?  
##          Hint:  What is the default chunk overlap?  290 * (2 - 0.10) approx. equals 542.
###############
# Default chunk_size and overlap are calculated from embedding model parameters.
chunk_size =  # TODO (exercise): code here

# Chunk a batch of data from pandas DataFrame and inspect it.
batch = imdb_chunk_text( # TODO (exercise): code here )

In [14]:
# Don't forget to re-run using the default chunk size!  

# Use the embedding model parameters to calculate chunk_size and overlap.
chunk_size = MAX_SEQ_LENGTH - HF_EOS_TOKEN_LENGTH

# Chunk a batch of data from pandas DataFrame and inspect it.
BATCH_SIZE = 100
batch = imdb_chunk_text(BATCH_SIZE, df, chunk_size)

# Chunking looks good, drop the original text column.
batch.drop(columns=["text"], inplace=True)

chunk size: 511
original shape: (100, 4)
new shape: (290, 5)
Chunking + embedding time for 100 docs: 8.245778799057007 sec


| | movie_index | text | chunk | vector | label_int | label |
|---|---|---|---|---|---|---|
| 0 | 80 | The whole town of Blackstone is afraid, becaus... | The whole town of Blackstone is afraid, becaus... | [-0.075508565, -0.022925325, 0.022277957, 0.03... | 1 | Positive |
| 1 | 80 | The whole town of Blackstone is afraid, becaus... | Mexican bandits (fighting the Gringos that too... | [0.0059213955, 0.0042556957, -0.028471153, 0.0... | 1 | Positive |
| 2 | 80 | The whole town of Blackstone is afraid, becaus... | and definitely everybody is bad to the bone...... | [-0.004301766, -0.03188503, -0.0051136613, -0.... | 1 | Positive |
| 3 | 84 | This Harold Lloyd short wasn't really much; no... | This Harold Lloyd short wasn't really much; no... | [-0.007607854, -0.033714272, -0.0077492087, 0.... | 0 | Negative |
| 4 | 84 | This Harold Lloyd short wasn't really much; no... | part was the last four or five minutes when th... | [0.014139466, -0.04540589, 0.012334436, 0.0192... | 0 | Negative |


type embeddings: <class 'pandas.core.series.Series'> of <class 'numpy.ndarray'>
of numbers: <class 'numpy.float32'>


## Insert data into Milvus

We can insert a batch of data directly from a pandas dataframe into Milvus.

🤔 TODO: This would be a good place to demonstrate Milvus' scalability by using Ray together with Milvus to run batches in parallel. I'll do this in a future tutorial.
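
For larger datasets, the rows are usually inserted in several smaller batches rather than one call. The sketch below shows only the sequential slicing step (no Ray, no parallelism); `iter_batches` is a hypothetical helper, the rows are stand-ins, and the `mc.insert` call is shown only as a comment.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of rows with at most batch_size items each."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# Stand-in rows; in the notebook these would be the dicts built from `batch`.
rows = [{"chunk": f"chunk {i}"} for i in range(290)]
sizes = [len(b) for b in iter_batches(rows, batch_size=100)]
print(sizes)  # [100, 100, 90]

# Each slice could then be passed to the client, for example:
# for b in iter_batches(rows, batch_size=100):
#     mc.insert(COLLECTION_NAME, data=b)
```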

In [15]:
# Insert a batch of data into the Milvus collection.

# Convert DataFrame to a list of dictionaries, one per row.
dict_list = batch.to_dict(orient="records")

print("Start inserting entities")
start_time = time.time()
insert_result = mc.insert(
    COLLECTION_NAME,
    data=dict_list, 
    progress_bar=True)
end_time = time.time()
print(f"Milvus insert time for {batch.shape[0]} vectors: {end_time - start_time} seconds")

# After final entity is inserted, call flush to stop growing segments left in memory.
mc.flush(COLLECTION_NAME)


Start inserting entities


100%|██████████| 1/1 [00:00<00:00, 29.92it/s]

Milvus insert time for 290 vectors: 0.03497195243835449 seconds





## Run a Semantic Search

Now we can search all the movie review embeddings to find the `TOP_K` movie reviews with the closest embeddings to a user's query.
- In this example, we'll search for a movie recommendation for a medical doctor.

💡 The same model should always be used for consistency for all the embeddings.

In [16]:
# .load() not needed when using no-schema Milvus client.

# # Before conducting a search based on a query, you need to load the data into memory.
# mc.load()
# print("Loaded milvus collection into memory.")

## Ask a question about your data

So far in this demo notebook: 
1. Your custom data has been mapped into a vector embedding space
2. Those vector embeddings have been saved into a vector database

Next, you can ask a question about your custom data!

💡 With LLMs:
> **Query** is the generic term for user questions.  
A query is a list of multiple individual questions, up to maybe 1000 different questions!

> **Question** usually refers to a single user question.  
In our example below, the user question is "I'm a medical doctor, what movie should I watch?"

In [17]:
# Define a sample question about your data.
question = "I'm a medical doctor, what movie should I watch?"
query = [question]

# Inspect the length of the query.
QUERY_LENGTH = len(query[0])
print(f"query length: {QUERY_LENGTH}")

query length: 48


**Embed the question using the same embedding model you used earlier**

For vector search to work, the question itself must be embedded with the same model used to create the collection you want to search.

In [18]:
# Embed the query using same embedding model used to create the Milvus collection.
query_embeddings = torch.tensor(encoder.encode(query))
# Normalize embeddings to unit length.
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
# Quick check if embeddings are normalized.
norms = np.linalg.norm(query_embeddings, axis=1)
assert np.allclose(norms, 1.0, atol=1e-5)

# Convert the embeddings to a list of numpy arrays of np.float32.
query_embeddings = list(map(np.float32, query_embeddings))

# Inspect data.
print(type(query_embeddings), len(query_embeddings), type(query_embeddings[0]))
print(type(query_embeddings[0][0]) ) 

<class 'list'> 1 <class 'numpy.ndarray'>
<class 'numpy.float32'>


## Execute a vector search

Search Milvus using [PyMilvus API](https://milvus.io/docs/search.md).

💡 By their nature, vector searches are "semantic" searches.  For example, if you were to search for "leaky faucet": 
> **Traditional keyword search** - either or both of the words "leaky" and "faucet" would have to match some text in order to return a web page or a link to the document.

> **Semantic search** - results containing the words "drippy" and "taps" would be returned as well, because these words mean the same thing even though they are different words.
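
A tiny illustration of the keyword side of this comparison, using made-up documents. `keyword_match` is a hypothetical helper that only checks for exact word overlap.

```python
def keyword_match(query, document):
    """Return True if any query word appears verbatim in the document."""
    doc_words = set(document.lower().split())
    return any(word in doc_words for word in query.lower().split())

# Made-up documents, for illustration only.
documents = [
    "how to fix a leaky faucet",
    "repairing drippy taps in the kitchen",
    "best hiking trails near the city",
]
query = "leaky faucet"
matches = [keyword_match(query, d) for d in documents]
print(matches)  # [True, False, False]
```

The second document is about the same problem but shares no words with the query, so keyword matching misses it. A semantic search compares embeddings instead of words, which is what lets it surface such documents.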

### Exercise #3 (2 min):
Search Milvus using the default search index.


In [19]:
# Run semantic vector search using your query and the vector database.
# Uses default search algorithm:  HNSW and top_k=10.
start_time = time.time()
results = mc.search(
    COLLECTION_NAME,
    data=query_embeddings, 
    )

elapsed_time = time.time() - start_time
print(f"Search time: {elapsed_time} sec")

# Inspect search result.
print(f"type: {type(results)}, count: {len(results[0])}")

Search time: 0.003979921340942383 sec
type: <class 'list'>, count: 10


In [20]:
# Re-run the search using custom settings.

# Return top k results with HNSW index.
TOP_K = 3
SEARCH_PARAMS = dict({
    # Re-use index param for num_candidate_nearest_neighbors.
    "ef": INDEX_PARAMS['efConstruction']
    })

# Run semantic vector search using your query and the vector database.
start_time = time.time()
results = mc.search(
    COLLECTION_NAME,
    data=query_embeddings, 
    search_params=SEARCH_PARAMS,
    output_fields=["movie_index", "chunk", "label"], 
    limit=TOP_K,
    consistency_level="Eventually",
    )

elapsed_time = time.time() - start_time
print(f"Search time: {elapsed_time} sec")

# Inspect search result.
print(f"type: {type(results)}, count: {len(results[0])}")


Search time: 0.0022897720336914062 sec
type: <class 'list'>, count: 3


## Assemble and inspect the search result

The search results are in the variable `results[0]`: a list of hits for the first (and only) query, where each hit is a dict with `distance` and `entity` keys.

In [21]:
## Results returned from MilvusClient are in the form list of lists of dicts.

# Get the movie_indexes, review texts, and labels.
distances = []
texts = []
movie_indexes = []
labels = []
for result in results[0]:
    distances.append(result['distance'])
    texts.append(result['entity']['chunk'])
    movie_indexes.append(result['entity']['movie_index'])
    labels.append(result['entity']['label'])

# Assemble all the results in a zipped list.
formatted_results = list(zip(distances, movie_indexes, texts, labels))

In [22]:
# Print the results.
# k: distance, movie_index, label, review text

for i, row in enumerate(formatted_results):
    print(f"{i}: {np.round(row[0],3)}, {row[1]}, {row[3]}, {row[2][:100]}")

#1:  2006, Serum, 
# 0: 0.541, 931, Negative, Dr. K(David H Hickey)has been trying to master a formula that would end all disease and handicaps, b
# 1: 0.54, 20682, Positive, is not a horror movie, although it does contain some violent scenes, but is rather a comedy. A satir
# 2: 0.535, 12529, Positive, a good movie with a real good story. The fact that there are so many other big stars who


0: 0.541, 56, Negative, Dr. K(David H Hickey)has been trying to master a formula that would end all disease and handicaps, b
1: 0.54, 44, Positive, is not a horror movie, although it does contain some violent scenes, but is rather a comedy. A satir
2: 0.535, 67, Positive, a good movie with a real good story. The fact that there are so many other big stars who all also ha


## Try another question

This time, add the words **only good movies** to the question and see whether the answers are any different.  

For semantically different questions, we expect the answers to be different.

To make the code easier to read, this time I'll just use the convenience function I defined in `imdb_utilities.py`.

In [23]:
# Take as input a user question and conduct semantic vector search using the question.
question = "I'm a medical doctor, what movie should I watch?"
new_question = "I'm a medical doctor, suggest only good movies to watch?"
new_results = \
    imdb_utilities.mc_search_imdb([new_question],
                                   encoder,
                                   mc,
                                   SEARCH_PARAMS, 3, 
                                   milvus_client=True,
                                   COLLECTION_NAME=COLLECTION_NAME,
                                   )

# Print the results.
# k: distance, movie_index, label, review text
for i, row in enumerate(new_results):
    print(f"{i}: {np.round(row[0],3)}, {row[1]}, {row[3]}, {row[2][:100]}")

# As expected, new_question answers are slightly different!
# 0: 0.562, 45719, Positive, the stories but helps Malkovich to provoke some thought.<br /><br />I'd say it is worth seeing and t
# 1: 0.562, 21791, Positive, to add that the dog (who's a pretty darn good actor himself!) comes in a close second.<br /><br />Al
# 2: 0.561, 12529, Positive, a good movie with a real good story. The fact that there are so many other big

0: 0.561, 67, Positive, a good movie with a real good story. The fact that there are so many other big stars who all also ha
1: 0.56, 13, Positive, the stories but helps Malkovich to provoke some thought. I'd say it is worth seeing and the best of 
2: 0.549, 12, Positive, the mini-bio on Woody Strode here as a primer: http://imdb.com/name/nm0834754/bio  The film does a g


In [24]:
# Shut down and cleanup the milvus server.
default_server.stop()
default_server.cleanup()

In [25]:
# Props to Sebastian Raschka for this handy watermark.
# !pip install watermark

%load_ext watermark
%watermark -a 'Christy Bergman' -v -p torch,transformers,milvus,pymilvus,langchain --conda

Author: Christy Bergman

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 8.15.0

torch       : 2.0.1
transformers: 4.34.1
milvus      : 2.3.3
pymilvus    : 2.3.3
langchain   : 0.0.322

conda environment: py310

