# Vector Search using vCore-based Azure Cosmos DB for MongoDB

This notebook demonstrates using an Azure OpenAI embedding model to vectorize documents already stored in Azure Cosmos DB API for MongoDB, storing the embedding vectors and the creation of a vector index. Lastly, the notebook will demonstrate how to query the vector index to find similar documents.

This lab expects the data that was loaded in Lab 2.

In [10]:
import os
import pymongo
import time
import json
from openai import AzureOpenAI
from dotenv import load_dotenv
from tenacity import retry, wait_random_exponential, stop_after_attempt
import certifi

## Load settings

This lab expects the `.env` file that was created in Lab 1 to obtain the connection string for the database.

Add the following entries into the `.env` file to support the connection to Azure OpenAI API, replacing the values for `<your key>` and `<your endpoint>` with the values from your Azure OpenAI API resource.

```text
AOAI_ENDPOINT="<your endpoint>"
AOAI_KEY="<your key>""
```

In [11]:
load_dotenv()
CONNECTION_STRING = os.environ.get("DB_CONNECTION_STRING")
EMBEDDINGS_DEPLOYMENT_NAME = "text-embedding-ada-002"
COMPLETIONS_DEPLOYMENT_NAME = "gpt-35-turbo"
AOAI_ENDPOINT = os.environ.get("AOAI_ENDPOINT")
AOAI_KEY = os.environ.get("AOAI_KEY")
AOAI_API_VERSION = "2023-05-15"

## Establish connectivity to the database

In [12]:
db_client = pymongo.MongoClient(CONNECTION_STRING, tlsCAFile=certifi.where())
# Create database to hold cosmic works data
# MongoDB will create the database if it does not exist
db = db_client.cosmic_works

## Establish Azure OpenAI connectivity

In [13]:
ai_client = AzureOpenAI(
    azure_endpoint = AOAI_ENDPOINT,
    api_version = AOAI_API_VERSION,
    api_key = AOAI_KEY
    )

## Vectorize and store the embeddings in each document

The process of creating a vector embedding field on each document only needs to be done once. However, if a document changes, the vector embedding field will need to be updated with an updated vector.

In [14]:
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(3))
def generate_embeddings(text: str):
    '''
    Generate embeddings from string of text using the deployed Azure OpenAI API embeddings model.
    This will be used to vectorize document data and incoming user messages for a similarity search with
    the vector index.
    '''
    response = ai_client.embeddings.create(input=text, model=EMBEDDINGS_DEPLOYMENT_NAME)
    embeddings = response.data[0].embedding
    time.sleep(0.5) # rest period to avoid rate limiting on AOAI
    return embeddings

In [15]:
# demonstrate embeddings generation using a test string
test = "hello, world"
print(generate_embeddings(test))

[-0.016783414, -0.006727666, -0.027430676, -0.046463147, -0.01095277, 0.01014025, -0.013910343, -0.0048393696, -0.018681461, -0.0283667, 0.028990716, 0.019799488, -0.021710536, -0.006327906, 0.009522735, 0.0066171633, 0.017589433, -0.014456357, 0.011784791, 0.018460454, -0.012330804, -1.1311802e-05, 0.009295229, -0.009893244, -0.009581236, -0.016315402, 0.006864169, -0.016874416, 0.024388602, -0.037882935, 0.00066789147, 0.0033865834, -0.016081396, -0.0064254086, 0.011115274, -0.011895293, 0.00094739837, -0.02756068, 0.02917272, -0.01138828, 0.0024960616, -0.0070396736, 0.0040821005, -0.013715338, -0.032708805, 0.012623311, 0.008768716, -0.015080372, 0.0042706053, 0.02269856, 0.021606533, 0.0015787265, -0.024401601, -0.0019939241, -0.013097823, 0.008671214, -0.035360873, 0.014703362, 0.019994494, -0.020579508, 0.01582139, 0.0038155941, -0.025324624, 0.0121748, -0.010153251, 0.010328755, 0.015990395, 0.008632213, -0.019539481, 0.01458636, 0.020735512, 0.018993469, -0.0054146335, -0.0077

### Vectorize and update all documents in the Cosmic Works database

In [22]:
def add_collection_content_vector_field(collection_name: str):
    """
    Add a new field to the collection to hold the vectorized content of each document.
    """
    collection = db[collection_name]
    # doc = collection.find_one({"_id": "3F105575-8677-42F9-8E1F-76E4B450F136"})
    # print(doc)

    # if "contentVector" in doc:
    #     del doc["contentVector"]

    # # generate embeddings for the document string representation
    # content = json.dumps(doc, default=str)
    # content_vector = generate_embeddings(content)
    # print("content: ", content)

    # doc["contentVector"] = "Makarena"

    # bulk_operations = []
    # bulk_operations.append(
    #     pymongo.UpdateOne(
    #         {"_id": doc["_id"]}, {"$set": {"contentVector": "content_vector"}}, upsert=True
    #     )
    # )
    # collection.bulk_write(bulk_operations)
    
    bulk_operations = []
    cursor = collection.find({}, batch_size=20)
    print("Batch size: 20")
    for doc in cursor:
        # remove any previous contentVector embeddings
        if "contentVector" in doc:
            del doc["contentVector"]

        # generate embeddings for the document string representation
        content = json.dumps(doc, default=str)
        print("content: ", content)
        content_vector = generate_embeddings(content)

        bulk_operations.append(pymongo.UpdateOne(
            {"_id": doc["_id"]},
            {"$set": {"contentVector": content_vector}},
            upsert=True
        ))
    # execute bulk operations
    collection.bulk_write(bulk_operations)
    cursor.close()

In [7]:
# Add vector field to products documents - this will take approximately 3-5 minutes due to rate limiting
add_collection_content_vector_field("products")

  return Cursor(self, *args, **kwargs)


AutoReconnect: c.copilot-cluster.mongocluster.cosmos.azure.com:10260: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000) (configured timeouts: connectTimeoutMS: 20000.0ms)

In [None]:
# Add vector field to customers documents - this will take approximately 1-2 minutes due to rate limiting
add_collection_content_vector_field("customers")

In [20]:
# Add vector field to customers documents - this will take approximately 15-20 minutes due to rate limiting
add_collection_content_vector_field("sales")

  return Cursor(self, *args, **kwargs)


content:  {"_id": "00500AA1-3E9D-4E83-9C21-07B0AF482B3F", "customerId": "72F4BF6F-6BF5-4E22-89C1-29505FD74710", "orderDate": "2012-03-30 00:00:00", "shipDate": "2012-04-06 00:00:00", "details": [{"sku": "BK-R93R-62", "name": "Road-150 Red, 62", "price": 2146.962, "quantity": 1}, {"sku": "FR-R38B-52", "name": "LL Road Frame - Black, 52", "price": 178.5808, "quantity": 1}]}
content:  {"_id": "00CC8882-4B98-4273-BDD3-732CB8F5A2E0", "customerId": "D6A5D690-7A0C-42F4-8557-B1FE8326795F", "orderDate": "2013-05-30 00:00:00", "shipDate": "2013-06-06 00:00:00", "details": [{"sku": "CA-1098", "name": "AWC Logo Cap", "price": 5.394, "quantity": 5}, {"sku": "FR-R92R-44", "name": "HL Road Frame - Red, 44", "price": 858.9, "quantity": 2}, {"sku": "FR-R38B-52", "name": "LL Road Frame - Black, 52", "price": 202.332, "quantity": 3}, {"sku": "BK-R89B-58", "name": "Road-250 Black, 58", "price": 1466.01, "quantity": 3}, {"sku": "HL-U509-B", "name": "Sport-100 Helmet, Blue", "price": 15.7455, "quantity": 7}

CursorNotFound: Cursor 4317912485125541 not found, full error: {'ok': 0.0, 'errmsg': 'Cursor 4317912485125541 not found', 'code': 43, 'codeName': 'CursorNotFound'}

In [88]:
# Create the products vector index
db.command({
  'createIndexes': 'products',
  'indexes': [
    {
      'name': 'VectorSearchIndex',
      'key': {
        "contentVector": "cosmosSearch"
      },
      'cosmosSearchOptions': {
        'kind': 'vector-ivf',
        'numLists': 1,
        'similarity': 'COS',
        'dimensions': 1536
      }
    }
  ]
})

# Create the customers vector index
db.command({
  'createIndexes': 'customers',
  'indexes': [
    {
      'name': 'VectorSearchIndex',
      'key': {
        "contentVector": "cosmosSearch"
      },
      'cosmosSearchOptions': {
        'kind': 'vector-ivf',
        'numLists': 1,
        'similarity': 'COS',
        'dimensions': 1536
      }
    }
  ]
})

# Create the sales vector index
db.command({
  'createIndexes': 'sales',
  'indexes': [
    {
      'name': 'VectorSearchIndex',
      'key': {
        "contentVector": "cosmosSearch"
      },
      'cosmosSearchOptions': {
        'kind': 'vector-ivf',
        'numLists': 1,
        'similarity': 'COS',
        'dimensions': 1536
      }
    }
  ]
})

{'raw': {'defaultShard': {'numIndexesBefore': 1,
   'numIndexesAfter': 2,
   'createdCollectionAutomatically': False,
   'ok': 1}},
 'ok': 1}

## Use vector search in vCore-based Azure Cosmos DB for MongoDB

Now that each document has its associated vector embedding and the vector indexes have been created on each collection, we can now use the vector search capabilities of vCore-based Azure Cosmos DB for MongoDB.

In [76]:
def vector_search(collection_name, query, num_results=3):
    """
    Perform a vector search on the specified collection by vectorizing
    the query and searching the vector index for the most similar documents.

    returns a list of the top num_results most similar documents
    """
    collection = db[collection_name]
    query_embedding = generate_embeddings(query)    
    pipeline = [
        {
            '$search': {
                "cosmosSearch": {
                    "vector": query_embedding,
                    "path": "contentVector",
                    "k": num_results
                },
                "returnStoredSource": True }},
        {'$project': { 'similarityScore': { '$meta': 'searchScore' }, 'document' : '$$ROOT' } }
    ]
    results = collection.aggregate(pipeline)
    return results

def print_product_search_result(result):
    '''
    Print the search result document in a readable format
    '''
    print(f"Similarity Score: {result['similarityScore']}")  
    print(f"Name: {result['document']['name']}")   
    print(f"Category: {result['document']['categoryName']}")
    print(f"SKU: {result['document']['categoryName']}")
    print(f"_id: {result['document']['_id']}\n")

In [77]:
query = "What bikes do you have?"
results = vector_search("products", query, num_results=4)
for result in results:
    print_product_search_result(result)   

Similarity Score: 0.767089582470373
Name: Road-750 Black, 48
Category: Bikes, Road Bikes
SKU: Bikes, Road Bikes
_id: 2595584F-EA4E-4D45-948E-99A17AF8C519

Similarity Score: 0.7650260697297464
Name: Road-550-W Yellow, 40
Category: Bikes, Road Bikes
SKU: Bikes, Road Bikes
_id: 3A70EDD4-6C8C-44AA-A13D-49D0F6058699

Similarity Score: 0.7647035349817141
Name: Mountain-300 Black, 48
Category: Bikes, Mountain Bikes
SKU: Bikes, Mountain Bikes
_id: E8767BC9-D6BA-47FC-9842-3511468869B6

Similarity Score: 0.7634958538267828
Name: Road-550-W Yellow, 48
Category: Bikes, Road Bikes
SKU: Bikes, Road Bikes
_id: 26E8185C-782A-4B48-87FA-1E715E3825FB



In [78]:
query = "What do you have that is yellow?"
results = vector_search("products", query, num_results=4)
for result in results:
    print_product_search_result(result)   

Similarity Score: 0.7423481209847724
Name: Road-550-W Yellow, 48
Category: Bikes, Road Bikes
SKU: Bikes, Road Bikes
_id: 26E8185C-782A-4B48-87FA-1E715E3825FB

Similarity Score: 0.7406564796362327
Name: Road-350-W Yellow, 40
Category: Bikes, Road Bikes
SKU: Bikes, Road Bikes
_id: 9E5C74FD-F685-45AE-A799-D67EFB5C28A1

Similarity Score: 0.7381368591622282
Name: Road-550-W Yellow, 40
Category: Bikes, Road Bikes
SKU: Bikes, Road Bikes
_id: 3A70EDD4-6C8C-44AA-A13D-49D0F6058699

Similarity Score: 0.7364529842520398
Name: LL Touring Frame - Yellow, 62
Category: Components, Touring Frames
SKU: Components, Touring Frames
_id: 91AA100C-D092-4190-92A7-7C02410F04EA



## Use vector search results in a RAG pattern with Chat GPT-3.5

In [79]:
# A system prompt describes the responsibilities, instructions, and persona of the AI.
system_prompt = """
You are a helpful, fun and friendly sales assistant for Cosmic Works, a bicycle and bicycle accessories store. 
Your name is Cosmo.
You are designed to answer questions about the products that Cosmic Works sells.

Only answer questions related to the information provided in the list of products below that are represented
in JSON format.

If you are asked a question that is not in the list, respond with "I don't know."

List of products:
"""

In [85]:
def rag_with_vector_search(question: str, num_results: int = 3):
    """
    Use the RAG model to generate a prompt using vector search results based on the
    incoming question.  
    """
    # perform the vector search and build product list
    results = vector_search("products", question, num_results=num_results)
    product_list = ""
    for result in results:
        if "contentVector" in result["document"]:
            del result["document"]["contentVector"]
        product_list += json.dumps(result["document"], indent=4, default=str) + "\n\n"


    # print("product list: ", product_list)
    # generate prompt for the LLM with vector results
    formatted_prompt = system_prompt + product_list

    # prepare the LLM request
    messages = [
        {"role": "system", "content": formatted_prompt},
        {"role": "user", "content": question}
    ]

    completion = ai_client.chat.completions.create(messages=messages, model=COMPLETIONS_DEPLOYMENT_NAME)
    return completion.choices[0].message.content

In [86]:
print(rag_with_vector_search("What bikes do you have?", 5))

We have the following bikes:
1. Road-750 Black, 48
2. Road-550-W Yellow, 40
3. Mountain-300 Black, 48
4. Road-550-W Yellow, 48
5. Road-650 Black, 48

Let me know if you need any information about any of these bikes!


In [87]:
print(rag_with_vector_search("What are the names and skus of yellow products?", 5))

The yellow products we have are:

1. Road-550-W Yellow, 48 (sku: BK-R64Y-48)
2. Road-550-W Yellow, 40 (sku: BK-R64Y-40)
3. ML Road Frame-W - Yellow, 48 (sku: FR-R72Y-48)
4. Road-350-W Yellow, 48 (sku: BK-R79Y-48)
5. Touring-3000 Yellow, 62 (sku: BK-T18Y-62)
