# Building AI-powered search in Amazon DocumentDB using Amazon Bedrock and DocumentDB Vector Search
_**Using the Bedrock Titan embedding model and DocumentDB `Vector Search` for similarity search on an Auto product catalog**_

---

## Contents


1. [Background](#Background)
1. [Setup](#Setup)
1. [Bedrock Model Call Preparation](#Bedrock-Model-Call-Preparation)
1. [DocumentDB Vector Search](#DocumentDB-Vector-Search)
1. [Evaluate Search Results](#Evaluate-DocumentDB-vector-Search-Results)

## Background

In this notebook, we'll build the core components of a search for textually similar products. People often don't know exactly what they are looking for; in that case they simply type an item description and hope the search retrieves similar items.

One of the core components of searching for textually similar items is a fixed-length sentence or word embedding, i.e. a “feature vector” that corresponds to the text. The reference embeddings are typically generated offline and must be stored so they can be searched efficiently. In this use case we generate them with [Amazon Bedrock Titan](https://aws.amazon.com/cn/bedrock/titan/).

To enable efficient searches for textually similar items, we'll use Amazon Bedrock Titan to generate fixed-length sentence embeddings (i.e. “feature vectors”) and run nearest neighbor searches over them with Vector Search in Amazon DocumentDB (with MongoDB compatibility). DocumentDB Vector Search lets you store points in vector space and find the "nearest neighbors" of a given point. Use cases include recommendations (for example, an "other songs you might like" feature in a music application), image recognition, and fraud detection.
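
To make "nearest neighbors" concrete, here is a minimal brute-force sketch in plain Python. The three-dimensional vectors and their labels are made up for illustration; real Titan embeddings are much longer, and DocumentDB performs the search inside the database rather than in a Python loop.

```python
import math

def nearest_neighbors(query, vectors, k=2):
    """Return the k labels whose vectors lie closest to the query (Euclidean distance)."""
    ranked = sorted(vectors, key=lambda label: math.dist(query, vectors[label]))
    return ranked[:k]

# Toy 3-dimensional "embeddings", invented for this example only.
toy_vectors = {
    "camper van": [0.9, 0.1, 0.0],
    "sports car": [0.1, 0.9, 0.0],
    "pickup truck": [0.6, 0.2, 0.5],
}

print(nearest_neighbors([1.0, 0.0, 0.1], toy_vectors, k=2))
# ['camper van', 'pickup truck']
```

Comparing the query against every stored vector works for a toy catalog, but the cost grows with the catalog size, which is why a vector index matters for larger datasets.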

Here are the steps we'll follow to build the search for textually similar items. After some initial setup, we'll call the Bedrock Titan embedding model to generate feature vectors for the Auto products dataset taken from [this website](https://mechanicbase.com/cars/different-car-models-types). Those feature vectors will be stored in DocumentDB and indexed with a vector index. Next, we'll explore some sample text queries and visualize the results.

## Setup
Install the required Python libraries for the workshop.


In [None]:
!pip install -U pymongo tqdm boto3 requests scikit-image matplotlib

### Downloading Auto demo data

The dataset consists of 29 high-resolution images, each depicting a type of car. Each image has five textual descriptions.

**Source**: the data originally comes from https://mechanicbase.com/cars/different-car-models-types

In [None]:
import json
import os

filename = 'metadata.json'

# Check that the metadata file exists before trying to read it
if not os.path.exists(filename):
    raise FileNotFoundError(f"{filename} does not exist")

# Load the product metadata (one entry per car image)
with open(filename) as json_file:
    results = json.load(json_file)

# Inspect the first entry
print(results[0])
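
Judging from how later cells read the data, each entry in `metadata.json` appears to carry a `url` string and a `descriptions` list. The sketch below is a small sanity check built on that assumption; the `check_record` helper and the sample entry are hypothetical and not part of the dataset.

```python
def check_record(record):
    """Return a list of problems found in one metadata entry (empty if none)."""
    problems = []
    if not isinstance(record.get("url"), str):
        problems.append("missing or non-string 'url'")
    descriptions = record.get("descriptions")
    if not isinstance(descriptions, list) or not descriptions:
        problems.append("missing or empty 'descriptions' list")
    elif not all(isinstance(d, str) for d in descriptions):
        problems.append("non-string entry in 'descriptions'")
    return problems

# Hypothetical entry, invented for illustration only.
sample = {
    "url": "https://example.com/images/camper.jpg",
    "descriptions": ["A van fitted out for camping.", "A vehicle with a sleeping area."],
}

print(check_record(sample))         # prints []
print(check_record({"url": None}))  # reports two problems
```

Running `[check_record(r) for r in results]` before generating embeddings would surface malformed entries early, before any Bedrock calls are made.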

## Bedrock Model Call Preparation
Prepare for the Bedrock Titan model call.

In [None]:
# for bedrock model call
%pip install langchain==0.0.305 --force-reinstall

In [None]:
#for bedrock model call
import json
import os
import sys

import boto3

bedrock_client = boto3.client("bedrock-runtime", region_name="us-east-1")

In [None]:
# for bedrock model call
from langchain.embeddings import BedrockEmbeddings

# Titan text embeddings v1 returns 1536-dimensional vectors
bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1",
                                       client=bedrock_client)

In [None]:
# for Bedrock Embedding model call

def generate_embeddings(data):
    """Return the Titan embedding (a list of floats) for the given text."""
    return bedrock_embeddings.embed_query(data)
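
For readers who prefer not to depend on LangChain, the same embedding can probably be requested directly through the Bedrock runtime client. The sketch below assumes the Titan text embedding request body uses an `inputText` field and the response JSON carries an `embedding` field; please verify both against the current Bedrock documentation. The `titan_embedding` helper is a hypothetical name, and the client is passed in so the function does not create one itself.

```python
import json

def titan_embedding(client, text, model_id="amazon.titan-embed-text-v1"):
    """Embed one string using a Bedrock runtime client supplied by the caller."""
    response = client.invoke_model(
        modelId=model_id,
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": text}),  # assumed request shape for Titan embeddings
    )
    # The response body is a stream; read it and parse the JSON payload.
    payload = json.loads(response["body"].read())
    return payload["embedding"]  # assumed response field
```

With the `bedrock_client` created earlier, a call would look like `titan_embedding(bedrock_client, "Camping car")`.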


## DocumentDB Vector Search

Vector search for Amazon DocumentDB (with MongoDB compatibility) is a built-in capability that lets you store, index, and search millions of vectors with millisecond response times within your document database.

One of the key benefits of using DocumentDB Vector Search is that it allows you to perform similarity searches on large datasets quickly and efficiently. This is particularly useful in industries like e-commerce, where businesses need to search large product catalogs quickly to find the items that best match a customer's preferences. It supports nearest neighbor search using Euclidean (L2) distance, dot product (inner product), and cosine similarity.
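
The three measures mentioned above are standard formulas. The plain-Python sketch below shows how each is computed on small made-up vectors; it is meant only to illustrate the metrics, not how DocumentDB evaluates them internally.

```python
import math

def euclidean(a, b):
    """L2 distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot_product(a, b):
    """Inner product: larger means more similar (sensitive to vector length)."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between the vectors: 1.0 means same direction."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)

a = [1.0, 0.0]
b = [0.0, 2.0]
c = [3.0, 0.0]

print(euclidean(a, c))          # 2.0
print(dot_product(a, c))        # 3.0
print(cosine_similarity(a, c))  # 1.0 (same direction, different length)
print(cosine_similarity(a, b))  # 0.0 (perpendicular)
```

Note that `a` and `c` point in the same direction, so cosine similarity treats them as identical even though their Euclidean distance is 2.0. Which metric fits best depends on whether vector length carries meaning for the embeddings in use.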

To further optimize your searches, you can use DocumentDB Vector Search's indexing features. Creating an index on your vector data reduces the time it takes to find the nearest neighbors of a given vector.
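
The `lists` option passed to `create_index` in this notebook suggests an inverted-file style index, where vectors are grouped into lists and a query scans only the most promising list. The sketch below illustrates that general idea in plain Python under simplifying assumptions: the centroids are fixed by hand, only one list is probed, and all names are hypothetical. It is not a description of DocumentDB's internals.

```python
import math

def nearest_centroid(vector, centroids):
    """Index of the centroid closest to the vector."""
    return min(range(len(centroids)), key=lambda i: math.dist(vector, centroids[i]))

def build_lists(vectors, centroids):
    """Assign every vector id to the list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        lists[nearest_centroid(vec, centroids)].append(vec_id)
    return lists

def search(query, vectors, centroids, lists, k=1):
    """Scan only the list whose centroid is nearest to the query."""
    candidates = lists[nearest_centroid(query, centroids)]
    return sorted(candidates, key=lambda vec_id: math.dist(query, vectors[vec_id]))[:k]

vectors = {"a": [0.0, 0.1], "b": [0.2, 0.0], "c": [5.0, 5.1], "d": [4.9, 5.0]}
centroids = [[0.0, 0.0], [5.0, 5.0]]  # chosen by hand; a real index would learn them

lists = build_lists(vectors, centroids)
print(lists)                                                # {0: ['a', 'b'], 1: ['c', 'd']}
print(search([4.8, 5.0], vectors, centroids, lists, k=1))   # ['d']
```

Because only one list is scanned, the search touches half of the vectors here. The trade-off is that a true neighbor sitting in a different list would be missed, which is why this kind of search is called approximate.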

In this step we'll take all the product descriptions in the Auto dataset, generate an embedding for each product, and store those embeddings in DocumentDB alongside the product data.

In [None]:
# Set up a connection to your Amazon DocumentDB (with MongoDB compatibility) cluster and create the database
import pymongo

# Replace the endpoint and credentials below with those of your own cluster
client = pymongo.MongoClient(
    "docdb-vector-search-lab.cluster-ckutjdcz3cie.us-east-1.docdb.amazonaws.com:27017",
    username="masteruser",
    password="Password1",
    retryWrites=False,
    tls=True,
    tlsCAFile='global-bundle.pem')
db = client.similarity6
collection = db.products

In [None]:
# Embed each product's descriptions and store the record in DocumentDB
for x in results:
    description = ' '.join(x.get('descriptions', []))
    vector = generate_embeddings(description)
    record = {"description": description,
              "url": x.get('url'),
              "descriptions_embeddings": vector}
    collection.insert_one(record)
    print("inserted", record["url"])

# Create the vector index; dimensions must match the embedding length (1536 for Titan v1)
collection.create_index([("descriptions_embeddings", "vector")],
                        vectorOptions={
                            "lists": 1,
                            "similarity": "euclidean",
                            "dimensions": 1536})
client.close()

print("Vector embeddings have been successfully loaded into DocumentDB")

## Evaluate DocumentDB vector Search Results

In this step we will use the Bedrock Titan embedding model to generate an embedding for the query text, use that embedding to search DocumentDB for the nearest neighbors, and retrieve the relevant product images.


In [None]:
from skimage import io
import matplotlib.pyplot as plt
import pymongo

def similarity_search(search_text):
    """Embed the query text, find the nearest products, and display their images."""
    client = pymongo.MongoClient(
        "docdb-vector-search-lab.cluster-ckutjdcz3cie.us-east-1.docdb.amazonaws.com:27017",
        username="masteruser",
        password="Password1",
        retryWrites=False,
        tls=True,
        tlsCAFile='global-bundle.pem')
    db = client.similarity6
    collection = db.products

    # Embed the query with the same model used for the product descriptions
    print({"inputs": search_text})
    query_vector = generate_embeddings(search_text)

    # Ask DocumentDB for the 2 nearest neighbors of the query vector
    query = {"vectorSearch": {"vector": query_vector, "path": "descriptions_embeddings", "similarity": "euclidean", "k": 2}}
    projection = {
        "_id": 0,
        "url": 1,
        "description": 1}
    r = collection.aggregate([{'$search': query}, {"$project": projection}])

    plt.rcParams["figure.figsize"] = [7.50, 3.50]
    plt.rcParams["figure.autolayout"] = True

    # Download and display the image behind each result URL
    for x in r:
        url = x["url"].split('?')[0]
        plt.imshow(io.imread(url))
        plt.axis('off')
        plt.show()

    client.close()

Using the `similarity_search` function above, let's run a sample search.

In [None]:
similarity_search("Camping car")
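
The query stage used inside `similarity_search` can also be isolated into a small helper, which makes it easier to vary `k` or the similarity metric without editing the function. The sketch below only builds the aggregation pipeline and does not connect to a database; `build_vector_search_pipeline` and its parameters are hypothetical names, while the stage layout mirrors the query used in this notebook.

```python
def build_vector_search_pipeline(query_vector, k=2, path="descriptions_embeddings",
                                 similarity="euclidean",
                                 include_fields=("url", "description")):
    """Build the aggregation pipeline for a vector search without touching the database."""
    search_stage = {
        "$search": {
            "vectorSearch": {
                "vector": list(query_vector),
                "path": path,
                "similarity": similarity,
                "k": k,
            }
        }
    }
    # Return only the requested fields and hide the document id.
    projection = {"_id": 0}
    projection.update({field: 1 for field in include_fields})
    return [search_stage, {"$project": projection}]

pipeline = build_vector_search_pipeline([0.1, 0.2, 0.3], k=5)
print(pipeline)
```

The result can be passed straight to `collection.aggregate(pipeline)`. Keeping the similarity value consistent with the one used when the index was created is likely to matter, so it is exposed as a parameter rather than hard-coded.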