# Vector Search Basics

# Create the Vector Search Index
Create a schema and load the table with data, including embeddings we generate through the OpenAI Embedding API.

This notebook uses a table called `products_table`.

## Imports

In [1]:
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.query import dict_factory
from cassandra.query import SimpleStatement
import openai
import pandas as pd

## Keys & Environment Variables

In [2]:
import os
from dotenv import load_dotenv

load_dotenv()

# Astra DB
ASTRA_DB_KEYSPACE = os.environ['ASTRA_DB_KEYSPACE']
ASTRA_DB_SECURE_BUNDLE_PATH = os.environ['ASTRA_DB_SECURE_BUNDLE_PATH']
ASTRA_DB_APPLICATION_TOKEN = os.environ['ASTRA_DB_APPLICATION_TOKEN']

# OpenAI Token
openai_api_key = os.environ['OPENAI_API_KEY']
openai.api_key = openai_api_key

## Select a model to compute embeddings

Embeddings are numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts.

This new embedding model from openAI - `text-embedding-ada-002` - replaces five separate models for text search, text similarity, and code search, and outperforms our previous most capable model, Davinci, at most tasks, while being priced 99.8% lower.

In [3]:
model_id = "text-embedding-ada-002"

## Connect to Astra DB

In [4]:
cloud_config= {
  'secure_connect_bundle': ASTRA_DB_SECURE_BUNDLE_PATH
}
auth_provider = PlainTextAuthProvider('token', ASTRA_DB_APPLICATION_TOKEN)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect()
session.set_keyspace(ASTRA_DB_KEYSPACE)
session

<cassandra.cluster.Session at 0x1116c97c0>

## Database Schema

> **Note:** The following blocks only need be run when you create the schema. Otherwise use them at your dicretion.

Note the data type `vector` in the schema below.

### Drop Schema

> **Note:** Only run this block when you want to DROP the schema.

In [5]:
# only use this to DROP the schema
session.execute(f"""DROP INDEX IF EXISTS openai_desc""")
session.execute(f"""DROP INDEX IF EXISTS minilm_desc""")

session.execute(f"""DROP TABLE IF EXISTS products_table""")

<cassandra.cluster.ResultSet at 0x132891fa0>

### Create Schema

> **Note:** Only run this block when you want to CREATE the schema.

In [6]:
# CREATE the schema

session.execute(f"""CREATE TABLE IF NOT EXISTS products_table
(product_id int,
 chunk_id int,

 product_name text,
 description text,
 price text,

 openai_description_embedding vector<float, 1536>,
 minilm_description_embedding vector<float, 384>,

 PRIMARY KEY (product_id, chunk_id))""")

# # Create Index
session.execute(f"""CREATE CUSTOM INDEX IF NOT EXISTS openai_desc ON products_table (openai_description_embedding) USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'""")
session.execute(f"""CREATE CUSTOM INDEX IF NOT EXISTS minilm_desc ON products_table (minilm_description_embedding) USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'""")

<cassandra.cluster.ResultSet at 0x1328599a0>

## Create embeddings and Store in DB 

### Read CSV file

In [7]:
products_list = pd.read_csv('ProductDataset.csv')
products_list

Unnamed: 0,product_id,product_name,description,price
0,37162,Canon PIXMA Photo All-In-One Printer - MP620,Canon PIXMA Photo All-In-One Printer - MP620/ ...,$149.00
1,37174,TiVo HD XL Black Digital Video Recorder - TCD6...,TiVo HD XL Black Digital Video Recorder - TCD6...,$599.00
2,37181,Apple 8GB Black 2nd Generation iPod Touch - MB...,Apple 8GB Black 2nd Generation iPod Touch - MB...,$229.00
3,37182,Apple 16GB Black 2nd Generation iPod Touch - M...,Apple 16GB Black 2nd Generation iPod Touch - M...,$299.00
4,37183,Apple 32GB Black 2nd Generation iPod Touch - M...,Apple 32GB Black 2nd Generation iPod Touch - M...,$399.00
...,...,...,...,...
167,39088,Logitech Cordless Desktop Wave Keyboard And Mo...,Logitech Cordless Desktop Wave Keyboard And Mo...,$79.00
168,39090,Mitsubishi DLP Black TV Stand - MBS73V,Mitsubishi DLP Black TV Stand - MBS73V/ Matchi...,$549.00
169,39175,Logitech Digital Precision PC Gaming Headset -...,Logitech Digital Precision PC Gaming Headset -...,$49.00
170,39176,Logitech 2.1 Multimedia Silver Speaker System ...,Logitech 2.1 Multimedia Silver Speaker System ...,


### Create embeddings and insert into database

> **Note:** You only need to run this block once after creating the database schema.

In [8]:
# Create Embedding for each conversation row, save them to the database
for id, row in products_list.iterrows():

  # break Description data into chunks of 2500 characters
  text_chunk_length = 2500
  text_chunks = [row.description[i:i + text_chunk_length] for i in range(0, len(row.description), text_chunk_length)]
  
  for chunk_id, chunk in enumerate(text_chunks):
    
    # Append Price to Description Data 
    pricevalue = row.price if isinstance(row.price, str) else ""
    full_chunk = f"{chunk} price: {pricevalue}"

    # Create an embedding using OpenAI API
    embedding = openai.Embedding.create(input=full_chunk, model=model_id)['data'][0]['embedding']

    # Insert Data and Embedding into database
    query = SimpleStatement(
                f"""
                INSERT INTO products_table
                (product_id, chunk_id, product_name, description, price, openai_description_embedding)
                VALUES (%s, %s, %s, %s, %s, %s)
                """
            )
    session.execute(query, (row.product_id, chunk_id, row.product_name, row.description, pricevalue, embedding))



---


# Use the index

In the steps up to this point, we have been creating a schema and loading the table with data, including embeddings we generated through the OpenAI Embedding API.
Now we are going to query that table and use the results to give ChatGPT some context to support it's response.

## Convert a query string into a text embedding to use as part of the query

This is where the real fun starts.  Provide a question or request to be used as the query.  The source sample database is mostly consumer electronics and appliances, so imagine you're talking to a customer service rep at Best Buy or another electronics store.

Here we use the same API that we used to calculate embeddings for each row in the database, but this time we are using your input question to calculate a vector to use in a query.

In [9]:
customer_input = "What camera is suitable for a beginner photgrapher"
embedding = openai.Embedding.create(input=customer_input, model=model_id)['data'][0]['embedding']
display(embedding)

[0.0006933565600775182,
 0.010056542232632637,
 -0.017224783077836037,
 -0.012419698759913445,
 -0.02759641222655773,
 0.01915469393134117,
 -0.005514031276106834,
 -0.044899966567754745,
 -0.0019381162710487843,
 -0.024694982916116714,
 -0.0050282711163163185,
 0.011435050517320633,
 -0.008802756667137146,
 -0.008783063851296902,
 0.00614092405885458,
 0.004444046411663294,
 0.013509375974535942,
 0.0012111174874007702,
 0.0009518268052488565,
 -0.013982007279992104,
 -0.011835473589599133,
 0.023854749277234077,
 0.011710751801729202,
 -0.0033674975857138634,
 -0.012045532464981079,
 0.0014096882659941912,
 0.019916154444217682,
 -0.0024238761980086565,
 -0.008513926528394222,
 -0.011796087957918644,
 0.03022214211523533,
 -0.015229228883981705,
 -0.01998179778456688,
 -0.03985856845974922,
 0.003685867181047797,
 -0.006390368100255728,
 0.017014725133776665,
 0.010923032648861408,
 0.014848497696220875,
 0.0011060883989557624,
 0.031613778322935104,
 0.01891837827861309,
 -0.0027520

## Find the top 5 results using ANN Similarity

Let's take a look at what a query against a vector index could look like.  The query vector has the same dimensions (number of entries in the list) as the embeddings we generated a few steps ago for each row in the database.

In [10]:
query = SimpleStatement(
    f"""
    SELECT *
    FROM products_table
    ORDER BY openai_description_embedding ANN OF {embedding} LIMIT 5;
    """
    )
#display(query)

results = session.execute(query)
top_5_products = results._current_rows

for row in top_5_products:
  print(f"""{row.product_id}, {row.product_name}, {row.description}\n""")

38388, Canon Digital EOS Rebel XS Starter Kit - 9320A010, Canon Digital EOS Rebel XS Starter Kit - 9320A010/ Includes Digital Gadget Bag 200DG, Battery Pack NB-2LH, 58mm UV Haze Filter

37469, Canon PowerShot Black 14.7 Megapixel Digital Camera - G10, Canon PowerShot Black 14.7 Megapixel Digital Camera - G10/ 14.7 Megapixel/ 5x Optical Zoom/ Optical Image Stabilizer/ DIGIC 4 Image Processor/ Face Detection Self-Timer/ Intelligent Contrast Correction/ 3.0' PureColor LCD/ Print/Share Button/ Black Finish

37404, Canon Black EOS 50D Digital SLR Camera With 28-135MM Lens - 50D28135, Canon Black EOS 50D Digital SLR Camera With 28-135MM Lens - 50D28135/ 15.1 Megapixel CMOS Sensor/ DIGIC 4 Image Processor/ 3.0' Clear View LCD/ 9 Cross-Type High-Precision Sensors/ Enhanced Live View/ EOS Integrated Cleaning System/ Creative Auto/ HDMI Output/ Lens Peripheral Illumination Correction/ Black Finish

37403, Canon Black EOS 50D Digital SLR Camera Body - EOS50DBODY, Canon Black EOS 50D Digital SLR C

## Ask ChatGPT for some help

- Here we build a prompt with which we'll query ChatGPT.  Note the "roles" in this little conversation give the LLM more context about who that part of the conversation is coming from.
- This may take 10-20 seconds to return, so be patient.

In [12]:

message_objects = []

# With the role as 'system',  we tell the model how we want it to behave and tell it how its personality and type of response should be.
message_objects.append({"role":"system",
                        "content":"You're a chatbot helping customers with questions and helping them with product recommendations"})


# With the role as 'user',  pass the question from user.
message_objects.append({"role":"user",
                        "content": customer_input})

message_objects.append({"role":"user",
                        "content": "Please give me a detailed explanation of your recommendations"})

message_objects.append({"role":"user",
                        "content": "Please be friendly and talk to me like a person, don't just give me a list of recommendations"})

message_objects.append({"role":"user",
                        "content":"The computer component itself should be one from the recommended products I will provide"})


# With the role as 'assistant',  load the results from Astra with Vector Search.  That helps the model to provide answer to the question asked by user.
message_objects.append({"role": "assistant",
                        "content": "I found these 5 products I would recommend"})

products_list = []

for row in top_5_products:
    brand_dict = {'role': "assistant", "content": f"{row.description}"}
    products_list.append(brand_dict)

message_objects.extend(products_list)
message_objects.append({"role": "assistant", "content":"Here's my summarized recommendation of products, and why it would suit you:"})

completion = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=message_objects
)
print(completion.choices[0].message['content'])

Based on your needs as a beginner photographer, I would recommend the Canon EOS Rebel XS Starter Kit. This kit includes a digital gadget bag to safely carry your camera and accessories, a battery pack for extended shooting sessions, and a 58mm UV haze filter to protect your lens. The Canon Rebel XS is a great entry-level camera with its 10.1-megapixel sensor, 7-point autofocus system, and easy-to-use interface. It's perfect for learning the basics of photography and capturing high-quality images. With this kit, you'll have everything you need to start your photography journey.
