# Vector Search Descriptions

This notebook extends Vector_Search_Basics by:

1. Generating consumer-focused descriptions for each product using an LLM.
2. Creating embeddings from these consumer descriptions.
3. Storing the new embeddings.

This notebook uses a table called `products_table_ext`.

## Imports

In [2]:
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
from cassandra.query import dict_factory
from cassandra.query import SimpleStatement
import openai
import pandas as pd

## Keys & Environment Variables

In [3]:
import os
from dotenv import load_dotenv

load_dotenv()

# Astra DB
ASTRA_DB_KEYSPACE = os.environ['ASTRA_DB_KEYSPACE']
ASTRA_DB_SECURE_BUNDLE_PATH = os.environ['ASTRA_DB_SECURE_BUNDLE_PATH']
ASTRA_DB_APPLICATION_TOKEN = os.environ['ASTRA_DB_APPLICATION_TOKEN']

# OpenAI Token
openai_api_key = os.environ['OPENAI_API_KEY']
openai.api_key = openai_api_key
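The cell above raises a bare `KeyError` on the first missing variable. As a minimal sketch, a check like the following could report every missing name at once before connecting. The helper `missing_env_vars` and the `REQUIRED_VARS` list are hypothetical names introduced here for illustration; they are not part of the original notebook.

```python
import os

# Names read by this notebook; the list itself is an illustrative helper.
REQUIRED_VARS = [
    "ASTRA_DB_KEYSPACE",
    "ASTRA_DB_SECURE_BUNDLE_PATH",
    "ASTRA_DB_APPLICATION_TOKEN",
    "OPENAI_API_KEY",
]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# An empty list means every required variable is present.
print(missing_env_vars(REQUIRED_VARS))
```

Running this right after `load_dotenv()` would surface a misconfigured `.env` file before any network call is made.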

## Select a model to compute embeddings

Embeddings are numerical representations of text: each input is mapped to a fixed-length sequence of numbers, and inputs with related meanings map to nearby vectors. This makes it straightforward to compare concepts computationally.

OpenAI's `text-embedding-ada-002` model replaces five separate models for text search, text similarity, and code search. According to OpenAI, it outperforms their previous most capable embedding model, Davinci, at most tasks while being priced 99.8% lower.
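To make the idea of "related concepts have nearby vectors" concrete, here is a small sketch that compares toy vectors with cosine similarity. The three-dimensional vectors are invented for illustration only; real `text-embedding-ada-002` embeddings have 1536 entries and come from the API, not from hand-written lists.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (assumed values, not model output).
camera = [0.9, 0.1, 0.0]
dslr = [0.8, 0.2, 0.1]
keyboard = [0.0, 0.2, 0.9]

# Related items score higher than unrelated ones.
print(cosine_similarity(camera, dslr) > cosine_similarity(camera, keyboard))
```

The same comparison, applied to real embeddings, is what the vector index performs at scale later in this notebook.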

In [4]:
model_id = "text-embedding-ada-002"

## Connect to Astra DB

In [5]:
cloud_config = {
    'secure_connect_bundle': ASTRA_DB_SECURE_BUNDLE_PATH
}
auth_provider = PlainTextAuthProvider('token', ASTRA_DB_APPLICATION_TOKEN)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect()
session.set_keyspace(ASTRA_DB_KEYSPACE)
session

<cassandra.cluster.Session at 0x107a302b0>

## Database Schema

> **Note:** The following blocks only need to be run when you create the schema. Otherwise, use them at your discretion.

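Note the `vector` data type in the schema below: each embedding column is declared as `vector<float, 1536>`, matching the 1536-dimensional output of `text-embedding-ada-002`.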

### Drop Schema

> **Note:** Only run this block when you want to DROP the schema.

In [6]:
# Only use this to DROP the schema
session.execute("DROP INDEX IF EXISTS product_desc")
session.execute("DROP INDEX IF EXISTS consumer_desc")
session.execute("DROP INDEX IF EXISTS combined_desc")

session.execute("DROP TABLE IF EXISTS products_table_ext")

<cassandra.cluster.ResultSet at 0x15b5fe8e0>

### Create Schema

> **Note:** Only run this block when you want to CREATE the schema.

In [7]:
# CREATE the schema

session.execute(f"""CREATE TABLE IF NOT EXISTS products_table_ext
(product_id int,
 chunk_id int,

 product_name text,
 description text,
 consumer_description text,
 price text,
 
 product_description_embedding vector<float, 1536>,
 consumer_description_embedding vector<float, 1536>,
 combined_description_embedding vector<float, 1536>,

 PRIMARY KEY (product_id, chunk_id))""")

# Create Index
session.execute("CREATE CUSTOM INDEX IF NOT EXISTS product_desc ON products_table_ext (product_description_embedding) USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'")
session.execute("CREATE CUSTOM INDEX IF NOT EXISTS consumer_desc ON products_table_ext (consumer_description_embedding) USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'")
session.execute("CREATE CUSTOM INDEX IF NOT EXISTS combined_desc ON products_table_ext (combined_description_embedding) USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'")


<cassandra.cluster.ResultSet at 0x15b711670>
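Because the schema above fixes every embedding column at 1536 floats, a vector of any other length cannot be stored in those columns. A minimal sketch of a guard that could run before each insert is shown below; `check_dimension` and `EMBEDDING_DIMENSION` are hypothetical names added for illustration and do not appear in the original notebook.

```python
EMBEDDING_DIMENSION = 1536  # matches vector<float, 1536> in the schema

def check_dimension(vector, expected=EMBEDDING_DIMENSION):
    """Return the vector unchanged, or raise if its length is wrong."""
    if len(vector) != expected:
        raise ValueError(f"expected {expected} values, got {len(vector)}")
    return vector

# A correctly sized placeholder vector passes through untouched.
placeholder = [0.0] * EMBEDDING_DIMENSION
print(len(check_dimension(placeholder)))
```

A check like this is mainly useful if the embedding model is ever swapped for one with a different output size.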

## Create embeddings and store in DB

### Read CSV file

In [8]:
products_list = pd.read_csv('ProductDataset.csv')
products_list

| Unnamed: 0 | product_id | product_name | description | price |
|---|---|---|---|---|
| 0 | 37162 | Canon PIXMA Photo All-In-One Printer - MP620 | Canon PIXMA Photo All-In-One Printer - MP620/ ... | $149.00 |
| 1 | 37174 | TiVo HD XL Black Digital Video Recorder - TCD6... | TiVo HD XL Black Digital Video Recorder - TCD6... | $599.00 |
| 2 | 37181 | Apple 8GB Black 2nd Generation iPod Touch - MB... | Apple 8GB Black 2nd Generation iPod Touch - MB... | $229.00 |
| 3 | 37182 | Apple 16GB Black 2nd Generation iPod Touch - M... | Apple 16GB Black 2nd Generation iPod Touch - M... | $299.00 |
| 4 | 37183 | Apple 32GB Black 2nd Generation iPod Touch - M... | Apple 32GB Black 2nd Generation iPod Touch - M... | $399.00 |
| ... | ... | ... | ... | ... |
| 167 | 39088 | Logitech Cordless Desktop Wave Keyboard And Mo... | Logitech Cordless Desktop Wave Keyboard And Mo... | $79.00 |
| 168 | 39090 | Mitsubishi DLP Black TV Stand - MBS73V | Mitsubishi DLP Black TV Stand - MBS73V/ Matchi... | $549.00 |
| 169 | 39175 | Logitech Digital Precision PC Gaming Headset -... | Logitech Digital Precision PC Gaming Headset -... | $49.00 |
| 170 | 39176 | Logitech 2.1 Multimedia Silver Speaker System ... | Logitech 2.1 Multimedia Silver Speaker System ... | |


### Generate consumer descriptions from product names using OpenAI

In [9]:
products_list['consumer_description'] = ""

# Iterate over products
for idx, row in products_list.iterrows():  # `idx` avoids shadowing the built-in `id`

    print(row.product_name)

    ### GENERATE CONSUMER DESCRIPTION ###
    print("    - generating consumer description")

    # Create prompt (the model sees only the product name, not the full description)
    message_objects = [{
        "role": "user",
        "content": f"Provide a single paragraph consumer level description of the product: {row.product_name}"
    }]

    # Generate consumer description
    # (openai.ChatCompletion is the interface of the pre-1.0 `openai` package)
    completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=message_objects)
    consumer_description = completion.choices[0].message['content']

    # Update DataFrame with completion
    products_list.at[idx, 'consumer_description'] = consumer_description


    ### GENERATE EMBEDDINGS ###
    print ("    - generating embeddings")
    
    # Get price
    pricevalue = row.price if isinstance(row.price, str) else ""

    # append price to description
    original = f"{row.description} price: {pricevalue}"
    # append price to consumer description
    consumer = f"{consumer_description} price: {pricevalue}"
    # append price to combined description
    combined = f"{consumer_description} {row.description} price: {pricevalue}"
    
    # Create embedding of the original description
    embedding = openai.Embedding.create(input=original, model=model_id)['data'][0]['embedding']
    # Create consumer embedding
    embedding_consumer = openai.Embedding.create(input=consumer, model=model_id)['data'][0]['embedding']
    # Create combined embedding
    embedding_combined = openai.Embedding.create(input=combined, model=model_id)['data'][0]['embedding']


    ### WRITE TO DATABASE ###
    print ("    - writing to database")
    
    # Insert Data and Embedding into database
    query = SimpleStatement(
                f"""
                INSERT INTO products_table_ext
                (product_id, chunk_id, product_name, description, consumer_description, price, product_description_embedding, consumer_description_embedding, combined_description_embedding)
                VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
                """
            )
    session.execute(query, (row.product_id, 0, row.product_name, row.description, consumer_description, pricevalue, embedding, embedding_consumer, embedding_combined))


Canon PIXMA Photo All-In-One Printer - MP620
    - generating consumer description
    - generating embeddings
    - writing to database
TiVo HD XL Black Digital Video Recorder - TCD658000
    - generating consumer description
    - generating embeddings
    - writing to database
Apple 8GB Black 2nd Generation iPod Touch - MB528LLA
    - generating consumer description
    - generating embeddings
    - writing to database
Apple 16GB Black 2nd Generation iPod Touch - MB531LLA
    - generating consumer description
    - generating embeddings
    - writing to database
Apple 32GB Black 2nd Generation iPod Touch - MB533LLA
    - generating consumer description
    - generating embeddings
    - writing to database
Apple 8GB Silver 4th Generation iPod Nano - MB598LLA
    - generating consumer description
    - generating embeddings
    - writing to database
Apple 8GB Blue 4th Generation iPod Nano - MB732LLA
    - generating consumer description
    - generating embeddings
    - writing to database
...
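The loop above builds three text variants per product and appends the price to each, substituting an empty string when the price is missing (pandas reads an empty CSV field as a float `NaN`, which is why the `isinstance(..., str)` check is used). The sketch below isolates that step so it can be tried without any API calls; `build_texts` is a hypothetical helper name, and the sample strings are invented.

```python
def build_texts(description, consumer_description, price):
    """Return (original, consumer, combined) texts with the price appended."""
    # A missing price arrives as a float NaN, so only keep real strings.
    price_text = price if isinstance(price, str) else ""
    original = f"{description} price: {price_text}"
    consumer = f"{consumer_description} price: {price_text}"
    combined = f"{consumer_description} {description} price: {price_text}"
    return original, consumer, combined

# Illustrative inputs: one row with a price, one without.
with_price = build_texts("Spec sheet text", "Friendly text", "$49.00")
without_price = build_texts("Spec sheet text", "Friendly text", float("nan"))
print(with_price[2])
print(repr(without_price[0]))
```

Each of the three strings is then embedded separately, which is what allows the three vector columns to be compared against each other in later queries.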

In [None]:
# Write new product file
# index=False avoids writing the DataFrame index as an extra unnamed column
products_list.to_csv('ProductDatasetCombined.csv', index=False)

## Convert a query string into a text embedding to use as part of the query

In [10]:
customer_input = "recommend a camera for novice photographer"
embedding = openai.Embedding.create(input=customer_input, model=model_id)['data'][0]['embedding']
display(embedding)

[0.0014159156708046794,
 0.006145323161035776,
 -0.02011679857969284,
 -0.004408023785799742,
 0.003301865654066205,
 0.013479849323630333,
 -0.008550303988158703,
 -0.0399678535759449,
 -0.0153201250359416,
 -0.02216302417218685,
 0.007367744110524654,
 -0.010344074107706547,
 -0.0032005507964640856,
 -0.012164418585598469,
 -0.01793769933283329,
 -0.0072016543708741665,
 0.017725104466080666,
 0.0014391682343557477,
 0.003193907206878066,
 -0.022136449813842773,
 -0.023425307124853134,
 0.02103361487388611,
 0.003346709767356515,
 -0.018057284876704216,
 -0.02005036175251007,
 -0.0021292713936418295,
 0.014230575412511826,
 -0.00876954197883606,
 -0.020090224221348763,
 -0.015067667700350285,
 0.03281934931874275,
 -0.0023302400950342417,
 -0.016967736184597015,
 -0.04613310843706131,
 -0.00969300139695406,
 0.011566494591534138,
 0.01152663305401802,
 0.004743525292724371,
 0.009885665960609913,
 -0.001876814872957766,
 0.027291879057884216,
 0.005564008839428425,
 -0.00594933703541

## Find the top 5 results using ANN Similarity

The query below runs an approximate nearest neighbor (ANN) search against the vector index on `consumer_description_embedding`. The query vector must have the same dimension (number of entries in the list, 1536 here) as the embeddings stored for each row in the database.
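For intuition, the following sketch performs the exact, brute-force version of the search that an ANN index approximates: score every row against the query vector and keep the best `k`. It is an illustration in plain Python with invented two-dimensional vectors, not how the database implements the search, and the raw dot product shown here may be scaled differently from the score the database returns.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vector, rows, k=5):
    """Return the k (score, name) pairs with the highest dot product."""
    scored = [(dot(query_vector, vector), name) for name, vector in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical rows: (name, embedding) with toy 2-dimensional vectors.
rows = [
    ("camera", [0.9, 0.1]),
    ("printer", [0.2, 0.8]),
    ("tripod", [0.7, 0.3]),
]
best = top_k([1.0, 0.0], rows, k=2)
print([name for _, name in best])
```

Brute force costs one comparison per row, which is why an index that trades a little accuracy for speed becomes attractive as the table grows.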

In [11]:
query = SimpleStatement(
    f"""
    SELECT product_id, product_name, description, consumer_description, price, similarity_dot_product(consumer_description_embedding, {embedding}) as sim
    FROM products_table_ext
    ORDER BY consumer_description_embedding ANN OF {embedding} LIMIT 5;
    """
    )
# display(query)

results = session.execute(query)
# Materialize the rows through the public iterator rather than the private _current_rows attribute
top_5_products = list(results)

for row in top_5_products:
    # print(row)
    print(f"{row.sim}: {row.product_name}\n{row.consumer_description}\n\n")

0.9217379093170166: Canon Digital EOS Rebel XS Starter Kit - 9320A010
The Canon Digital EOS Rebel XS Starter Kit is a fantastic entry-level camera package designed for beginner photographers. This kit includes the Canon EOS Rebel XS camera body, which boasts an impressive 10.1-megapixel resolution that captures stunning, high-quality photos. It comes with an 18-55mm lens that offers versatility for capturing a wide range of subjects, from breathtaking landscapes to detailed portraits. Additionally, this kit includes a memory card, making it easy to store and transfer your images, and a camera bag to protect your gear while on the go. With user-friendly features and excellent image quality, the Canon Digital EOS Rebel XS Starter Kit is perfect for those who want to take their photography skills to the next level.


0.9207944869995117: Canon PowerShot Silver 14.7 Megapixel Digital Camera - SD990IS
The Canon PowerShot Silver 14.7 Megapixel Digital Camera - SD990IS is a high-quality camera