<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Recommendations during a product search using Generative AI with Vantage</b>
</header>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Introduction:</b></p>
<p style = 'font-size:16px;font-family:Arial'>This demo combines OpenAI embeddings with Vantage in-database functions to recommend products to consumers while they search for items on a website.</p>

<p style = 'font-size:16px;font-family:Arial'>In this demo, we will build a product recommendation system using OpenAI embeddings and the Vantage in-database function VectorDistance. Recommendation systems are a type of information filtering system that seeks to predict the rating or preference that a user would give to an item. They are often used on e-commerce websites to recommend products to users based on their past purchase history, browsing behaviour, and other factors. In this demo, we make product-to-product recommendations based on embedding distances. The VectorDistance function returns the closest products from the database.</p>

<p style = 'font-size:16px;font-family:Arial'>The following diagram illustrates the architecture.</p>

<center><img src="images/header1.png" alt="Product_search_architecture"  width=600 height=600/></center>

<br>
<p style = 'font-size:16px;font-family:Arial'>Before going any further, let's get a better understanding of embeddings.</p>

<ul style = 'font-size:16px;font-family:Arial'><li> <b>Embeddings:</b></li></ul>
<p style = 'font-size:16px;font-family:Arial'> &emsp;  &emsp; Embeddings are the AI-native way to represent any kind of data as vectors of numbers, which makes them a good fit for AI-powered tools and algorithms. They can represent text, images, and soon audio and video. There are many options for creating embeddings, whether locally using an installed library, or by calling an API.</p>
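
<p style = 'font-size:16px;font-family:Arial'>As a minimal sketch of why embeddings are useful for recommendations, the snippet below compares a few products by cosine similarity. The 3-number vectors and the query name "Sea Salt" are made up for illustration; real embeddings in this demo come from the OpenAI API and are much longer.</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, invented for this illustration only.
toy_embeddings = {
    "Sea Salt": [0.9, 0.1, 0.0],
    "All-Seasons Salt": [0.8, 0.2, 0.1],
    "Chocolate Sandwich Cookies": [0.1, 0.9, 0.3],
}

query = "Sea Salt"
# Score every other product against the query product.
scores = {
    name: cosine_similarity(toy_embeddings[query], vector)
    for name, vector in toy_embeddings.items()
    if name != query
}
# The product with the highest similarity is the best recommendation.
closest = max(scores, key=scores.get)
print(closest)
```

<p style = 'font-size:16px;font-family:Arial'>With these toy values, the two salt products point in nearly the same direction, so they score as more similar to each other than to the cookies.</p>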

<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b>Steps in the analysis:</b></p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Configuring the environment</li>
    <li>Connect to Vantage</li>
    <li>Data Exploration</li>
    <li>Generate the OpenAI embeddings</li>
    <li>Calculate the VectorDistance using Teradata Vantage in-DB function</li>
    <li>Display the recommended products for the users</li>
    <li>Cleanup</li>
</ol>

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>1. Configuring the environment</b>

In [None]:
%%capture
# '%%capture' suppresses the display of installation steps of the following packages

!pip install -r requirements.txt --quiet

<p style = 'font-size:16px;font-family:Arial'>
    <i>The above statements will install the required libraries to run this demo. To gain access to installed libraries after running this, restart the kernel.</i></p>

<p style = 'font-size:16px;font-family:Arial'><b>To restart the kernel, press the escape key first, then type 0 0.</b></p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>1.1 Import the required libraries</b></p>

<p style = 'font-size:16px;font-family:Arial'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
import io
import os
import numpy as np
import pandas as pd
import tiktoken

# teradata lib
from teradataml import *
from teradataml import VectorDistance

# open AI
import openai
from openai.embeddings_utils import get_embedding

# Suppress warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
display.max_rows = 10

display.print_sqlmr_query=False
display.suppress_vantage_runtime_warnings=True

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>2. Connection to Vantage and OpenAI</b>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>2.1 Get the OpenAI API key</b></p>

<p style = 'font-size:16px;font-family:Arial'>In order to utilize this demo, you will need an OpenAI API key. If you do not have one, please refer to the instructions provided in this guide to obtain your OpenAI API key: </p>

[Openai_setup_api_key_guide](..//Openai_setup_api_key/Openai_setup_api_key.md)

In [None]:
# enter your openai api key
api_key = input('\n Please Enter OpenAI API key: ')

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>2.2 Connect to Vantage</b></p>
<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)
eng.execute('''SET query_band='DEMO= Recommendations_product_search_OpenAI_Python.ipynb;' UPDATE FOR SESSION;''')

<p style = 'font-size:16px;font-family:Arial'>Begin running steps with Shift + Enter keys. </p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>2.3 Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but uses some of your available storage. The following cell contains two statements, one of which is commented out. You can switch modes by changing which statement is commented out.</p>

In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_Grocery_Data_cloud');"        # Takes 1 minute
# %run -i ../run_procedure.py "call get_data('DEMO_Grocery_Data_local');"        # Takes 2 minutes

<p style = 'font-size:16px;font-family:Arial'>Next is an optional step – if you want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"        # Takes 10 seconds

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>3. Data Exploration</b>

<p style = 'font-size:16px;font-family:Arial'>Product recommendation systems are a type of recommender system that suggests products to users based on what they are searching for in the search box. To recommend products to users, we will use OpenAI embeddings and Vantage in-database functions.</p>

<p style = 'font-size:16px;font-family:Arial'>The data for this demo comes from the products table of the Instacart dataset. The dataset also contains other tables, such as orders, aisles, departments, and order_products_prior, which hold additional information about orders, aisles, departments, and product purchases. However, for this demo, we will only use the products table.</p>

<p style = 'font-size:16px;font-family:Arial'>The products table contains information about all of the products that are available on Instacart. This includes the product id, the product name, and the product's department and aisle, which can be used to group products together.</p>

<p style = 'font-size:16px;font-family:Arial'>Each row of the products table describes one product. Below is the list of columns in the products table:</p>
<p style = 'font-size:16px;font-family:Arial'> 
<ol style = 'font-size:16px;font-family:Arial'>
    <li>product_id</li>
    <li>product_name</li>
    <li>aisle_id</li>
   <li>department_id</li>

</ol>
</p>

<p style = 'font-size:16px;font-family:Arial'>The source data from <a href="https://www.kaggle.com/competitions/instacart-market-basket-analysis/data">Kaggle</a> is loaded in Vantage in a table named <i>Products</i>.</p>

<p style = 'font-size:16px;font-family:Arial'><b><i>*Please scroll down to the end of the notebook for detailed column descriptions of the dataset.</i></b></p>

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>3.1 Examine the Products table</b></p>    
<p style = 'font-size:16px;font-family:Arial'>Let's look at the sample data in the Products table.</p>

In [None]:
tdf = DataFrame(in_schema('DEMO_Grocery_Data', 'products'))
print("Data information: \n",tdf.shape)
tdf.sort('product_id')

<p style = 'font-size:16px;font-family:Arial'>There are approximately 50K records in all, with 4 columns. The products come from different departments. We will recommend products to the user while the user is searching for items on the page.</p>

<p style = 'font-size:16px;font-family:Arial'>To save the cost of generating embeddings from OpenAI, we will use the <b>first 50 products</b> in this demo. This will allow us to test the system without incurring too much cost. Once we have validated the system, we can then consider expanding it to include more products.</p>

In [None]:
# keep the first 50 products to limit the cost of generating embeddings
tdf_sample = tdf.loc[tdf.product_id <= 50]

print(tdf_sample.shape)
tdf_sample.sort('product_id')

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>4. Generate the embeddings </b>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>4.1 Generate the embeddings for product table</b></p>    

<p style = 'font-size:16px;font-family:Arial'>Under the hood, we will use the OpenAI Embeddings API to generate the embeddings. OpenAI embeddings are text embeddings that can be used to represent products in a way that captures their semantic meaning. To generate embeddings for the products table, we will pass the product name field of each product to the API. Please refer to the <a href="https://platform.openai.com/docs/guides/embeddings"> Embeddings documentation</a> for more information about embeddings and types of models available.</p>

<p style = 'font-size:16px;font-family:Arial'>The OpenAI Embeddings API takes a text string as input and returns a vector of numbers that represent the embedding. The length of the vector depends on the model that you are using. For example, the text-embedding-ada-002 model returns a vector of 1536 numbers.</p>

<p style = 'font-size:16px;font-family:Arial'>In this demo, we will use text-embedding-ada-002 as the model and cl100k_base as the encoding technique.</p>

In [None]:
# embedding model parameters
embedding_model = "text-embedding-ada-002"
# set api key
openai.api_key = api_key

def get_embeddings(tdf):
    # convert to pandas df
    result_df = tdf.to_pandas()

    # This may take a few minutes
    result_df["embedding"] = result_df.product_name.apply(lambda x: get_embedding(x, engine=embedding_model))
    
    # Spread each embedding list over columns embeddings_0, embeddings_1, ...
    # The "embedding" column is looked up by name rather than by position.
    embedding_columns = pd.DataFrame(result_df["embedding"].tolist(), index=result_df.index).add_prefix("embeddings_")

    # Replace the list column with the individual embedding columns
    return pd.concat([result_df.drop(columns="embedding"), embedding_columns], axis=1)

<p style = 'font-size:16px;font-family:Arial'>To generate the embeddings, we will call the <b>get_embeddings()</b> function. This function will convert the Teradata DataFrame to a Pandas DataFrame and generate the embeddings. Once the embeddings are generated, we will store them in separate columns so that we can pass them to the <b>VectorDistance()</b> function later on.</p>
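
<p style = 'font-size:16px;font-family:Arial'>The reshaping step can be illustrated without pandas or an API call. The sketch below uses a hypothetical helper, <b>expand_embedding()</b>, and a made-up 3-number embedding to show how one list-valued field becomes numbered columns named like those passed to VectorDistance.</p>

```python
def expand_embedding(row, column="embedding", prefix="embeddings_"):
    """Return a copy of `row` with the list stored in `column` spread over numbered keys."""
    expanded = {key: value for key, value in row.items() if key != column}
    for i, value in enumerate(row[column]):
        expanded[f"{prefix}{i}"] = value
    return expanded

# Hypothetical row; the embedding values are invented and far shorter than a real one.
row = {
    "product_id": 2,
    "product_name": "All-Seasons Salt",
    "embedding": [-0.01, 0.007, 0.011],
}

wide_row = expand_embedding(row)
print(list(wide_row))
```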

In [None]:
df_sample = get_embeddings(tdf_sample)

# Print the DataFrame.
df_sample.head()

<p style = 'font-size:16px;font-family:Arial'>We can see that the generated embedding for each product is spread over 1536 columns.</p>

<p style = 'font-size:16px;font-family:Arial'>For example, the generated embedding for the product name <b>All-Seasons Salt</b> is a vector of 1536 numbers that looks like: -0.010695	0.007814	0.011698	-0.009455	-0.023762	0.013521	...	 </p>

<p style = 'font-size:16px;font-family:Arial'>Now that we have generated the embeddings from the product names, we save the product embeddings dataframe into a Vantage table named <b>product_embeddings</b> so that we can use it in the next steps.</p>

In [None]:
copy_to_sql(df_sample,
            table_name='product_embeddings',
            if_exists = 'replace')

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b> 4.2 Get the embeddings for a few product search terms</b></p>

<p style = 'font-size:16px;font-family:Arial'>Let's take <b>5 products</b> (product ids 56 to 60) as search terms and check which products from our database are recommended for them. To do this, we need to follow the same process as before: generate the embeddings for the products and store them back to a Vantage table.</p>

In [None]:
tdf_search_products = tdf.loc[(tdf.product_id > 55) & (tdf.product_id <= 60)]

print(tdf_search_products.shape)
tdf_search_products.sort('product_id').head(10)

<p style = 'font-size:16px;font-family:Arial'>The get_embeddings() function uses the OpenAI Embeddings API to generate the embeddings.</p>

In [None]:
df_search_products = get_embeddings(tdf_search_products)

# Print the DataFrame.
df_search_products.head()

<p style = 'font-size:16px;font-family:Arial'>We have now generated the embeddings for the searched product names. Before we can use them further, we save this dataframe into a new table called <b>search_product_embeddings</b>.</p>

In [None]:
copy_to_sql(df_search_products,
            table_name='search_product_embeddings',
            if_exists = 'replace',
            primary_index = 'product_id')

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>5. Calculate the VectorDistance using Teradata Vantage in-DB function</b>

<p style = 'font-size:16px;font-family:Arial'>The TD_VectorDistance function accepts a table of target vectors and a table of reference vectors and returns a table that contains the distance between target-reference pairs.</p>

<p style = 'font-size:16px;font-family:Arial'>If you provide only one table as the input, the function takes both the target vectors and the reference vectors from that table.</p>

In [None]:
product_embeddings_df = DataFrame(in_schema('demo_user', 'product_embeddings'))
search_product_embeddings_df = DataFrame(in_schema('demo_user', 'search_product_embeddings'))

# print("product_embeddings_df: ",product_embeddings_df.shape)
# print("search_product_embeddings_df: ",search_product_embeddings_df.shape)

# list out the embedding column names
emb_column_names = search_product_embeddings_df.columns[4:]
search_product_embeddings_df = search_product_embeddings_df.set_index(keys='product_id')

<p style = 'font-size:16px;font-family:Arial'>The VectorDistance function calculates the distance between each target vector and each reference vector. We use the cosine distance metric, for which a smaller distance means that two vectors are more similar. For each target vector, the function can include between 1 and 100 of the closest reference vectors in the output table. In this demo, we want the top 5 closest reference vectors for each target vector.</p>
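
<p style = 'font-size:16px;font-family:Arial'>To make the top-k idea concrete, here is a plain Python sketch of the same kind of computation. It is not the in-database implementation. It assumes cosine distance is defined as 1 minus cosine similarity, and the helper names and the 2-number vectors are hypothetical.</p>

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length vectors (assumed definition)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k_references(targets, references, k):
    """For each target id, return rows (target_id, reference_id, distance) for the k closest references."""
    rows = []
    for target_id, target_vector in targets.items():
        # Rank all references by distance to this target, closest first.
        ranked = sorted(
            (cosine_distance(target_vector, ref_vector), ref_id)
            for ref_id, ref_vector in references.items()
        )
        for distance, ref_id in ranked[:k]:
            rows.append((target_id, ref_id, distance))
    return rows

# Made-up vectors keyed by made-up ids.
targets = {1: [1.0, 0.0], 2: [0.0, 1.0]}
references = {56: [0.9, 0.1], 57: [0.1, 0.9], 58: [-1.0, 0.0]}

result = top_k_references(targets, references, k=2)
for row in result:
    print(row)
```

<p style = 'font-size:16px;font-family:Arial'>Each target keeps only its k closest references, which mirrors the role of the <b>topk</b> argument in the call below.</p>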

In [None]:
display.print_sqlmr_query=False

# from teradataml import VectorDistance
vector_result = VectorDistance(
                    target_id_column='product_id',
                    target_feature_columns=emb_column_names,
                    ref_id_column='product_id',
                    ref_feature_columns=emb_column_names,
                    distance_measure=['Cosine'],
                    topk=5,
                    target_data=product_embeddings_df,
                    reference_data=search_product_embeddings_df)

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>6. Display the recommended products for the users.</b>

<p style = 'font-size:16px;font-family:Arial'>To view the recommendations, we need to join two tables together. First, we will join the vector distance result table with the product embeddings table. This will give us a table that contains the vector distance scores for each product, as well as the product embeddings. Then, we will join this table with the search products table. This will give us a final table that contains the recommendations for the search products.</p>
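
<p style = 'font-size:16px;font-family:Arial'>The two joins amount to looking up a product name for each id in the distance table. The sketch below shows that idea with plain Python dictionaries. All ids, names for the search products, and distances are made up for illustration.</p>

```python
# Hypothetical distance rows: (target_id, reference_id, distance)
distance_rows = [(1, 56, 0.12), (2, 56, 0.30), (1, 57, 0.25)]

# Hypothetical id-to-name lookups standing in for the two Vantage tables.
catalog_names = {1: "Chocolate Sandwich Cookies", 2: "All-Seasons Salt"}
search_names = {56: "search item A", 57: "search item B"}

# Attach both names to every distance row, then order by search product and distance.
recommendations = sorted(
    (
        {
            "search_product": search_names[reference_id],
            "recommended_product": catalog_names[target_id],
            "distance": distance,
        }
        for target_id, reference_id, distance in distance_rows
    ),
    key=lambda row: (row["search_product"], row["distance"]),
)

for row in recommendations:
    print(row)
```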

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 
display.suppress_vantage_runtime_warnings=True

product_embeddings_df_selected_columns = product_embeddings_df.select(["product_id", "product_name"])
join_result = product_embeddings_df_selected_columns.join(vector_result.result, on = ["product_id=target_id"], how = "inner", lprefix = "pro", rprefix = "vec")

# sort by distance; a smaller cosine distance means a closer match
join_result_sorted = join_result.sort(["distance"], ascending=True)

# join the above joined table with search products
join_result_sorted_selected = join_result_sorted.select(["product_name", "reference_id","distancetype","distance"])
tdf_search_products = search_product_embeddings_df.select(["product_id", "product_name"])

# recommendation results (only the id and name of each search product are needed)
final_results = join_result_sorted_selected.join(tdf_search_products, on = ["reference_id=product_id"], how = "inner", lprefix = "t1", rprefix = "t2")
final_results.head()

<p style = 'font-size:16px;font-family:Arial'>In the table above, we can see the recommendations for the first product.</p>

<hr>
<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>7. Cleanup</b>
<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>Clean up work tables to prevent errors next time.</p>

In [None]:
tables = ["product_embeddings", "search_product_embeddings"]

for t in tables:
    try:
        db_drop_table(table_name=t)
    except Exception:
        # the table may not exist if an earlier step was skipped
        pass

<p style = 'font-size:18px;font-family:Arial;color:#E37C4D'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_Grocery_Data');"        # Takes 5 seconds

In [None]:
remove_context()

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Dataset:</b>


- `product_id`: Unique product id (numeric)
- `product_name`: Name of the product
- `aisle_id`: Aisle id (numeric)
- `department_id`: Department id (numeric)

<p style = 'font-size:16px;font-family:Arial;color:#E37C4D'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Teradataml Python reference: <a href = 'https://docs.teradata.com/search/all?query=Python+Package+User+Guide&content-lang=en-US'>here</a></li>
    <li>OpenAI embeddings reference: <a href='https://platform.openai.com/docs/guides/embeddings'>here</a></li>
</ul>

<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">Copyright © Teradata Corporation - 2023. All Rights Reserved.</footer>