<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Customer Reviews Analysis using Generative AI with Vantage
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:18px;font-family:Arial'><b>Introduction:</b></p>

<p style = 'font-size:16px;font-family:Arial'>Customer reviews analysis is a crucial aspect of understanding customer sentiment and preferences. By leveraging the power of <b>OpenAIEmbeddings</b> and <b>Vantage InDB Analytic Function</b>, we can gain valuable insights from customer reviews.</p>

<p style = 'font-size:18px;font-family:Arial'><b>Why Vantage?</b></p>
<p style = 'font-size:16px;font-family:Arial'>In this demo, we use the <code>TDApiClient</code> Vantage function to generate review embeddings in parallel. This approach can significantly improve performance when generating embeddings at scale.</p>

<p style = 'font-size:16px;font-family:Arial'>In this demo, we will build a customer review analysis system using <b>TDApiClient InDB Analytic Function</b>. Customer reviews analysis involves processing and analyzing customer feedback to identify patterns, sentiment, and trends. This analysis can help businesses improve their products, services, and overall customer experience. By integrating <b>OpenAIEmbeddings</b> and <b>Vantage InDB Analytic Function</b>, we can perform advanced natural language processing (NLP) and machine learning techniques to extract meaningful insights from customer reviews. </p>

<p style = 'font-size:16px;font-family:Arial'>The following diagram illustrates the architecture.</p>

<center><img src="images/review_analysis_main_title.svg" alt="review_analysis1" width=1400 height=1200/></center>

<br>
<p style = 'font-size:16px;font-family:Arial'>Before going any further, let's get a better understanding of embeddings.</p>

<br>


<ul style = 'font-size:16px;font-family:Arial'><li> <b>Embeddings:</b></li></ul>

<p style = 'font-size:16px;font-family:Arial'> &emsp;  &emsp; We believe that embeddings are the AI-native way to represent any kind of data, making them the perfect fit for us when working with all kinds of AI-powered tools and algorithms. We can represent text, images, and soon audio and video. We have many options for creating embeddings, whether locally using an installed library or by calling an API.</p>


<p style = 'font-size:16px;font-family:Arial'> Embedding models, like Word2Vec or GloVe, learn vector representations for words based on co-occurrence statistics. For instance, in Word2Vec, a word's vector is optimized to predict surrounding words in a context window. Each word's vector captures semantic relationships, with similar words having closer vectors. In essence, the model learns to represent words in a multi-dimensional space where similar words are close together. For example, "king" and "queen" might have similar vectors due to their contextual similarity in many sentences.</p>


<p style = 'font-size:16px;font-family:Arial'>Imagine we have a bunch of words, and we want to find a way to represent them in a way that captures their meaning. One way we can do this is by creating a word embedding. A word embedding is a vector of numbers that represents the meaning of a word. We choose the numbers in the vector so that words that are similar in meaning have similar vectors.</p>

<p style = 'font-size:16px;font-family:Arial'>For example, we might have vectors for words like "cheese," "butter," "chocolate," and "sauce" that look like the following:</p>

<center><img src="images/word_embeddings.png" alt="word_embeddings"  width=1000 height=800/></center>

<br>
<p style = 'font-size:16px;font-family:Arial'>In our vector, the numbers don't have any special meaning by themselves. They just represent the way that the word "cheese" is related to other words in our vocabulary.</p>

<p style = 'font-size:16px;font-family:Arial'>We can use word embeddings to find the similarity between words. For example, we can calculate the cosine similarity between the vector for "cheese" and the vector for "butter". The cosine similarity measures how closely two vectors point in the same direction, and it ranges from -1 to 1. A cosine similarity of 1 means that the two vectors are perfectly aligned, a value of 0 means that they are orthogonal (unrelated), and a value of -1 means that they point in opposite directions.</p>
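<p style = 'font-size:16px;font-family:Arial'>As a minimal sketch of the calculation, the snippet below computes cosine similarity in plain Python. The four-dimensional vectors are made up for illustration and are not real embeddings. Note that for vectors with mixed-sign components the value can also be negative, down to -1 for vectors pointing in opposite directions.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 4-dimensional vectors, for illustration only
cheese = [0.9, 0.8, 0.1, 0.0]
butter = [0.8, 0.9, 0.2, 0.1]
engine = [0.0, 0.1, 0.9, 0.8]

# Related words should score higher than unrelated ones
print("cheese vs butter:", round(cosine_similarity(cheese, butter), 3))
print("cheese vs engine:", round(cosine_similarity(cheese, engine), 3))
```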

<p style = 'font-size:16px;font-family:Arial'>We can also use word embeddings to find related words. For example, we can find all of the words that are similar in meaning to "cheese". This would include words like "milk", "cream", "yogurt", and "feta".</p>
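<p style = 'font-size:16px;font-family:Arial'>Finding related words amounts to ranking the vocabulary by similarity to a query word. The sketch below assumes a tiny hand-written vocabulary of three-dimensional vectors. The helper name <code>most_similar</code> and all of the numbers are hypothetical, chosen only to show the ranking step.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical 3-dimensional embeddings, for illustration only
vocabulary = {
    "cheese": [0.90, 0.80, 0.10],
    "milk": [0.80, 0.90, 0.20],
    "yogurt": [0.85, 0.75, 0.15],
    "engine": [0.10, 0.00, 0.90],
    "wheel": [0.00, 0.20, 0.80],
}


def most_similar(word, embeddings, top_n=2):
    """Return the top_n other words ranked by cosine similarity to `word`."""
    query = embeddings[word]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:top_n]]


print(most_similar("cheese", vocabulary))
```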

<p style = 'font-size:16px;font-family:Arial'>We find word embeddings to be a powerful tool for natural language processing. We can utilize them for a variety of tasks, such as sentiment analysis, machine translation, and question answering.</p>

<p style = 'font-size:16px;font-family:Arial'>Above is a visual representation of how word embeddings work.</p>

<p style = 'font-size:16px;font-family:Arial'>Imagine we have a bunch of points in a high-dimensional space. Each point represents a word, and its position in space represents the meaning of the word. Words that are similar in meaning will be close together in space, and words that are different in meaning will be far apart.</p>

<p style = 'font-size:16px;font-family:Arial'>Now, imagine that we project our high-dimensional space onto a two-dimensional plane. The points in this two-dimensional space represent our word embeddings, and the distance between two points gives an approximate measure of the similarity between the two words.</p>

<p style = 'font-size:16px;font-family:Arial'>In this way, we can use word embeddings to represent the meaning of words in a way that is both compact and informative.</p>

<p style = 'font-size:16px;font-family:Arial'><b>Steps in the analysis:</b></p>
<ol style = 'font-size:16px;font-family:Arial'>
    <li>Configuring the environment</li>
    <li>Connect to Vantage</li>
    <li>Data Exploration</li>
    <li>Generate the embeddings</li>
    <li>Load the existing embeddings to DB</li>
    <li>Clustering using Teradata Vantage in-DB function</li>
    <li>Visualization of Clusters with Customer review</li>
    <li>Sentiment Analysis</li>
    <li>Topic Modeling</li>
    <li>Cleanup</li>
</ol>

<hr style='height:2px;border:none;'>
<b style = 'font-size:20px;font-family:Arial'>1. Configuring the environment</b>

In [None]:
%%capture
# '%%capture' suppresses the display of installation steps of the following packages

!pip install --upgrade -r requirements.txt --quiet

<p style = 'font-size:16px;font-family:Arial'>
    <i>The above statements will install the required libraries for us to run this demo. To gain access to the installed libraries after running this, we should restart the kernel.</i></p>

<div class="alert alert-block alert-info">
    <p style='font-size:16px;font-family:Arial'><i><b>Note:</b> If you run this notebook on a platform other than ClearScape Analytics Experience that does not have the libraries installed, the above statements may need to be uncommented. After executing them, be sure to restart the kernel so that the installed libraries are loaded into memory. The simplest way to restart the kernel is to press zero twice: <b>0 0</b></i></p>
</div>

<hr style="height:1px;border:none;">

<p style = 'font-size:18px;font-family:Arial'><b>1.1 Import the required libraries</b></p>

<p style = 'font-size:16px;font-family:Arial'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
import io
import os
import numpy as np
import pandas as pd

# vis
import plotly.express as px
import plotly.graph_objects as go

import timeit
import tqdm
from tqdm.notebook import *

tqdm_notebook.pandas()


# teradata lib
from teradataml import *
from utils.sql_helper_func import *

# Suppress warnings
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning)
display.max_rows = 5

display.print_sqlmr_query = False
display.suppress_vantage_runtime_warnings = True
pd.set_option("display.max_rows", 20)

<hr style='height:2px;border:none;'>
<b style = 'font-size:20px;font-family:Arial'>2. Connect to Vantage</b>

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial'><b>2.1 Connect to Vantage</b></p>
<p style = 'font-size:16px;font-family:Arial'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)
execute_sql('''SET query_band='DEMO=Customer_reviews_analysis_using_GenAI_Python.ipynb;' UPDATE FOR SESSION;''')

<p style = 'font-size:16px;font-family:Arial'>Begin running steps with Shift + Enter keys. </p>

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>2.2 Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. We can either run the demo using foreign tables, which access the data without using any storage in our environment, or download the data to local storage, which may yield faster execution but requires available storage space. The following cell contains two statements, one of which is commented out. We can switch modes by moving the comment marker to the other statement.</p>

In [None]:
# %run -i ../run_procedure.py "call get_data('DEMO_CustomerReviews_cloud');"        # Takes 30 seconds
%run -i ../run_procedure.py "call get_data('DEMO_CustomerReviews_local');"        # Takes 1 minute

<p style = 'font-size:16px;font-family:Arial'>Next is an optional step – if we want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"        # Takes 10 seconds

<hr style='height:2px;border:none;'>
<b style = 'font-size:20px;font-family:Arial'>3. Data Exploration</b>

<p style = 'font-size:16px;font-family:Arial'>The data for this demo comes from the reviews table of TPCx-AI. There are also a few other tables, such as orders, line-items, order_returns, products, etc. However, for this demo, we will only use the reviews table.</p>

<p style = 'font-size:16px;font-family:Arial'>The reviews table contains information about all of the customer reviews that are available in TPCx-AI. This includes the customer id, review, and spam.</p>


<p style = 'font-size:16px;font-family:Arial'>Each row is a snapshot of data taken from the review table. Below is the list of columns in the review table:</p>
<p style = 'font-size:16px;font-family:Arial'> 
<ol style = 'font-size:16px;font-family:Arial'>
    <li>id</li>
    <li>review</li>
    <li>spam</li>

</ol>
</p>

<p style = 'font-size:16px;font-family:Arial'>The source data from <a href="https://www.tpc.org/tpcx-ai/default5.asp">TPCx-AI</a> is loaded into Vantage in a table named <i>marketplace</i>.</p>

<p style = 'font-size:16px;font-family:Arial'><b><i>*Please click <a href="#section102">here</a> to scroll down to the end of the notebook for detailed column descriptions of the dataset.</i></b></p>

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial'><b>3.1 Examine the Customer review table</b></p>    
<p style = 'font-size:16px;font-family:Arial'>Let's look at the sample data in the customer review table.</p>

In [None]:
# tdf = DataFrame("customer_reviews")
tdf = DataFrame(in_schema("DEMO_CustomerReviews", "customer_review"))
print("Data information: \n", tdf.shape)
tdf.sort("id")

<p style = 'font-size:16px;font-family:Arial'>There are approx 1k sample records in all, and there are 3 variables. Reviews are listed from different customers. We shall analyse reviews for sentiments and major topics.</p>

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>3.2 Do we want to generate the embeddings?</b></p>    
<p style = 'font-size:16px;font-family:Arial'>We have already generated embeddings for the customer review and stored them in files.</p>

<center><img src="images/decision_emb_generation.svg" alt="embeddings_decision" /></center>

<div class="alert alert-block alert-info">
    <p style = 'font-size:16px;font-family:Arial'><i><b>Note: To skip the embedding generation step and move on to the next section, please click <a href="#section501">here</a>.</b></i></p>
</div>

<p style = 'font-size:16px;font-family:Arial'>To save time, we can move to the already generated embeddings section. However, if we would like to see how we generate the embeddings, or if we need to generate the embeddings for a different dataset, then continue to the following section.</p>

<hr style='height:2px;border:none;'>
<a id='section4'></a>

<b style = 'font-size:20px;font-family:Arial'>4. Generate the embeddings </b>


<div class="alert alert-block alert-warning">
    <p style = 'font-size:16px;font-family:Arial'><i><b>In this section, we are creating the OpenAI embeddings for 1000 customer reviews. This will incur usage charges on our OpenAI account.</b></i></p>
</div>

<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>4.1 Get the OpenAI API key</b></p>

<p style = 'font-size:16px;font-family:Arial'>In order to utilize this demo, you will need an OpenAI API key. If you do not have one, please refer to the instructions provided in this guide to obtain your OpenAI API key: </p>


<a href="..//Openai_setup_api_key/Openai_setup_api_key.md" style="text-decoration:none;" target="_blank"><button style="font-size:16px;font-family:Arial;color:#fff;background-color:#00233C;border:none;border-radius:5px;cursor:pointer;height:50px;line-height:50px;display:flex;align-items:center;">OpenAI API Key Guide <span style="margin-left:10px;">&#8658;</span></button>
</a>

In [None]:
import getpass

# enter your openai api key
api_key = getpass.getpass("\n Please Enter OpenAI API key: ")

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial'><b>4.2 Generate the embeddings for customer review via API_Request In-database Function</b></p>    

<p style='font-size:16px;font-family:Arial'>OpenAI and Azure OpenAI provide multiple APIs for their hosted models. We introduce integration with the embedding API, which can be used in various types of applications: classification, search, recommendations, and anomaly detection. For more information on our Teradata API Integration, click <a href='https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-VantageTM-API-Integration-Guide-for-Cloud-Machine-Learning/Teradata-Partner-API/Welcome-to-Teradata-API-Integration'>here.</a></p>

<p style='font-size:16px;font-family:Arial'>Under the hood, we will utilize the OpenAI embeddings method to generate the embeddings. OpenAI embeddings are text embeddings that we can use to represent reviews in a way that captures their semantic meaning. To generate embeddings for the customer review table, we will use the review field and call the OpenAI Embeddings API for each customer review. Please refer to the <a href="https://platform.openai.com/docs/guides/embeddings"> Embeddings documentation</a> for more information about embeddings and types of models available.</p>

<p style='font-size:16px;font-family:Arial'>The OpenAI Embeddings API takes a text string as input and returns a vector of numbers that represent the embedding. The length of the vector depends on the model that we are using. For example, the <b>text-embedding-3-small</b> model returns a vector of 1536 numbers.</p>

<p style='font-size:16px;font-family:Arial'>In this demo, we will use <b>text-embedding-3-small</b> as the model, so each embedding has <b>1536</b> dimensions.</p>

<p style = 'font-size:16px;font-family:Arial'>To generate the embeddings, we will call the <b>API_Request</b> function of <b>TDApiClient</b>, passing it the Teradata DataFrame and the name of the text column. Once the embeddings are generated, we will store them in separate columns so that we can pass them to the <b>K-Means</b> function later on.</p>
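<p style = 'font-size:16px;font-family:Arial'>To illustrate what storing an embedding "in separate columns" means, here is a small sketch in plain Python. The helper <code>expand_embedding</code> and the <code>emb_</code> column prefix are hypothetical and may differ from the column names the in-database function actually produces. The point is only that each vector component becomes its own numeric column, which is the shape a clustering function expects.</p>

```python
def expand_embedding(row_id, vector, prefix="emb_"):
    """Turn one embedding vector into a flat row with one column per component."""
    row = {"id": row_id}
    for position, value in enumerate(vector):
        row[f"{prefix}{position}"] = value
    return row


# Two short made-up vectors; a real text-embedding-3-small vector has 1536 components
rows = [
    expand_embedding(1, [0.1, -0.2, 0.3]),
    expand_embedding(2, [0.0, 0.5, -0.4]),
]

# Every column except the identifier is a clustering feature
feature_columns = [name for name in rows[0] if name != "id"]
print(feature_columns)
```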

<div class="alert alert-block alert-info">
    <p style = 'font-size:16px;font-family:Arial'><i><b>Please be patient:</b> Generating embeddings for 1000 reviews may take 60 to 100 seconds, depending on the number of AMPs in the database. Because the volume of data is large and the machine is small, running the code below could take up to 2 minutes. If we prefer to skip this step and proceed to the next section, we can click <a href="#section501">here</a>. </i></p>
</div>

In [None]:
from tdapiclient import TDApiClient

start = timeit.default_timer()
print("embeddings generation started.. please wait")

# set embedding model from openai models
embedding_model = "text-embedding-3-small"

tdf_review_embeddings = TDApiClient.API_Request(
    dataframe=tdf,
    api_type="open-ai-embedding",
    model_name=embedding_model,
    authorization="""{{"Key":"{}"}}""".format(api_key),
    text_column="review",
)

end = timeit.default_timer()
load_time = end - start

print(
    f"generate the embeddings for {tdf_review_embeddings.shape[0]} text:\t",
    load_time,
)
print("----- complete -----")
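<p style = 'font-size:16px;font-family:Arial'>The <code>authorization</code> argument above is assembled with string formatting. A hedged alternative is to build the same JSON string with <code>json.dumps</code>, which yields identical output for ordinary keys and also escapes any special characters. The helper name <code>build_authorization</code> and the sample key are hypothetical, and whether <code>API_Request</code> places further requirements on this string is not confirmed here.</p>

```python
import json


def build_authorization(key):
    """Build the compact JSON string {"Key":"<key>"} with proper escaping."""
    return json.dumps({"Key": key}, separators=(",", ":"))


# Placeholder value, not a real credential
demo_key = "sk-example-not-a-real-key"
print(build_authorization(demo_key))
```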

<a id='section42'></a>
<hr style='height:1px;border:none;'>

<p style = 'font-size:18px;font-family:Arial'><b>4.3 Display the customer review embeddings</b></p>

In [None]:
print("Data information: \n", tdf_review_embeddings.shape)

display.suppress_vantage_runtime_warnings = True
tdf_review_embeddings

<p style = 'font-size:16px;font-family:Arial'>We can see that the generated embedding for each review is a vector stored across 1536 columns. </p>

<p style = 'font-size:16px;font-family:Arial'>For example, the generated embedding for the review <b>Nike SALE Nike UltraBoost These are women's men's</b> consists of 1536 numbers and looks like:<br>
<code>-0.0197175870	0.005220166	-0.017851508	0.00217428	-0.00335274	-0.0050890</code></p>

<p style = 'font-size:16px;font-family:Arial'>Now that we have generated the embeddings from the review text, we save the review embeddings dataframe into a Vantage table named <b>review_embeddings</b> for further use.</p>

In [None]:
delete_and_copy_embeddings(
    table_name="review_embeddings", tdf=tdf_review_embeddings, eng=eng
)

<div class="alert alert-block alert-info" id="no-azure">
    <p style = 'font-size:16px;font-family:Arial'><i><b>Note</b>: If you're generating embeddings for a new document and plan to store it as a file, consider uncommenting the code below. Doing so will significantly speed up the process in future runs by skipping section 4 altogether.</i></p>
</div>

In [None]:
# store the embeddings if you're generating for new document for speed up in next run
# df = tdf_review_embeddings.to_pandas().reset_index()
# df.to_parquet('./embeddings/review_embeddings_openai_prq_1k.parquet.gzip',compression='gzip')

<a id='section501'></a>

<hr style='height:2px;border:none;'>
<b style = 'font-size:20px;font-family:Arial'>5. Load the existing embeddings to DB</b>

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial'><b>5.1 Load the reviews embeddings</b></p>

<p style = 'font-size:16px;font-family:Arial'>In this demo, we will load existing embeddings from files to a database. This will allow us to perform further processing on the embeddings.</p>

<div class="alert alert-block alert-info" id="no-azure">
    <p style = 'font-size:16px;font-family:Arial'><i><b>Note</b>: If you have already executed the Generate the embeddings section, then the code below will be skipped automatically.</i></p>
</div>

In [None]:
is_section4_executed = False
try:
    is_section4_executed = DataFrame.from_table("review_embeddings").size > 0
except Exception:  # table is missing or not readable, so Section 4 was not executed
    is_section4_executed = False

In [None]:
from IPython.display import display, Markdown

def get_section5_desc_start():
    return """<p style = 'font-size:16px;font-family:Arial'>The code above first reads the data from the files. The files contain information about the review embeddings. The code then loads the data into a permanent table in SQL. Once the data is loaded, we will use the Vantage InDB Analytic Function  <code>KMeans</code> to cluster the review embeddings. The data contains review embeddings, which are lists of numerical values, or vectors.</p>
    <p style = 'font-size:16px;font-family:Arial'>The embeddings file contains over 1000 records, each with 1536 numerical features. This means that the file is quite large and it may take some time to load it into SQL.</p>
    <div class="alert alert-block alert-info" id="no-azure">
    <p style = 'font-size:16px;font-family:Arial'><i><b>Note</b>: Please be patient. The code above is loading data from files and copying it to SQL. This process may take 30-50 seconds.</i></p>
    </div>"""


def get_section5_desc_end():
    return """<a id='section52'></a><hr style='height:1px;border:none;'><p style = 'font-size:18px;font-family:Arial'><b>5.2 Display the review embeddings</b></p>
    <p style = 'font-size:16px;font-family:Arial'>To give you a better idea of what the embeddings look like, here are the first five rows of the review embeddings:</p>"""


def get_section5_desc_sample():
    return """<p style = 'font-size:16px;font-family:Arial'>We can see that the generated embedding for each review is a vector stored across 1536 columns. </p>
    <p style = 'font-size:16px;font-family:Arial'>For example, the generated embedding for the review <b>Nike running shoes for man</b> consists of 1536 numbers and looks like:<br>
    <code>-0.038744	-0.016937	-0.017475	0.003624	0.00744	-0.00275	0.02374</code></p>"""

def load_the_emb():
    is_section5_executed = False
    if not is_section4_executed:

        is_section5_executed = True

        start = timeit.default_timer()
        display(Markdown(get_section5_desc_start()))

        # load review_embeddings to sql
        text_embeddings_os_prq = pd.read_parquet(
            "./embeddings/review_embeddings_openai_prq_1k.parquet.gzip"
        )


        print("embeddings shape", text_embeddings_os_prq.shape)
        delete_and_copy_embeddings(
            table_name="review_embeddings", tdf=text_embeddings_os_prq, eng=eng
        )

        end = timeit.default_timer()
        load_time = end - start
        print("embeddings load time:\t", load_time)


        display(Markdown(get_section5_desc_end()))
        review_embeddings_os = DataFrame(in_schema("demo_user", "review_embeddings"))
        return review_embeddings_os, is_section5_executed
    else:
        # print("Section 4: Generate the embeddings is already executed!")
        display(
            Markdown(
                """<br><div class="alert alert-block alert-success">
        <p style = 'font-size:16px;font-family:Arial'><i>Section 4: Generate the embeddings is already executed! So, skipping the execution of above code.</i></p></div>"""
            )
        )


        return None, is_section5_executed


sample_embeddings, flag = load_the_emb()
sample_embeddings.sort("id") if sample_embeddings is not None else None

<p style = 'font-size:16px;font-family:Arial'>The code below will not run if Section 5 has already been skipped.</p>

In [None]:
display(Markdown(get_section5_desc_sample())) if flag else None

<hr style='height:2px;border:none;'>
<b style = 'font-size:20px;font-family:Arial'>6. Calculate the K-Means Clusters using Teradata Vantage in-DB function</b>

<p style = 'font-size:16px;font-family:Arial'>The k-means algorithm groups a set of observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or cluster centroid). This algorithm minimizes the objective function, that is, the sum of squared Euclidean distances between each data point and the center of its cluster, as follows:</p>
<ol style = 'font-size:16px;font-family:Arial'>
<li>Specify or randomly select k initial cluster centroids.</li>
<li>Assign each data point to the cluster that has the closest centroid.</li>
<li>Recalculate the positions of the k centroids.</li>
<li>Repeat steps 2 and 3 until the centroids no longer move.</li>
    </ol>
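<p style = 'font-size:16px;font-family:Arial'>The steps above can be sketched in a few lines of plain Python. This is an illustrative toy implementation on made-up two-dimensional points, not the Vantage in-database <code>KMeans</code> function. It picks initial centroids at random from the data and stops when the assignments stop changing, which implies the centroids no longer move.</p>

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Toy k-means following the four steps listed above."""
    rng = random.Random(seed)
    # Step 1: randomly select k data points as the initial centroids
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the closest centroid
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Step 4: stop once nothing changes any more
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: recalculate each centroid as the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Two well-separated made-up groups of points
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]]
centroids, assignments = kmeans(points, k=2)
print("assignments:", assignments)
print("centroids:", centroids)
```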

<a id='section41'></a>
<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial'><b>6.1 Filter columns from the embeddings</b></p>
<p style = 'font-size:16px;font-family:Arial'>In the steps above, we took a sample of the dataset consisting of 1000 reviews from e-commerce customers. We now need to find clusters in these reviews. K-Means clustering needs only the embedding values, so we discard the remaining columns from the dataframe.</p>

In [None]:
review_embeddings = sample_embeddings if flag else tdf_review_embeddings
embedding_column_list = review_embeddings.drop(columns=["id", "review", "spam"])
embedding_column_names = review_embeddings.columns[3:]

<p style = 'font-size:16px;font-family:Arial'>The next question we face is <b>how many clusters?</b> To find this number, we use a technique called the <b>Elbow method</b>, which is used in data science to help determine the optimal number of clusters in a dataset. In the code snippet below, we try cluster counts from 1 to 39 (the upper bound in <code>k=(1, 40)</code> is exclusive) and record the distortion value for each. The visualizer shows us where the elbow lies in the graph.</p>
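<p style = 'font-size:16px;font-family:Arial'>To build intuition for what "the elbow" is, the sketch below applies a simple geometric heuristic: after scaling both axes to the range 0 to 1, it picks the point farthest from the straight line joining the first and last points of the distortion curve. The function name <code>find_elbow</code> and the distortion values are made up, and this heuristic is not necessarily the detection method that <code>KElbowVisualizer</code> uses internally.</p>

```python
def find_elbow(ks, distortions):
    """Return the k whose point lies farthest from the line joining the curve's endpoints."""
    k_min, k_max = ks[0], ks[-1]
    d_min, d_max = min(distortions), max(distortions)
    # Scale both axes to [0, 1] so that neither dominates the distance
    xs = [(k - k_min) / (k_max - k_min) for k in ks]
    ys = [(d - d_min) / (d_max - d_min) for d in distortions]
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]

    def gap(x, y):
        # Proportional to the perpendicular distance from (x, y) to the endpoint line
        return abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)

    best = max(range(len(ks)), key=lambda i: gap(xs[i], ys[i]))
    return ks[best]


# Made-up distortion values that drop quickly and then flatten out
ks = [1, 2, 3, 4, 5, 6, 7, 8]
distortions = [1000, 600, 350, 200, 170, 150, 135, 125]
print("elbow at k =", find_elbow(ks, distortions))
```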

<p style = 'font-size:16px;font-family:Arial'>Vantage's ClearScape Analytics integrates easily with third-party Python visualization modules such as plotly and seaborn. We can do all the calculations and pre-processing on Vantage and pass only the necessary information to the visualization tools. This makes the calculation faster and also saves time by reducing data movement between tools. We transfer data for this and the subsequent visualizations only where necessary.</p>

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial'><b>6.2 Find the optimal number of clusters</b></p>

In [None]:
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans

model = KMeans()
visualizer = KElbowVisualizer(model, k=(1, 40))

<div class="alert alert-block alert-info" id="no-azure">
    <p style = 'font-size:16px;font-family:Arial'><i><b>Note</b>: Please be patient. We are currently performing some mathematical calculations. This process may take 3-5 minutes.</i></p>
</div>

In [None]:
visualizer.fit(embedding_column_list.to_pandas())
no_clusters = visualizer.elbow_value_
print(f"optimal number of clusters: {no_clusters}")
visualizer.show()

In [None]:
display(
    Markdown(
        f"""<p style = 'font-size:16px;font-family:Arial'>We observe that the elbow lies at <b>{no_clusters}</b>, so the optimum number of clusters is <b>{no_clusters}</b>. With that information established, we now begin the process of clustering.</p>"""
    )
)

<hr style='height:1px;border:none;'>
<p style = 'font-size:18px;font-family:Arial'><b>6.3 Apply  K-Means Clusters using Teradata Vantage InDB Analytic Function</b></p>

<p style = 'font-size:16px;font-family:Arial'>The function runs a fast K-Means clustering algorithm and returns the cluster means
    and averages. We also need to find which cluster each of the 1000 customer reviews in our sample belongs to.</p>

In [None]:
from teradataml import KMeans

kmeans_model = KMeans(
    id_column="id",
    data=review_embeddings,
    target_columns=embedding_column_names,
    num_clusters=int(no_clusters),
    output_cluster_assignment=True,
)

<p style = 'font-size:16px;font-family:Arial'>To view the number of reviews per cluster, we check the cluster information from the KMeans model, which shows the clusters and the number of entries in each cluster.</p>

In [None]:
selected_result = kmeans_model.result
selected_model_n = selected_result[selected_result.td_clusterid_kmeans.ge(0)]
selected_model_n.groupby("td_clusterid_kmeans").count()

In [None]:
df_final = selected_model_n.join(
    other=review_embeddings, on=["id"], how="inner", lsuffix="t1", rsuffix="t2"
)
df_final = df_final.drop(["id_t2"], axis=1)

<p style = 'font-size:16px;font-family:Arial'>Now, let's copy clustered data to SQL.</p>

<div class="alert alert-block alert-info" id="no-azure">
    <p style = 'font-size:16px;font-family:Arial'><i><b>Note</b>: Please be patient. The following code optimizes performance by temporarily transferring data between a dataframe and SQL. This process may take 30-50 seconds.</i></p>
    </div>

In [None]:
copy_to_sql(
    df_final,
    table_name="review_clustered",
    primary_index="id_t1",
    if_exists="replace",
)

# fetch to df
df_review_clustered = DataFrame("review_clustered")

<p style = 'font-size:16px;font-family:Arial'>The code below facilitates the visualization of all similar customer reviews in a single cluster, effectively grouping similar reviews together.</p>

In [None]:
df_cluster1 = df_review_clustered.loc[df_review_clustered.td_clusterid_kmeans == 1][
    ["id_t1", "td_clusterid_kmeans", "review"]
]


df_cluster1

<hr style='height:2px;border:none;'> 

<p style='font-size:20px;font-family:Arial'><b>7. Visualization of Clusters with Customer reviews</b></p> 

<p style='font-size:16px;font-family:Arial'>The graph displays the clustering of reviews into distinct groups. Based on the analysis, the data has been divided into the optimal number of clusters found above, each representing a distinct pattern or category of reviews. This clustering approach helps to identify the key areas or types of reviews that are most prevalent, allowing for more targeted investigation and resolution efforts.</p>

In [None]:
clus = df_review_clustered.to_pandas().reset_index()

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42)
tsne_result = tsne.fit_transform(clus.iloc[:, 20:-1])

tsne_df = pd.DataFrame(tsne_result, columns=["tsne_1", "tsne_2"])
tsne_df["cluster_id"] = clus["td_clusterid_kmeans"]
tsne_df["id"] = clus["id_t1"]  # review identifier (previously the review text was stored here)

In [None]:
import pandas as pd
import plotly.express as px

# Assumes tsne_result and clus were computed in the previous cell

# Create a new DataFrame combining t-SNE results with review information
tsne_complaint_df = pd.DataFrame(tsne_result, columns=["tsne_1", "tsne_2"])
tsne_complaint_df["cluster_id"] = clus["td_clusterid_kmeans"]
tsne_complaint_df["id"] = clus["id_t1"]
tsne_complaint_df["review"] = clus["review"]

# Truncate text for hover data
max_chars = 50  # Maximum characters to display
tsne_complaint_df["truncted_reviews"] = clus["review"].apply(
    lambda x: x[:max_chars] + "..." if len(x) > max_chars else x
)

# Plot using Plotly Express
fig = px.scatter(
    tsne_complaint_df,
    x="tsne_1",
    y="tsne_2",
    color="cluster_id",
    hover_data=["id", "truncted_reviews", "cluster_id"],
)

fig.update_traces(marker=dict(size=15))
fig.update_layout(
    title="t-SNE Visualization of Clusters with Review",
    xaxis_title="dimension-1",
    yaxis_title="dimension-2",
    xaxis=dict(tickangle=45),
    width=1000,
    height=800,
    hoverlabel=dict(bgcolor="white", font_size=16, font_family="Rockwell"),
    autosize=False,
)

# Customize the hovertemplate
fig.update_traces(
    hovertemplate="<b>ID:</b> %{customdata[0]}<br>"
    "<b>Review:</b> %{customdata[1]}<br>"
    "<b>Cluster ID:</b> %{customdata[2]}<br><extra></extra>"
)

fig.show()
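<p style = 'font-size:16px;font-family:Arial'>The hover text above is truncated by slicing at a fixed character count, which can cut a word in half. As an optional alternative sketch, the standard library function <code>textwrap.shorten</code> truncates on word boundaries. The helper name <code>truncate_review</code> is hypothetical, and note that <code>shorten</code> also collapses repeated whitespace and counts the placeholder towards the width.</p>

```python
import textwrap


def truncate_review(text, max_chars=50):
    """Shorten text to at most max_chars characters, breaking between words."""
    return textwrap.shorten(text, width=max_chars, placeholder="...")


sample = "The shoes fit well and arrived quickly"
print(truncate_review(sample, max_chars=20))
print(truncate_review("Great shoes"))
```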

<hr style='height:2px;border:none;'> 

<p style='font-size:20px;font-family:Arial'><b>8. Sentiment Analysis</b></p> 

<p style='font-size:16px;font-family:Arial'>Sentiment analysis using Generative AI prompts a large language model to label each unstructured review as positive, negative, or neutral. This helps businesses identify and address customer concerns quickly, improving overall customer satisfaction and loyalty.</p>

<p style='font-size:16px;font-family:Arial'>In this demo, we'll use the AWS Bedrock LLM <code>amazon.titan-text-premier-v1:0</code>, which performs well on text generation tasks.</p>
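<p style='font-size:16px;font-family:Arial'>A text generation model may reply with a full sentence rather than a bare label. The sketch below shows one possible way to map a free-form reply onto the three sentiment labels. The helper name <code>normalize_sentiment</code> and the <code>"Unknown"</code> fallback are assumptions for illustration and are not part of the demo.</p>

```python
import re

# The three labels the sentiment prompt asks for.
SENTIMENT_LABELS = ("Positive", "Negative", "Neutral")


def normalize_sentiment(response_text, labels=SENTIMENT_LABELS, default="Unknown"):
    """Return the first label mentioned in the model's reply (hypothetical helper)."""
    # Match whole words only, ignoring case, so "negative." and "NEGATIVE" both count.
    pattern = r"\b(" + "|".join(re.escape(label) for label in labels) + r")\b"
    match = re.search(pattern, response_text, flags=re.IGNORECASE)
    if match is None:
        return default
    # Map the matched text back to the canonical spelling of the label.
    lowered = match.group(1).lower()
    return next(label for label in labels if label.lower() == lowered)


print(normalize_sentiment("The review is negative."))
print(normalize_sentiment("  Positive\n"))
print(normalize_sentiment("I cannot tell."))
```

<p style='font-size:16px;font-family:Arial'>This simple matching does not understand negation: a reply such as "not positive" would still be mapped to <code>Positive</code>, so it is best treated as a first pass only.</p>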

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>8.1 Configuring AWS CLI</b>
<p style = 'font-size:16px;font-family:Arial'>The following cell will prompt us for the following information:</p>
<ol style = 'font-size:16px;font-family:Arial'>
<li><b>aws_access_key_id</b>: Enter your AWS access key ID</li>
<li><b>aws_secret_access_key</b>: Enter your AWS secret access key</li>
<li><b>region name</b>: Enter the AWS region you want to configure (e.g., us-east-1)</li>
</ol>
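<p style = 'font-size:16px;font-family:Arial'>The AWS CLI and boto3 can also read these three settings from the environment variables <code>AWS_ACCESS_KEY_ID</code>, <code>AWS_SECRET_ACCESS_KEY</code>, and <code>AWS_DEFAULT_REGION</code>. As a minimal sketch, assuming you prefer environment variables over the CLI configuration, the hypothetical helper <code>missing_aws_settings</code> below reports which of them are not set. It inspects the environment only and does not look at the shared AWS configuration files.</p>

```python
import os

# Environment variables that the AWS CLI and boto3 read for credentials and region.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_aws_settings(env=None):
    """List the required variables that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


missing = missing_aws_settings()
if missing:
    print("Not set in the environment:", ", ".join(missing))
else:
    print("All three settings are present in the environment.")
```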

In [None]:
import getpass


def configure_aws():
    print("Configure the AWS CLI")
    # Prompt for the credentials without echoing them to the screen.
    access_key = getpass.getpass("aws_access_key_id: ")
    secret_key = getpass.getpass("aws_secret_access_key: ")
    # The region is not a secret, so a plain prompt is sufficient.
    region_name = input("region name: ")

    # Store the values in the AWS CLI configuration.
    !aws configure set aws_access_key_id {access_key}
    !aws configure set aws_secret_access_key {secret_key}
    !aws configure set default.region {region_name}

In [None]:
does_access_key_exists = !aws configure get aws_access_key_id

if len(does_access_key_exists) == 0:
    configure_aws()

In [None]:
!aws configure list

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>8.2 Initialize the Bedrock Model</b>
<ul style = 'font-size:16px;font-family:Arial'>
<li>The code below initializes a Boto3 client for the “bedrock-runtime” service.</li>
<li>The model can be used for natural language generation tasks.</li>
<li>Define the model id: <code>amazon.titan-text-premier-v1:0</code></li>
</ul>
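<p style = 'font-size:16px;font-family:Arial'>To make the request and response shapes used in this section easier to follow, here is a small sketch that builds the arguments for a Converse call and reads the reply text back out. It does not contact AWS. The helper names <code>build_converse_request</code> and <code>extract_text</code>, and the hand-written <code>fake_response</code>, are assumptions for illustration only.</p>

```python
def build_converse_request(model_id, user_text, max_tokens=512, temperature=0.2, top_p=0.9):
    """Assemble keyword arguments for a Converse call (hypothetical helper)."""
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": user_text}]}],
        "inferenceConfig": {
            "maxTokens": max_tokens,
            "temperature": temperature,
            "topP": top_p,
        },
    }


def extract_text(response):
    """Return the text of the first content block in a Converse-style response."""
    return response["output"]["message"]["content"][0]["text"]


# Hand-written stand-in for a model response, used only to show the shape.
fake_response = {
    "output": {"message": {"role": "assistant", "content": [{"text": "Positive"}]}}
}

request = build_converse_request("amazon.titan-text-premier-v1:0", "Great product!")
print(request["messages"][0]["role"])
print(extract_text(fake_response))
```

<p style = 'font-size:16px;font-family:Arial'>With a Bedrock Runtime client available, the same dictionary could be passed as <code>client.converse(**request)</code>.</p>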

In [None]:
# Use the Converse API to send a text message to Amazon Titan Text.
import boto3
from botocore.exceptions import ClientError

# Create a Bedrock Runtime client in the AWS Region you want to use.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Set the model ID, e.g., Titan Text Premier.
model_id = "amazon.titan-text-premier-v1:0"

In [None]:
def get_conversation(txt):
    # Start a conversation with the user message.
    user_message = f"""User prompt: 
            The following is text from a review:
            {txt}
            Categorize the review as one of the following:

                    Positive
                    Negative
                    Neutral"""

    return [
        {
            "role": "user",
            "content": [{"text": user_message}],
        }
    ]

In [None]:
def sentiment_analysis(df):
    for i in tqdm(range(len(df))):
        try:
            # Send the message to the model, using a basic inference configuration.
            response = client.converse(
                modelId=model_id,
                messages=get_conversation(df.iloc[i, 1]),
                inferenceConfig={"maxTokens": 512, "temperature": 0.2, "topP": 0.9},
            )

            # Extract the response text and store it on the current row.
            response_text = response["output"]["message"]["content"][0]["text"]
            # Use .loc to avoid pandas chained assignment; strip stray whitespace.
            df.loc[df.index[i], "Sentiment"] = response_text.strip()

        except Exception as e:
            # ClientError is a subclass of Exception, so one handler covers both.
            print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
            raise
    return df

In [None]:
pdf_cluster1 = df_cluster1.to_pandas().reset_index()
pdf_cluster1["Sentiment"] = ""

In [None]:
response_df = sentiment_analysis(pdf_cluster1)

In [None]:
response_df.dropna()[["id_t1", "review", "Sentiment"]].head()

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>8.3 Sentiment Prediction Visualization</b>

<p style='font-size:16px;font-family:Arial'>This bar graph visualizes the distribution of sentiment predictions from a dataset, showing the total count for each sentiment category (Positive, Negative, Neutral). Created using Plotly Express, the graph includes text labels on each bar for clarity, highlighting the number of occurrences for each sentiment. This visualization provides a concise overview of sentiment trends in the dataset. </p>
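<p style='font-size:16px;font-family:Arial'>If the model never returns one of the categories, that category would be absent from the chart, and unexpected replies would each get their own bar. As a minimal sketch, the hypothetical helper <code>sentiment_counts</code> below counts labels in a fixed category order, keeps zero counts, and groups anything else under an assumed <code>"Other"</code> bucket. The sample labels are made up for illustration.</p>

```python
from collections import Counter

# Fixed category order for the chart.
CATEGORIES = ("Positive", "Negative", "Neutral")


def sentiment_counts(labels, categories=CATEGORIES):
    """Count labels per category, keeping zeros and pooling the rest (hypothetical helper)."""
    counts = Counter(labels)
    rows = [(category, counts.get(category, 0)) for category in categories]
    # Anything outside the expected categories is pooled into one bucket.
    other = sum(n for label, n in counts.items() if label not in categories)
    if other:
        rows.append(("Other", other))
    return rows


# Made-up labels, for illustration only.
sample = ["Positive", "Negative", "Positive", "The review is mixed."]
for category, count in sentiment_counts(sample):
    print(f"{category}: {count}")
```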

In [None]:
from collections import Counter

data = Counter(response_df["Sentiment"])

# Convert Counter data to DataFrame
viz_df = pd.DataFrame.from_dict(data, orient="index", columns=["Count"]).reset_index()


# Rename columns
viz_df.columns = ["Prediction", "Count"]


# Create bar graph using Plotly Express
fig = px.bar(
    viz_df,
    x="Prediction",
    y="Count",
    color="Prediction",
    labels={"Count": "Number of Occurrences", "Prediction": "Prediction"},
    text="Count",
)

# Update layout to show text labels on bars
fig.update_traces(texttemplate="%{text}", textposition="inside")

# Show the plot
fig.show()

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>8.4 Word Cloud for Negative Reviews Prediction</b>

<p style='font-size:16px;font-family:Arial'>A word cloud gives a quick visual summary of the reviews predicted as negative. It highlights the most frequently occurring words, and therefore the likely pain points, in customer feedback, helping businesses to:</p>

<ol style='font-size:16px;font-family:Arial'>
    <li>Identify trends and sentiment patterns</li>
    <li>Pinpoint areas for improvement</li>
    <li>Make data-driven decisions to enhance customer satisfaction and loyalty</li>
</ol>

<p style='font-size:16px;font-family:Arial'>By leveraging this word cloud, businesses can proactively address customer concerns, refine their products and services, and ultimately drive growth through a deeper understanding of their customers' needs and preferences.</p>
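<p style='font-size:16px;font-family:Arial'>A word cloud sizes each word by how often it occurs. To make that idea concrete, here is a small sketch that counts word frequencies after dropping a few common words. The helper <code>top_words</code>, the tiny stopword list, and the sample reviews are all assumptions for illustration; they are not taken from the dataset.</p>

```python
import re
from collections import Counter

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "is", "it", "was", "to", "of", "i", "this"}


def top_words(reviews, n=5, stopwords=STOPWORDS):
    """Return the n most frequent non-stopword words across the reviews."""
    words = []
    for review in reviews:
        # Lowercase the text and keep runs of letters and apostrophes.
        words.extend(re.findall(r"[a-z']+", review.lower()))
    counts = Counter(word for word in words if word not in stopwords)
    return counts.most_common(n)


# Made-up reviews, for illustration only.
sample_reviews = [
    "The battery died and the battery was slow to charge.",
    "Slow delivery, and the battery is weak.",
]
print(top_words(sample_reviews, n=2))
```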

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt


text = response_df[response_df["Sentiment"] == "Negative"]
combine_text = " ".join(text["review"])

# Remove 'X' characters and the word 'Discover' before building the cloud
modified_string = combine_text.replace("X", "").replace("Discover", "")

wordcloud = WordCloud(collocations=False, background_color="white").generate(
    modified_string
)

# Display the word cloud
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

<hr style='height:2px;border:none;'> 

<p style='font-size:20px;font-family:Arial'><b>9. Topic Modelling</b></p> 

<p style='font-size:16px;font-family:Arial'>Topic modeling in the context of categorizing reviews involves assigning each review to one of three categories: functionality, comparison, and spam. This classification helps in organizing the reviews based on their content and relevance, allowing for better analysis and understanding. By counting the frequency of reviews in each category, businesses can gain insights into the primary concerns of their customers and identify potential spam.</p>
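<p style='font-size:16px;font-family:Arial'>One way to keep the list of topics consistent between the prompt and any later counting is to define it once and build the prompt from it. The sketch below illustrates this under stated assumptions: the helper name <code>build_topic_messages</code> and the exact prompt wording are hypothetical, and the sample review is made up.</p>

```python
# Single definition of the allowed topics.
TOPICS = ("Comparisons", "Functionality", "Spam")


def build_topic_messages(review_text, topics=TOPICS):
    """Build a one-message conversation asking for exactly one topic (hypothetical helper)."""
    options = ", ".join(topics)
    prompt = (
        "The following is text from a customer review:\n"
        f"{review_text}\n\n"
        f"Assign the review to exactly one of these topics: {options}.\n"
        "Reply with the topic name only."
    )
    return [{"role": "user", "content": [{"text": prompt}]}]


# Made-up review, for illustration only.
messages = build_topic_messages("Works better than my old phone.")
print(messages[0]["content"][0]["text"])
```

<p style='font-size:16px;font-family:Arial'>Using a distinct function name for each task also avoids one prompt builder silently replacing another when cells are run out of order.</p>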

In [None]:
def get_conversation(txt):
    # Start a conversation with the user message.
    user_message = f"""User prompt: 
            The following is text from a customer review:
            {txt}
        
            
            Instructions for Topic:
            - The review falls into only one of the following Topics: Comparisons, Functionality, Spam
            - Most critical: Must select at least one topic
            - Select only one Topic out of these: "Comparisons", "Functionality", "Spam"

            Most important: Choose one topic from the following options: Comparisons, Functionality, or Spam."""

    return [
        {
            "role": "user",
            "content": [{"text": user_message}],
        }
    ]

In [None]:
def topic_modelling(df):
    for i in tqdm(range(len(df))):
        try:
            # Send the message to the model, using a basic inference configuration.
            response = client.converse(
                modelId=model_id,
                messages=get_conversation(df.iloc[i, 1]),
                inferenceConfig={"maxTokens": 512, "temperature": 0.2, "topP": 0.9},
            )

            # Extract the response text and store it on the current row.
            response_text = response["output"]["message"]["content"][0]["text"]
            # Use .loc to avoid pandas chained assignment.
            df.loc[df.index[i], "Predicted_Topic"] = response_text

        except Exception as e:
            # ClientError is a subclass of Exception, so one handler covers both.
            print(f"ERROR: Can't invoke '{model_id}'. Reason: {e}")
            raise

    return df

In [None]:
pdf_cluster1["Predicted_Topic"] = ""

response_df = topic_modelling(pdf_cluster1)

In [None]:
# Work on a copy so the column assignment below does not act on a view.
response_df = response_df.dropna().copy()
response_df["Predicted_Topic"] = (
    response_df["Predicted_Topic"].str.strip().str.replace(r"^,+|,+$", "", regex=True)
)
response_df.head()

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>9.1 Review Topics Prediction vs Occurrences</b>

<p style='font-size:16px;font-family:Arial'>This bar graph shows the number of reviews predicted for each topic (Comparisons, Functionality, Spam). It helps identify which topics dominate the reviews and where improvement is needed, supporting data-driven decision making.</p>

In [None]:
from collections import Counter

data = Counter(response_df["Predicted_Topic"])

# Convert Counter data to DataFrame
viz_df = pd.DataFrame.from_dict(data, orient="index", columns=["Count"]).reset_index()

# Rename columns
viz_df.columns = ["Prediction", "Count"]

# Create bar graph using Plotly Express
fig = px.bar(
    viz_df,
    x="Prediction",
    y="Count",
    color="Prediction",
    labels={"Count": "Number of Occurrences", "Prediction": "Prediction"},
    text="Count",
)


# Update layout to show text labels on bars
fig.update_traces(texttemplate="%{text}", textposition="inside")

# Show the plot
fig.show()

<hr style='height:2px;border:none;'>
<b style = 'font-size:20px;font-family:Arial'>10. Cleanup</b>

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>10.1 Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial'>Clean up the work tables to prevent errors the next time the notebook runs.</p>

In [None]:
# Loop through the list of tables and execute the drop table command for each table
for table in db_list_tables()['TableName'].tolist():
    try:
        db_drop_table(table_name=table, schema_name="demo_user")
    except Exception:
        # Skip tables that cannot be dropped and continue with the rest.
        pass

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'> <b>10.2 Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_CustomerReviews');"        # Takes 5 seconds

In [None]:
remove_context()

<a id="section102"></a>
    
<hr style="height:2px;border:none;">

<b style = 'font-size:20px;font-family:Arial'>Required Materials</b>
<p style = 'font-size:16px;font-family:Arial'>Let’s look at the elements we have available for reference for this notebook:</p>

<p style = 'font-size:18px;font-family:Arial'><b>Filters:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li><b>Industry:</b> ECommerce</li>
    <li><b>Functionality:</b> Generative AI</li>
    <li><b>Use Case:</b> Review Analysis</li>
</ul>

<p style = 'font-size:18px;font-family:Arial'><b>Related Resources:</b></p>

<ul style = 'font-size:16px;font-family:Arial'>
    <li><a href='https://www.teradata.com/Blogs/Fraud-Busting-AI'>Fraud-Busting-AI</a></li>
    <li><a href='https://www.teradata.com/Industries/Financial-Services'>Financial Services</a></li>
    <li><a href='https://www.teradata.com/Resources/Datasheets/Move-from-Detection-to-Prevention-and-Outsmart-Fraudsters'>Move from Detection to Prevention and Outsmart Tech-Savvy Fraudsters</a></li>
</ul>

<b style = 'font-size:20px;font-family:Arial'>Dataset:</b>

- `id`: Unique row id
- `review`: customer review
- `spam` : spam or not spam (Numeric)

<b style = 'font-size:16px;font-family:Arial'>Dataset source: <a href="https://www.tpc.org/tpcx-ai/default5.asp">TPCx-AI</a></b>

<p style = 'font-size:18px;font-family:Arial'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Teradataml Python reference: <a href = 'https://docs.teradata.com/search/all?query=Python+Package+User+Guide&content-lang=en-US'>here</a></li>
    <li>OpenAI embeddings reference: <a href='https://platform.openai.com/docs/guides/embeddings'>here</a></li>
</ul>

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2023, 2024. All Rights Reserved
        </div>
    </div>
</footer>