<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Natural Language Processing in Vantage
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
Natural Language Processing (NLP) involves teaching computers to understand, interpret, and generate human language, just like people do. It's about enabling computers to read and understand text, so they can perform tasks that involve language, such as answering questions, understanding customer feedback, or even generating human-like responses.<br>Think of NLP as a translator between humans and computers. It allows computers to analyze and make sense of text data in a way that's meaningful for businesses. NLP has many business uses, for example:<br><b>Customer Insights</b>: NLP helps businesses understand what their customers are saying across different channels like emails, reviews, or social media. It can analyze this text to identify trends, sentiments, and common issues, helping companies tailor their products and services to meet customer needs better.<br><b>Automated Support</b>: NLP powers chatbots and virtual assistants that can understand and respond to customer queries in real-time. These assistants can handle routine inquiries, provide product recommendations, or even troubleshoot problems, freeing up human agents for more complex tasks.<br><b>Information Extraction</b>: NLP can extract valuable information from unstructured text data, such as contracts, legal documents, or research papers. It helps businesses quickly find relevant information, identify key insights, and make informed decisions based on this data.<br><b>Personalization</b>: By analyzing customer interactions and preferences expressed in text, NLP enables businesses to personalize their marketing messages, offers, and user experiences. This personalized approach can lead to higher customer engagement and loyalty.<br><br>In essence, NLP empowers businesses to leverage the power of language to improve customer experiences, streamline operations, and drive better decision-making.</p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Business Values</b></p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Efficiency: Automating sentiment analysis saves time and resources compared to manual analysis.</li>
    <li>Insights: Gain valuable insights into customer sentiment and preferences to drive strategic decision-making.</li>
    <li>Proactive Response: Identify and address customer concerns and issues in real-time to improve customer satisfaction and loyalty.</li>
    <li>Competitive Advantage: Stay ahead of competitors by continuously monitoring and adapting to changing customer sentiments and market trends.</li>
 </ul>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Why Vantage?</b></p>  
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Natural Language Processing deals with huge amounts of data. Pre-processing the text with Vantage's in-database (InDb) text processing functions saves time and scales easily to meet business needs. Moreover, with ClearScape Analytics it is easy to integrate widely used third-party LLMs such as GPT with trusted business data.<br>In this demo we will process the comments received by a retail store using Vantage's InDb functions.  </p>

<hr style='height:2px;border:none;background-color:#00233C;'>
<b style = 'font-size:20px;font-family:Arial;color:#00233c'>1. Configuring the environment</b>

In [None]:
%%capture
# '%%capture' suppresses the display of installation steps of the following packages
!pip install -r requirements.txt --quiet

<div class="alert alert-block alert-info">
    <p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: </b><i>The above statements will install the required libraries to run this demo. Be sure to restart the kernel after executing the above lines to bring the installed libraries into memory. The simplest way to restart the Kernel is by typing zero zero: <b> 0 0</b></i></p>
    </div>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>2. Connect to Vantage</b></p>

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 


#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import getpass

import timeit
import tqdm
from tqdm.notebook import *
tqdm_notebook.pandas()

# teradata lib
from teradataml import *
import teradataml
from teradataml import configure
from teradataml.analytics.valib import *
configure.val_install_location = "val"

# helper
from utils.sql_helper_func import *

display.max_rows = 5
display.print_sqlmr_query=False
display.suppress_vantage_runtime_warnings=True

# markdown
from IPython.display import display, Markdown

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username = 'demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Natural_Language_Processing_Python.ipynb;' UPDATE FOR SESSION; ''')

<p style = 'font-size:18px;font-family:Arial;color:#00233C'> <b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. We have the option of either running the demo using foreign tables to access the data without using any storage on our environment or downloading the data to local storage, which may yield somewhat faster execution. However, we need to consider available storage. There are two statements in the following cell, and one is commented out. We may switch which mode we choose by changing the comment string.</p>

In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_Retail_local');"
 # takes about 2 minutes 30 seconds, estimated space: 90 MB
#%run -i ../run_procedure.py "call get_data('DEMO_Retail_cloud');" 
# takes about 30 seconds, estimated space: 0 MB

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Optional step – We should execute the below step only if we want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>3. Data Exploration</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Please check that the database version is above 17.20.03.21 so that the functions work correctly. If the database version is lower than that, please create a new VM in ClearScape Analytics Experience.</p>

In [None]:
configure.database_version

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let us start by creating a "Virtual DataFrame" that points directly to the dataset in Vantage.</p>

In [None]:
tdf_reviews = DataFrame('"DEMO_Retail"."Web_Comment"')
tdf_reviews

In [None]:
tdf_reviews.shape

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have around 23k comments in our dataset. Let us first remove the null comments.</p>

In [None]:
tdf_nonull = tdf_reviews[tdf_reviews.comment_text.isnull() == 0]

<p style = 'font-size:16px;font-family:Arial;color:#00233C'> For demo purposes, we will use 5k comments for our analysis.</p>

In [None]:
tdf_sample = tdf_nonull.iloc[:5000, :]

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. Sentiment Extraction</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Sentiment Extraction is the process of analyzing large volumes of text to determine whether it expresses a positive, negative, or neutral sentiment.<br> The ClearScape Analytics SentimentExtractor function
uses a dictionary model to extract the sentiment (positive, negative, or neutral) of each input document or sentence. The dictionary model consists of WordNet (a lexical database of the English language). The function handles negated sentiments as follows:<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>-1 if the sentiment is negated. For example, I am not happy.</li>
    <li>-1 if one word separates the sentiment and a negation word. For example, I am not very happy.</li>
    <li>+1 if two or more words separate the sentiment and a negation word. For example, I am not saying I am happy.</li></ul>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let us use this function to see the overall sentiments of comments received in our sample dataset.</p>     

In [None]:
sentimentextractor_out = SentimentExtractor(text_column="comment_text",
                                                data=tdf_sample,
                                                accumulate=['comment_id', 'comment_text']
                                                )

senti = sentimentextractor_out.result
senti

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As we can see, the function outputs the polarity, the sentiment score, and the sentiment words on which it calculated the score. In the sentiment_words output column, the function returns the overall positive and negative scores and the words used for scoring the sentiment (-1 for negative and +1 for positive sentiment). In brackets, it also displays how many times the word is repeated in the comment. For example, beautiful 1 (2) means that beautiful is a positive sentiment word and has occurred twice in the comment. We also have the option of providing a custom dictionary to the function.</p>
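<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the negation rules listed earlier concrete, here is a minimal plain-Python sketch of a dictionary-based scorer. It is only an illustration and not the SentimentExtractor implementation: the word lists, the <code>score</code> function, and the polarity labels are hypothetical, and flipping the sign of negated negative words is our own assumption.</p>

```python
# Toy dictionary-based sentiment scorer illustrating the negation rules.
# The word lists are hypothetical; SentimentExtractor uses its own dictionary.
SENTIMENT = {"happy": 1, "beautiful": 1, "great": 1, "bad": -1, "terrible": -1}
NEGATIONS = {"not", "no", "never"}


def score(text):
    """Return (polarity, total_score) for one comment."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    total = 0
    for i, tok in enumerate(tokens):
        if tok not in SENTIMENT:
            continue
        value = SENTIMENT[tok]
        # Flip the sign when a negation word sits directly before the
        # sentiment word, or with exactly one word in between.
        window = tokens[max(0, i - 2):i]
        if any(w in NEGATIONS for w in window):
            value = -value
        total += value
    polarity = "POS" if total > 0 else "NEG" if total < 0 else "NEU"
    return polarity, total


for sentence in ["I am not happy.",
                 "I am not very happy.",
                 "I am not saying I am happy."]:
    print(sentence, score(sentence))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The first two sentences score as negative because the negation is within two words of "happy"; in the third, three words separate them, so the positive score is kept.</p>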

In [None]:
d1=senti.select(['comment_id','polarity']).groupby('polarity').count()
d1 = d1.assign(drop_columns=True,
          Polarity=d1.polarity,
          Count=d1.count_comment_id)
d1

In [None]:
plot1 = d1.to_pandas()
# Create a bar plot
ax = plot1.plot(kind='bar', x='Polarity', y='Count', color='skyblue', edgecolor='black', figsize=(8, 6))

# Add count labels on top of the bars
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')

# Rotate x-labels by 45 degrees
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')

# Add labels and title
plt.title('Polarity of comments')
plt.xlabel('Polarity')
plt.ylabel('Counts')

# Show the plot
plt.show()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>From the above chart we can see that the comments are largely positive in sentiment.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>5. Integrating with an LLM to Create Word Embeddings</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
Word embedding is a technique used in natural language processing (NLP) to represent words as dense vectors. This allows words with similar meanings to have similar representations. Word embeddings capture semantic relationships between words, enabling NLP models to better understand and process human language.<br><br>Traditional methods of representing words, such as one-hot encoding or bag-of-words, represent each word as a sparse vector where most elements are zero and only one element is one (for one-hot encoding) or a count of occurrences (for bag-of-words). These representations do not capture semantic similarity between words and can result in high-dimensional and sparse feature spaces.<br> <br>Word embeddings represent words as dense vectors of fixed dimensionality (e.g., 100, 200, or 300 dimensions) where each dimension represents a different aspect of the word's meaning. These vectors are learned from large corpora of text using techniques like Word2Vec, GloVe, or FastText. The key idea behind word embeddings is that words that occur in similar contexts tend to have similar meanings. By training word embeddings on large text corpora, the model learns to map words with similar meanings to nearby points in the vector space. For example, in a well-trained word embedding model, the vectors for "male" and "female" are expected to be closer to each other than to the vector for "apple".
</p>
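<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The idea of "nearby points in the vector space" is commonly measured with cosine similarity. The sketch below uses three hand-made, hypothetical 3-dimensional vectors (not output from any real model) purely to show how the comparison works.</p>

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only;
# real models produce hundreds or thousands of dimensions.
vectors = {
    "male":   [0.9, 0.8, 0.1],
    "female": [0.8, 0.9, 0.1],
    "apple":  [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_male_female = cosine_similarity(vectors["male"], vectors["female"])
sim_male_apple = cosine_similarity(vectors["male"], vectors["apple"])
print(f"male vs female: {sim_male_female:.3f}")
print(f"male vs apple:  {sim_male_apple:.3f}")
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>With these toy vectors, "male" and "female" point in almost the same direction, while "apple" points elsewhere, so the first similarity is much higher than the second.</p>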

<p style = 'font-size:16px;font-family:Arial;color:#00233C'> Large Language Models are trained on huge amounts of data, enabling them to learn patterns, grammar, and context from a wide range of topics. They can be fine-tuned for specific tasks such as question answering, natural language understanding, and text generation, and have a wide range of uses across various domains due to their ability to understand and generate human-like text.<br>In this demo we will use the text-embedding-3-small model from OpenAI to generate our embeddings. ClearScape Analytics can integrate with any open-source, OpenAI, or cloud-provider-specific LLM (AWS SageMaker/AWS Bedrock). Please refer to the demo index for demos on other integrations with TDApiClient. </p>


<p style='font-size:16px;font-family:Arial;color:#00233C'>OpenAI and Azure OpenAI provide multiple APIs for their hosted models. We introduce integration with the embedding API, which can be used in various types of applications: Classification, Search, Recommendations, and Anomaly detection. For more information on our Teradata API Integration, click <a href='https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-VantageTM-API-Integration-Guide-for-Cloud-Machine-Learning/Teradata-Partner-API/Welcome-to-Teradata-API-Integration'>here.</a></p>

<p style='font-size:16px;font-family:Arial;color:#00233C'>Under the hood, we will use the OpenAI embeddings method to generate the embeddings. OpenAI embeddings are vector representations of text that we can use to represent comments in a way that captures their semantic meaning. To generate embeddings for the comments table, we will use the comment_text field. We will employ the OpenAI Embeddings API to generate an embedding for each comment. Please refer to the <a href="https://platform.openai.com/docs/guides/embeddings"> Embeddings documentation</a> for more information about embeddings and types of models available.</p>

<p style='font-size:16px;font-family:Arial;color:#00233C'>The OpenAI Embeddings API takes a text string as input and returns a vector of numbers that represent the embedding. The length of the vector depends on the model that we are using. For example, the <b>text-embedding-3-small</b> model returns a vector of 1536 numbers.</p>

In [None]:
df1=tdf_sample.drop(['comment_summary'],axis=1)


<div class="alert alert-block alert-warning">
    <p style = 'font-size:16px;font-family:Arial;color:#00233C'><i><b>Note:</b> In this section, we are creating the OpenAI embeddings for 5000 comment texts. It will cost us a few dollars on our OpenAI account.</i></p>
</div>

<hr style='height:1px;border:none;background-color:#00233C;'>
<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>5.1 Get the OpenAI API key</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will first create a function that generates embeddings from the model we are using. We use the <b>text-embedding-3-small</b> model; you are free to change the model as per your needs.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In order to utilize this demo, you will need an OpenAI API key. If you do not have one, please refer to the instructions provided in this guide to obtain your OpenAI API key: </p>


<a href="..//Openai_setup_api_key/Openai_setup_api_key.md" style="text-decoration:none;" target="_blank"><button style="font-size:16px;font-family:Arial;color:#fff;background-color:#00233C;border:none;border-radius:5px;cursor:pointer;height:50px;line-height:50px;display:flex;align-items:center;">OpenAI API Key Guide <span style="margin-left:10px;">&#8658;</span></button>
</a>

In [None]:
import getpass

# enter your openai api key
api_key = getpass.getpass("\n Please Enter OpenAI API key: ")

In [None]:
from tdapiclient import TDApiClient, create_tdapi_context

# set embedding model from openai models
embedding_model = "text-embedding-3-small"

def generate_embeddings_tdapiclient(tdf, api_key, text_column):
    return TDApiClient.API_Request(dataframe=tdf, 
                                   api_type="open-ai-embedding",
                                   model_name=embedding_model,
                                   authorization ='''{{"Key":"{}"}}'''.format(api_key),
                                   text_column=text_column)


<hr style='height:1px;border:none;background-color:#00233C;'>

<p style = 'font-size:18px;font-family:Arial;color:#00233c'><b>5.2 Do you want to generate the embeddings?</b></p>    
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have already generated embeddings for the comment text and stored them in a <b>Vantage</b> table.</p>

<center><img src="images/decision_emb_gen_cmt2.svg" alt="embeddings_decision" width=300 height=400/></center>

<div class="alert alert-block alert-info">
    <p style = 'font-size:16px;font-family:Arial;color:#00233C'><i><b>Note: If you would like to skip the embedding generation step to save time and move quickly to the next step, please enter "No" in the next prompt.</b></i></p>
</div>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To save time, you can move to the already generated embeddings section. However, if you would like to see how we generate the embeddings, or if you need to generate the embeddings for a different dataset, then continue to the following section.</p>

In [None]:
def get_section52_desc_start():
    return """<div class="alert alert-block alert-warning">
    <p style = 'font-size:16px;font-family:Arial;color:#00233C'><i><b>Please be patient. The embedding generation step is estimated to take approximately 5 minutes to complete. </b></i></p>
</div>"""

In [None]:
user_confirmation = input("Do you want to generate the embeddings? (yes/no): ")

# Ignore surrounding whitespace and letter case in the answer
if user_confirmation.strip().lower() == 'yes':
    print("Generating the embeddings...")
    display(Markdown(get_section52_desc_start()))
    comment_embeddings = generate_embeddings_tdapiclient(df1, api_key, "comment_text")
    copy_to_sql(comment_embeddings, table_name="comment_embeddings",primary_index='comment_id', if_exists='append')
    comment_embeddings = DataFrame("comment_embeddings")
    print("Embeddings generated successfully.")
else:
    print("Loading the embeddings from Vantage table...")
    comment_embeddings = DataFrame(in_schema('DEMO_Retail', 'Comment_Embeddings'))  
    print("Existing embeddings loaded from the Comment_Embeddings table.")

In [None]:
print("Data information: \n",comment_embeddings.shape)
comment_embeddings

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>6. KMeans clustering using the embeddings</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In the previous step we created embeddings from our text data. The generated embeddings are the features that can be used in machine learning algorithms. We will use KMeans clustering to group the comments into different clusters.<br>First, let us generate the column list to be used in the KMeans function.</p>

In [None]:
embedding_column_list = comment_embeddings.columns
embedding_column_list.remove("comment_id")
embedding_column_list.remove("customer_id")
embedding_column_list.remove("comment_text")

In [None]:
# Run KMeans to find the clustering based on embeddings.
kmeans_out = KMeans(
    id_column="comment_id",
    data=comment_embeddings,
    target_columns=embedding_column_list,
    output_cluster_assignment=True,
    num_clusters=7
)

In [None]:
kmeans_df=kmeans_out.result
kmeans_df

In [None]:
d2 = kmeans_df.groupby('td_clusterid_kmeans').count()
d2
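
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For intuition about what a KMeans step does with the embedding columns, the following NumPy sketch runs the classic Lloyd's algorithm on small synthetic data. This is an illustration under our own assumptions, not the in-database KMeans implementation: the <code>kmeans</code> helper, the synthetic blobs, and all parameter values are hypothetical.</p>

```python
import numpy as np


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm: assign to nearest centroid, then re-average."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Distance from every point to every centroid, shape (n, k).
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its members; keep it if empty.
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else centroids[j]
            for j in range(k)
        ])
    return labels, centroids


# Two well-separated synthetic blobs standing in for embedding vectors.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(50, 4))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(50, 4))
points = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(points, k=2)
print(np.bincount(labels))  # number of points assigned to each cluster
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The per-cluster counts printed at the end play the same role as the grouped counts computed above for the real comment clusters.</p>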

In [None]:
# Convert to pandas DataFrame 
plot2 = d2.to_pandas()

# Plotting the bar chart
ax = plot2.plot(kind='bar', x='td_clusterid_kmeans', y='count_comment_id', color='skyblue', edgecolor='black', figsize=(8, 6))

# Add labels and title
plt.title('Comments in each cluster')
plt.xlabel('Cluster_Id')
plt.ylabel('Counts')

# Show count on top of bars
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')

# Keep x-labels horizontal
plt.xticks(rotation=0)

# Show the plot
plt.show()


<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>7. PCA</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Principal Component Analysis (PCA) is a technique used for dimensionality reduction in data analysis and machine learning. It works by transforming the original high-dimensional data into a lower-dimensional space while retaining as much of the original variance as possible. PCA achieves this by identifying the principal components, which are the directions in feature space along which the data varies the most. These principal components are computed as the eigenvectors of the covariance matrix of the standardized data, and they represent the most significant sources of variation in the data. By selecting a subset of the principal components that capture the most variance, PCA allows for a more compact representation of the data while preserving its essential structure and relationships. The transformed data can be used for visualization, feature extraction, noise reduction, and other analysis tasks, making PCA a powerful tool for data exploration and dimensionality reduction.</p>
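<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The steps described above (standardize, compute the covariance matrix, take its leading eigenvectors, project) can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the code that runs in the database; the <code>pca</code> helper and the data are hypothetical.</p>

```python
import numpy as np


def pca(data, n_components):
    """Project data onto its top principal components."""
    # Standardize each column to zero mean and unit variance.
    standardized = (data - data.mean(axis=0)) / data.std(axis=0)
    # Covariance matrix of the standardized features.
    covariance = np.cov(standardized, rowvar=False)
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    # Share of total variance captured by each kept component.
    explained = eigenvalues[order] / eigenvalues.sum()
    return standardized @ components, explained


# Synthetic data: 200 rows, 5 columns; the first four columns are noisy
# copies of one signal, the fifth is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
data = np.hstack(
    [base + 0.05 * rng.normal(size=(200, 1)) for _ in range(4)]
    + [rng.normal(size=(200, 1))]
)

scores, explained = pca(data, n_components=2)
print(scores.shape)
print(explained.round(3))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Because four of the five columns carry nearly the same information, two components are enough to capture almost all of the variance, which is the same effect we rely on when reducing the embedding columns.</p>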

In [None]:
pca_df=comment_embeddings.join(other = kmeans_df, on = ["comment_id"], how = "inner",lprefix = "emb", rprefix = "kmeans")
pca_df

In [None]:
pca_obj = valib.PCA(data=pca_df,
                        columns=embedding_column_list)

<div class="alert alert-block alert-info" id="no-azure">
    <p style = 'font-size:16px;font-family:Arial;color:#00233C'><i><b>Note</b>: Please be patient. We are currently performing some mathematical calculations. This process may take 3-5 minutes.</i></p>
</div>

In [None]:
# Get PCA scores using the model generated above
obj = valib.PCAPredict(data=pca_df,
                           model=pca_obj.result,
                           index_columns="emb_comment_id")

In [None]:
# Print the results
obj.result

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As we can see above, PCA has reduced the 300+ embedding dimensions to 80+ factors. We will plot the first 2 factors to see how our clusters look.</p>

In [None]:
out_reduced_df = obj.result.select(['emb_comment_id','Factor 1','Factor 2'])

In [None]:
out_reduced_df

In [None]:
# Join the KMeans output with dataframe with reduced number of columns.
final_df=kmeans_df.join(other = out_reduced_df, on = ["comment_id = emb_comment_id"], how = "inner",lprefix = "l", rprefix = "r")
final_df

In [None]:
plot3=final_df.to_pandas().reset_index()

In [None]:
plt.figure(figsize=(10, 8))
scatter = plt.scatter(plot3['Factor 1'], plot3['Factor 2'], c=plot3['td_clusterid_kmeans'], cmap='viridis')
plt.title('PCA Visualization of Clusters')
plt.legend(*scatter.legend_elements(), title='Clusters')
plt.show()

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>8. Term Frequency-Inverse Document Frequency</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic used in NLP to evaluate the importance of a word in a document relative to a collection of documents. It is calculated by multiplying two factors: the term frequency (TF), which measures how frequently a word occurs in a document, and the inverse document frequency (IDF), which penalizes words that are common across multiple documents in the collection. TF-IDF assigns higher weights to words that are frequent in a document but rare in other documents, allowing it to capture the discriminative power of words in distinguishing documents. This technique is commonly used for text mining, document classification, search engine ranking, and other tasks where the relevance of words needs to be assessed within a corpus of text data.</p>
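<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a worked example of the calculation, the sketch below computes a textbook TF-IDF in plain Python, with TF taken as the term count divided by document length and IDF as the natural log of (number of documents / documents containing the term). These formulas and the sample documents are assumptions for illustration; in-database functions may apply different normalization options.</p>

```python
import math
from collections import Counter

# Hypothetical tokenized documents, invented for illustration.
documents = {
    "doc1": ["dress", "fit", "perfect", "dress"],
    "doc2": ["dress", "color", "fade"],
    "doc3": ["shoe", "fit", "tight"],
}


def tf_idf(documents):
    """Return {doc_id: {term: score}} with tf = count / len, idf = ln(N / df)."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(
        term for tokens in documents.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)
        scores[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


scores = tf_idf(documents)
# Print the terms of doc1 from most to least distinctive.
for term, value in sorted(scores["doc1"].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {value:.3f}")
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Although "dress" appears twice in doc1, "perfect" scores higher because it appears in no other document, which is the discriminative effect described above.</p>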

In [None]:
title_input=tdf_sample.join(other = kmeans_df, on = ["comment_id"],how = "inner",lprefix = "l", rprefix = "r")

In [None]:
title_df = title_input.assign(
    drop_columns=True,
    comment_id=title_input.l_comment_id,
    customer_id=title_input.customer_id,
    comment_text=title_input.comment_text,
    comment_summary=title_input.comment_summary,
    cluster_id=title_input.td_clusterid_kmeans
 )

In [None]:
title_df

In [None]:
copy_to_sql(df = title_df, table_name = 'title_comments', if_exists = 'replace',primary_index = "comment_id")

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
A text parser, also known as a text tokenizer, breaks a text into its constituent parts, such as words, phrases, sentences, or other meaningful units. The <b>TD_TextParser</b> function performs the following operations:
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'><li>Tokenizes the text in the specified column</li>
    <li>Removes the punctuations from the text and converts the text to lowercase</li>
    <li>Removes stop words from the text and converts the remaining words to their root forms</li>
    <li>Creates a row for each word in the output table</li>
    <li>Performs stemming; that is, the function identifies the common root form of a word by removing or replacing word suffixes</li>
    </ul>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'> The output table generated by TD_TextParser is fed to the <b>TD_TFIDF</b> function. The TD_TFIDF function represents each document as an N-dimensional vector, where N is the number of terms in the document set (therefore, the document vector is sparse). Each entry in the document vector is the TF-IDF score of a term.</p>    
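<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The parsing steps listed above can be approximated in plain Python to see what kind of rows come out. This is a rough sketch only: the stop-word list, the suffix rules in <code>naive_stem</code>, and the sample sentence are hypothetical and far cruder than a real parser or stemmer.</p>

```python
import re

# Hypothetical stop-word list and suffix rules, much smaller than a real parser's.
STOP_WORDS = {"the", "a", "an", "is", "and", "it", "i", "this", "was"}
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def parse(doc_id, text):
    """Return one (doc_id, token) row per remaining word."""
    lowered = text.lower()
    # Replace punctuation with spaces so words stay separated.
    no_punct = re.sub(r"[.,?!-]", " ", lowered)
    tokens = [t for t in no_punct.split() if t not in STOP_WORDS]
    return [(doc_id, naive_stem(t)) for t in tokens]


rows = parse(1, "I loved this shirt, and it fits perfectly!")
print(rows)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Each output row pairs the document identifier with one token, which is the long, one-word-per-row shape that a TF-IDF step expects as input.</p>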

In [None]:
# Raw string so the backslashes in the Punctuation pattern are passed through unchanged
qry1=r'''
CREATE MULTISET TABLE tfidf_input_tokenized AS (
SELECT comment_id, cast(token as varchar(15)) as token, cluster_id FROM TD_TextParser (
ON title_comments AS InputTable
USING
TextColumn ('comment_text')
ConvertToLowerCase ('true')
OutputByWord ('true')
Punctuation ('\[.,-?\!\]')
RemoveStopWords ('true')
StemTokens ('true')
Accumulate ('comment_id','cluster_id')
) AS dt ) WITH DATA;
'''

qry2='''CREATE MULTISET TABLE tfidf_comments AS (
SELECT * FROM TD_TFIDF (
   ON tfidf_input_tokenized  AS InputTable
   USING
   DocIdColumn ('cluster_id')
   TokenColumn ('token')
   TFNormalization ('LOG')
   IDFNormalization ('SMOOTH')
   Regularization ('L2')
   --Accumulate ('cluster_id')
) AS dt ) WITH DATA;
'''

# Execute the queries
execute_sql(qry1)
execute_sql(qry2)


In [None]:
tfidf_comments = DataFrame("tfidf_comments")

In [None]:
tfidf_comments

In [None]:
window = tfidf_comments.window(partition_columns="cluster_id",
                               order_columns="TD_TF_IDF"
                              )

# Execute rank() on a window.
df = window.rank()
df.sort('col_rank')

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>From the output above, we can see the frequency and importance of each word in each cluster.</p>
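<p style = 'font-size:16px;font-family:Arial;color:#00233C'>When reading such a ranking, the most distinctive words in a cluster are the ones with the highest TF-IDF scores, so it often helps to sort in descending order and keep only the top few per cluster. The plain-Python sketch below shows the idea on made-up rows; the <code>top_terms</code> helper and the scores are hypothetical.</p>

```python
from collections import defaultdict

# Hypothetical (cluster_id, token, tf_idf) rows, invented for illustration.
rows = [
    (0, "dress", 0.42), (0, "fit", 0.31), (0, "color", 0.12),
    (1, "shoe", 0.55), (1, "tight", 0.20), (1, "fit", 0.18),
]


def top_terms(rows, n):
    """Return {cluster_id: [tokens]} with the n highest-scoring tokens first."""
    by_cluster = defaultdict(list)
    for cluster_id, token, score in rows:
        by_cluster[cluster_id].append((score, token))
    return {
        cluster_id: [token for _, token in sorted(pairs, reverse=True)[:n]]
        for cluster_id, pairs in by_cluster.items()
    }


print(top_terms(rows, n=2))
```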

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Conclusion</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this demo we have seen how to analyze and pre-process text data in Vantage using InDb functions and how to integrate with third-party LLMs. </p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>9. Cleanup</b></p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C;color:#00233C'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C;'>
We need to clean up our work tables to prevent errors next time.</p>

In [None]:
tables = ['tfidf_comments','tfidf_input_tokenized','title_comments','comment_embeddings']

# Loop through the list of tables and execute the drop table command for each table
for table in tables:
    try:
        db_drop_table(table_name = table)
    except Exception:
        # The table may not exist (for example, if that step was skipped); ignore and continue
        pass

<p style = 'font-size:18px;font-family:Arial;color:#00233C;color:#00233C'><b>Databases and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will use the following code to clean up tables and databases created for this demonstration.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_Retail');" 
#Takes 20 seconds

In [None]:
remove_context()

<hr style="height:2px;border:none;background-color:#00233C;">

<b style = 'font-size:20px;font-family:Arial;color:#00233C'>Required Materials</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let’s look at the elements we have available for reference for this notebook:</p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Filters:</b></p>
    <ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><b>Industry:</b> Retail</li>
    <li><b>Functionality:</b> Text Analysis</li>
    <li><b>Use Case:</b> Natural Language Processing</li>
    </ul>
    <p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Related Resources:</b></p>
    <ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><a href = 'https://www.teradata.com/Blogs/NPS-is-a-metric-not-the-goal'>In the fight to improve customer experience, NPS is a metric, not the goal</a></li>
    <li><a href = 'https://www.teradata.com/insights/ai-and-machine-learning/using-natural-language-to-query-teradata-vantagecloud-with-llms'>Using Natural Language to query Teradata Vantage Cloud with LLMs</a></li>
    </ul>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Reference Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'> 
       <li>Teradata Vantage™ - Analytics Database Analytic Functions - 17.20: <a href = 'https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Introduction-to-Analytics-Database-Analytic-Functions '>https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-VantageTM-Analytics-Database-Analytic-Functions-17.20/Introduction-to-Analytics-Database-Analytic-Functions </a></li>    
  <li>Teradata® Package for Python User Guide - 17.20: <a href = 'https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-Package-for-Python-User-Guide-17.20/Introduction-to-Teradata-Package-for-Python'>https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-Package-for-Python-User-Guide-17.20/Introduction-to-Teradata-Package-for-Python</a></li>
  <li>Teradata® Package for Python Function Reference - 17.20: <a href = 'https://docs.teradata.com/r/Enterprise/Teradata-Package-for-Python-Function-Reference-17.20/Teradata-Package-for-Python-Function-Reference'>https://docs.teradata.com/r/Enterprise/Teradata-Package-for-Python-Function-Reference-17.20/Teradata-Package-for-Python-Function-Reference</a></li>      
  <li>Teradata® API Integration Guide for Cloud Machine Learning: <a href = 'https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-VantageTM-API-Integration-Guide-for-Cloud-Machine-Learning/Teradata-Partner-API'>https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/Teradata-VantageTM-API-Integration-Guide-for-Cloud-Machine-Learning/Teradata-Partner-API</a></li>    
</ul>

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2024. All Rights Reserved
        </div>
    </div>
</footer>