<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Semantic Clustering using Open Source Language Models in Database
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>Introduction:</b></p>
<p style = 'font-size:16px;font-family:Arial'>Semantic clustering groups pieces of text by meaning rather than by exact wording. It builds on semantic similarity, the degree to which two pieces of text relate in meaning, which captures relationships between words, sentences, or documents that traditional keyword-based methods might miss. Semantic similarity supports applications such as:</p>
<ul style = 'font-size:16px;font-family:Arial'>
            <li><strong>Information Retrieval:</strong> Improves search engines by retrieving documents or results that are semantically related to the user's query.</li>
            <li><strong>Text Classification:</strong> Categorizes text into predefined classes based on meaning, useful in spam detection, sentiment analysis, etc.</li>
            <li><strong>Question Answering:</strong> Matches questions to relevant answers by understanding their meaning.</li>
            <li><strong>Recommendation Systems:</strong> Suggests items (like products or content) based on similarity in user preferences or behaviors.</li>
            <li><strong>Plagiarism Detection:</strong> Identifies copied content even if it has been paraphrased.</li>
</ul>
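<p style = 'font-size:16px;font-family:Arial'>One common way to quantify how closely two texts relate in meaning is the cosine similarity between their embedding vectors. The sketch below is illustrative only: the three-dimensional vectors and their names are made up, whereas real embedding models produce hundreds of dimensions.</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", invented for illustration.
card_fee = [0.9, 0.1, 0.0]
card_charge = [0.8, 0.2, 0.1]
mortgage_delay = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(card_fee, card_charge)
sim_unrelated = cosine_similarity(card_fee, mortgage_delay)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

<p style = 'font-size:16px;font-family:Arial'>Vectors pointing in similar directions score close to 1, while vectors with little in common score close to 0.</p>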

<p style = 'font-size:18px;font-family:Arial'><b>Why Vantage?</b></p>

<p style = 'font-size:16px;font-family:Arial'>
    Teradata’s integration of <b>LLMs and in-database model hosting</b>, together with the Open Analytics Framework, enables customers to run NLP models at scale. On-premises customers often face data movement latency and lack access to cloud-hosted models. By bringing language models into Vantage, Teradata lets these customers run NLP models without moving large amounts of data to and from external services.
</p>

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>Downloading and installing additional software needed</b></p>

In [None]:
%%capture

!pip install wordcloud nltk --quiet

In [None]:
%%capture

!pip install --force-reinstall pillow --quiet

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial'><b>Note: </b><i>The above libraries have to be installed. Restart the kernel after executing these cells to bring the installed libraries into memory. The simplest way to restart the Kernel is by typing <b> 0 0</b></i> (zero zero) and pressing <i>Enter</i>.</p>
</div>
<p style = 'font-size:16px;font-family:Arial'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
# Standard libraries
import time
import warnings

# Data manipulation and visualization libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

# Teradata libraries
from teradataml import (
    create_context, 
    delete_byom, 
    display,
    execute_sql,
    save_byom,
    remove_context,
    in_schema,
    KMeans,
    DataFrame,
    db_drop_table,
    db_drop_view,
    configure,
    ONNXEmbeddings,
    copy_to_sql
)
display.max_rows = 5

configure.byom_install_location = "mldb"

# NLP libraries
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

# Machine learning libraries
from sklearn.manifold import TSNE

# Suppress warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>1. Initiate a connection to Vantage</b>
<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Language_Model_SemanticClustering.ipynb;' UPDATE FOR SESSION; ''')

<p style = 'font-size:16px;font-family:Arial'>Begin running steps with Shift + Enter keys. </p>

<p style = 'font-size:18px;font-family:Arial'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but consumes available space. The following cell contains two statements, one of which is commented out. To switch modes, move the comment marker to the other statement.</p>

In [None]:
# %run -i ../run_procedure.py "call get_data('DEMO_ComplaintAnalysis_cloud');"        # Takes 1 minute
%run -i ../run_procedure.py "call get_data('DEMO_ComplaintAnalysis_local');"        # Takes 2 minutes

<p style = 'font-size:16px;font-family:Arial'>Next is an optional step – if you want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"        # Takes 10 seconds

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>2. Confirmation for Model</b></p>
<p style = 'font-size:16px;font-family:Arial'>Before starting let us confirm that the required model is installed.</p>
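<p style = 'font-size:16px;font-family:Arial'>The check in the next cell adds the results of two scalar subqueries and compares the sum to 2. In standard SQL a scalar subquery that matches no row yields NULL, and NULL plus a number is NULL, so the sum equals 2 only when both the model and the tokenizer are present. The sketch below illustrates that logic with Python's built-in <code>sqlite3</code> module as a stand-in for Vantage; the table names and the <code>demo-model</code> id are hypothetical.</p>

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for Vantage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (model_id TEXT)")
conn.execute("CREATE TABLE tokenizers (model_id TEXT)")
conn.execute("INSERT INTO models VALUES ('demo-model')")  # tokenizer row deliberately missing

CHECK = """
SELECT (SELECT 1 FROM models WHERE model_id = ?) +
       (SELECT 1 FROM tokenizers WHERE model_id = ?) AS cnt
"""

def installed(model_id):
    """True only when both subqueries find a row, so the sum is exactly 2."""
    cnt = conn.execute(CHECK, (model_id, model_id)).fetchone()[0]
    return cnt == 2

before = installed("demo-model")   # tokenizer missing, sum is NULL
conn.execute("INSERT INTO tokenizers VALUES ('demo-model')")
after = installed("demo-model")    # both rows present
print(before, after)
```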

In [None]:
model_name = "bge-small-en-v1.5"

In [None]:
# Alias IPython's display so it does not shadow the teradataml display options object imported earlier
from IPython.display import display as ipy_display, Markdown

df_check= DataFrame.from_query(f'''select (select 1 as cnt from embeddings_models where model_id = '{model_name}') +
(select 1 as cnt from embeddings_tokenizers where model_id =  '{model_name}') as cnt''')
if df_check.get_values()[0][0] == 2:
    print('Model is installed, please continue.')
else:
    print('Model is not installed, please run the Initialization notebook before proceeding further.')
    ipy_display(Markdown("[Initialization Notebook](./Initialization_and_Model_Load.ipynb)"))

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>3. Create the Embeddings</b>

In [None]:
tdf = DataFrame(in_schema('DEMO_ComplaintAnalysis', 'consumer_complaints'))
tdf

In [None]:
tdf.shape

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial;'>3.1 Generate Embeddings with ONNXEmbeddings</b>

<p style = 'font-size:16px;font-family:Arial;'>
We now generate the embeddings with <b>ONNXEmbeddings</b>, starting with a small subset of records. The model is <b>loaded into the cache memory on each node</b>, and Teradata's <b>Massively Parallel Processing (MPP)</b> architecture computes the embeddings in parallel using <b>ONNX Runtime</b> on each node.<br>Generating embeddings for the entire training set can still be time-consuming on a system with limited resources. The <b>ClearScape Analytics experience</b> provides only a <b>4 AMP system</b> with constrained RAM and CPU power. To ensure smooth execution, we test embedding generation on a small sample and use <b>pre-calculated embeddings</b> for the remainder of the demo. A production system would typically have several hundred AMPs and far more compute power.<br>The most important input parameters of the <b>ONNXEmbeddings</b> function are:</p>
<ul style = 'font-size:16px;font-family:Arial;'>
<li><b>InputTable</b>: The source table containing the text to be embedded. </li>
<li><b>ModelTable</b>: The table storing the ONNX model.                    </li>
<li><b>TokenizerTable</b>: The table storing the tokenizer JSON file.       </li>
<li><b>Accumulate</b>: Specifies additional columns to retain in the output. </li>
<li><b>OutputFormat</b>: Specifies the data format of the output embeddings (<b>FLOAT32(384)</b>, matching the model's output dimension).</li>
</ul>
<p style = 'font-size:16px;font-family:Arial;'>
Since embedding generation is computationally expensive, we only process <b>100 records for testing</b> and rely on precomputed embeddings for further analysis.  
</p>

In [None]:
tdf_sample = tdf.iloc[:100, :]
tdf_sample=tdf_sample.assign(drop_columns = True,
                             id = tdf_sample.complaint_id,
                             txt= tdf_sample.consumer_complaint_narrative)
tdf_sample

In [None]:
my_model = DataFrame.from_query(f"select * from embeddings_models where model_id = '{model_name}'")
my_tokenizer = DataFrame.from_query(f"select model as tokenizer from embeddings_tokenizers where model_id = '{model_name}'")

In [None]:
number_dimensions_output = 384

In [None]:
DF_embeddings_sample = ONNXEmbeddings(
    newdata = tdf_sample,
    modeldata = my_model, 
    tokenizerdata = my_tokenizer, 
    accumulate = ["id", "txt"],
    model_output_tensor = "sentence_embedding",
    output_format = f'FLOAT32({number_dimensions_output})',
    enable_memory_check = False
).result

In [None]:
DF_embeddings_sample

<p style = 'font-size:16px;font-family:Arial;'>The output above shows the embeddings generated for each consumer_complaint_narrative in the sample. For further analysis, we use the precomputed embeddings.</p>

In [None]:
tdf_embeddings_store = DataFrame(in_schema('DEMO_ComplaintAnalysis', 'Complaints_Embeddings_Store'))

In [None]:
tdf_embeddings_store.shape

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>4. Run K-Means on the Embeddings Store and then build final table with Cluster ID assignments to rows</b>

<p style = 'font-size:16px;font-family:Arial'>The KMeans() function groups a set of observations into k clusters, in which each observation belongs to the cluster with the nearest mean (the cluster center, or centroid). The algorithm minimizes an objective function: the total squared Euclidean distance between each data point and the center of its assigned cluster.</p>
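<p style = 'font-size:16px;font-family:Arial'>The idea behind K-means can be sketched in a few lines of plain Python. This is a minimal illustration on made-up two-dimensional points with a naive initialization, not the in-database implementation: it alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its members, then reports the within-cluster sum of squared distances.</p>

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points]

def update(points, labels, centroids):
    """Mean of the points in each cluster; an empty cluster keeps its old centroid."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
        else:
            new_centroids.append(old)
    return new_centroids

def inertia(points, labels, centroids):
    """Within-cluster sum of squared distances (the quantity being minimized)."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))

# Made-up 2-D points forming two obvious groups.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]]
centroids = [points[0], points[3]]  # naive initialization, for illustration only

for _ in range(10):  # cap on iterations, similar in spirit to iter_max
    labels = assign(points, centroids)
    new_centroids = update(points, labels, centroids)
    if new_centroids == centroids:  # converged: centroids stopped moving
        break
    centroids = new_centroids

total = inertia(points, labels, centroids)
print(labels, centroids, total)
```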

In [None]:
embedding_column_list = [col for col in tdf_embeddings_store.columns if col not in ["id", "txt"]]

num_clusters = 4
kmeans_out = KMeans(
    id_column="id",
    data=tdf_embeddings_store,
    target_columns=embedding_column_list,
    output_cluster_assignment=True,
    num_init=10,
    num_clusters=num_clusters,
    iter_max=50,
)

<p style = 'font-size:16px;font-family:Arial'>The output below shows cluster assignment for each row.</p>

In [None]:
kmeans_out.result

<p style = 'font-size:16px;font-family:Arial'>Let's check how many data points each cluster has.</p>

In [None]:
kmeans_df = kmeans_out.result
d2 = kmeans_df.groupby('td_clusterid_kmeans').count()
d2

In [None]:
clustered_df = tdf_embeddings_store.join(
    other = kmeans_df,
    on = ["id"],
    how = "inner",
    lprefix="l",
    rprefix="r"
)

<p style = 'font-size:16px;font-family:Arial'>Let's check a sample of data points from cluster number 2.</p>

In [None]:
final_df = clustered_df[["l_id","txt","td_clusterid_kmeans"]]
final_df[final_df.td_clusterid_kmeans == 2]

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>5. Visualization</b>

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>5.1 WordCloud Visualization</b>

<p style = 'font-size:16px;font-family:Arial'>Let's visualize all the clusters through wordcloud visualization.</p>
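<p style = 'font-size:16px;font-family:Arial'>The cell below also names each cluster by joining its top-scoring terms. As a simplified sketch of that naming step, the example here uses plain term counts instead of TF-IDF scores; the stop-word list and the complaints are made up for illustration.</p>

```python
from collections import Counter

# Hypothetical stop words and complaints, for illustration only.
STOP_WORDS = {"the", "a", "my", "on", "was", "i", "for", "and", "to", "xxxx"}
complaints = [
    "The late fee on my credit card was wrong",
    "I was charged a late fee for my credit card",
    "Credit card fee charged to XXXX",
]

def top_terms(texts, k=3):
    """Return the k most frequent non-stop-word tokens, most frequent first."""
    tokens = [tok for text in texts for tok in text.lower().split()]
    counts = Counter(tok for tok in tokens if tok not in STOP_WORDS)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]

cluster_name = "_".join(top_terms(complaints)).upper()
print(cluster_name)
```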

In [None]:
# Build a word cloud for each cluster
for i in range(0, num_clusters):
    filtered_df = final_df[final_df.td_clusterid_kmeans == i]
    df = filtered_df.to_pandas()
    total_rows = len(df)

    sw = ['x','xx','xxx','xxxx','xxxxx','xxxxxx']
    es = list(set(stopwords.words('english')))
    es.extend(sw)

    text_tokens = word_tokenize(' '.join(df['txt']),preserve_line=True)
    l_text_tokens = [item.lower() for item in text_tokens]
    tokens_without_sw = [word for word in l_text_tokens if word not in es]

    all_text = pd.Series(tokens_without_sw)

    vectorizer = TfidfVectorizer(ngram_range=(1,4))
    tfidf_matrix = vectorizer.fit_transform(all_text)

    col_sum = tfidf_matrix.sum(axis=0).A.squeeze()
    k = 5
    top_indices = np.argsort(col_sum)[-k:][::-1]

    dense = tfidf_matrix.todense()
    df = pd.DataFrame(dense, columns=vectorizer.get_feature_names_out())
    feature_names = vectorizer.get_feature_names_out()
    top_terms = feature_names[top_indices]
    cluster_name = '_'.join(top_terms).upper()

    print("Cluster #" + str(i)+": "+cluster_name+"\n # of Rows: "+str(total_rows)+"\n")
    # Generate a word cloud
    wordcloud = WordCloud(width=800,
                          height=400,
                          background_color='white',
                          collocations=True,
                          max_words=100,
                          min_word_length=1,
                         ).generate_from_frequencies(df.T.sum(axis=1))

    # Plot the word cloud
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='gaussian')
    plt.axis('off')
    plt.title('Word Cloud for Text Data (TF-IDF)')
    plt.show()

<hr style='height:1px;border:none;'> 

<p style='font-size:18px;font-family:Arial'><b>5.2 Visualization of Clusters with Complaints</b></p> 

<p style='font-size:16px;font-family:Arial'>The graph displays the clustering of complaints into distinct groups. The data has been divided into the 4 clusters set by <code>num_clusters</code> above, each representing a pattern or category of complaints. This clustering helps to identify the most prevalent types of complaints, allowing for more targeted investigation and resolution efforts.</p>

In [None]:
# emb = DataFrame('kmeans_features').to_pandas()
clus = clustered_df.to_pandas()

In [None]:
tsne = TSNE(n_components=2, random_state=123)
tsne_result = tsne.fit_transform(clus.iloc[:, 3:-1])

In [None]:
tsne_df = pd.DataFrame(tsne_result, columns=['tsne_1', 'tsne_2'])
tsne_df['cluster_id'] = clus['td_clusterid_kmeans']
tsne_df['complaint_id'] = clus['l_id']

In [None]:
import plotly.io as pio
pio.renderers.default = 'notebook' 

In [None]:
# Create a new DataFrame combining t-SNE results with complaint information
tsne_complaint_df = pd.DataFrame(tsne_result, columns=['tsne_1', 'tsne_2'])
tsne_complaint_df['cluster_id'] = clus['td_clusterid_kmeans']
tsne_complaint_df['complaint_id'] = clus['l_id']
tsne_complaint_df['complaint'] = clus['txt']

# Truncate text for hover data
max_chars = 50  # Maximum characters to display
tsne_complaint_df['truncated_complaint'] = clus['txt'].apply(lambda x: x[:max_chars] + '...' if len(x) > max_chars else x)

# Plot using Plotly Express
fig = px.scatter(tsne_complaint_df, x='tsne_1', y='tsne_2', color='cluster_id',
                 hover_data=['complaint_id', 'truncated_complaint', 'cluster_id'])

fig.update_traces(marker=dict(size=15))
fig.update_layout(
    title='t-SNE Visualization of Clusters with Complaints',
    xaxis_title='dimension-1',
    yaxis_title='dimension-2',
    xaxis=dict(tickangle=45),
    width=1000,
    height=800,
    hoverlabel=dict(
        bgcolor="white",
        font_size=16,
        font_family="Rockwell"
    ),
    autosize=False,
)

# Customize the hovertemplate
fig.update_traces(hovertemplate="<b>Complaint ID:</b> %{customdata[0]}<br>"
                                 "<b>Complaint:</b> %{customdata[1]}<br>"
                                 "<b>Cluster ID:</b> %{customdata[2]}<br><extra></extra>")

fig.show()

<hr style="height:1px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>Conclusion</b></p>
<p style = 'font-size:16px;font-family:Arial'>In this demo we have seen how to take a Hugging Face embedding model (BAAI/bge-small-en-v1.5) in ONNX format and run it in parallel inside the database to create embeddings. We then applied KMeans clustering to group similar complaints.</p>

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>6. Cleanup</b>

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_ComplaintAnalysis');"        # Takes 10 seconds

In [None]:
remove_context()

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2024,2025. All Rights Reserved
        </div>
    </div>
</footer>