<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       From Complaints to Clarity:<br> Uncovering Hidden Trends in Telco Customer Feedback
 <br>       
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 150px; height: auto; margin-top: 20pt;">
  <br>
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>Introduction:</b></p>
<p style = 'font-size:16px;font-family:Arial'>
    In this notebook, we will demonstrate how <b>Teradata helps Telcos turn customer complaints into valuable insights.</b><br>
By analyzing customer feedback, we’ll reveal hidden trends and patterns, allowing companies to understand and address customer experience challenges more effectively.
<br><br>
This demo will build an interactive dashboard that provides a clear, real-time view of these trends, helping decision-makers improve customer satisfaction.
<br><br>
Discover how Teradata's <code>GenAI</code>-based approach delivers reliable and consistent results, making it a valuable tool for the telecommunications industry.
<br><br>
Tracking the evolution of topics over time is essential for understanding patterns, behaviors, and emerging trends in large datasets of text. In industries such as customer support, social media monitoring, and market research, identifying how topics shift over time can provide valuable insights for decision-making and strategy development. Traditional manual analysis methods, however, can be labor-intensive and prone to human bias.<br>
In this demo, we explore a dynamic approach to topic trend analysis by combining message embeddings with topic embeddings, leveraging vector distance calculations to measure similarity between the two.
<br><br>
While the specific example can be applied across many sectors, we’ll focus on a use case using a synthetic dataset of fictitious telco-related complaints. This dataset contains complaints about various telecommunications services, providing valuable insights into customer sentiment and trends. By categorizing these complaints by topic, businesses can gain a deeper understanding of customer concerns in the telecommunications sector and adjust their strategies to address emerging issues more effectively.<br>
To achieve this, we will:
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Generate embeddings for both customer messages and inferred/predefined topics</li>
    <li>Calculate vector distances between message and topic embeddings to assess similarity</li>
    <li>Feed the results into a dashboard to display topic trends over time, with configurable similarity thresholds and message counts</li>
</ul>
<p style = 'font-size:16px;font-family:Arial'>
<center><img src="images/workflow_topictrend.png" alt="workflow_topictrend" style="border: 4px solid #404040; border-radius: 10px;"/></center>
<p style = 'font-size:16px;font-family:Arial'>This method provides an efficient way to not only categorize messages by topic but also track how these topics evolve over time, offering actionable insights into changing customer concerns, emerging issues, and overall trends.</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:18px;font-family:Arial;'><b>Downloading and installing additional software needed</b></p>

In [None]:
%%capture
!pip install wordcloud nltk --quiet

In [None]:
%%capture
#!pip install --force-reinstall pillow --quiet

In [None]:
%%capture
#!pip install openai

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;'><b>Note: </b><i>The above libraries have to be installed. Restart the kernel after executing these cells to bring the installed libraries into memory. The simplest way to restart the Kernel is by typing <b> 0 0</b></i> (zero zero) and pressing <i>Enter</i>.</p>
</div>
<p style = 'font-size:16px;font-family:Arial;'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
# Standard libraries
import time
import warnings

# Data manipulation and visualization libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

# Teradata libraries
from teradataml import (
    create_context, 
    delete_byom, 
    display,
    execute_sql,
    save_byom,
    configure,
    ONNXEmbeddings,
    copy_to_sql,
    remove_context,
    in_schema,
    KMeans,
    KMeansPredict,
    VectorDistance,
    DataFrame,
    DATE,
    db_drop_table,
    db_drop_view,
)
display.max_rows = 5
from sqlalchemy.sql import literal_column as col

# NLP libraries
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

# machine learning libraries
from sklearn.manifold import TSNE

# Suppress warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>1. Initiate a connection to Vantage</b>
<p style = 'font-size:16px;font-family:Arial;'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Telco_Complaints_Analysis_Onnx.ipynb;' UPDATE FOR SESSION; ''')

<p style = 'font-size:16px;font-family:Arial;'>Begin running steps with Shift + Enter keys. </p>

<p style = 'font-size:18px;font-family:Arial;'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but requires available storage space. The following cell contains one statement for each mode, with one of them commented out. To switch modes, move the comment marker to the other statement.</p>

In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_Telco_Complaints_Onnx_local');" # Takes 2 minutes
#%run -i ../run_procedure.py "call get_data('DEMO_Telco_Complaints_Onnx_cloud');" # Takes 2 minutes

<p style = 'font-size:16px;font-family:Arial;'>Next is an optional step – if you want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"        # Takes 10 seconds

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>2. Load Hugging Face Model</b></p>
<p style = 'font-size:16px;font-family:Arial;'>To generate embeddings, we need an ONNX model capable of transforming text into vector representations. We use the pretrained <code>gte-base-en-v1.5</code> model from <a href="https://huggingface.co/Teradata/gte-base-en-v1.5">Teradata's Hugging Face repository</a>. The model and its tokenizer are downloaded and then stored in Vantage tables as BLOBs using the <code>save_byom</code> function.</p>

In [None]:
import os
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"

In [None]:
from huggingface_hub import hf_hub_download

model_name = "gte-base-en-v1.5"
number_dimensions_output = 768
model_file_name = "model.onnx" 

In [None]:
# Step 1: Download Model from Teradata HuggingFace Page

hf_hub_download(repo_id=f"Teradata/{model_name}", filename=f"onnx/{model_file_name}", local_dir="./")
hf_hub_download(repo_id=f"Teradata/{model_name}", filename="tokenizer.json", local_dir="./")

In [None]:
try:
    db_drop_table("embeddings_models")
except:
    pass
try:
    db_drop_table("embeddings_tokenizers")
except:
    pass

In [None]:
# Step 2: Load Models into Vantage
# a) Embedding model
save_byom(model_id = model_name, # must be unique in the models table
               model_file = f"onnx/{model_file_name}",
               table_name = 'embeddings_models' )
# b) Tokenizer
save_byom(model_id = model_name, # must be unique in the models table
              model_file = 'tokenizer.json',
              table_name = 'embeddings_tokenizers') 

<p style = 'font-size:16px;font-family:Arial;'>Verify that the model and tokenizer have been stored in Vantage.</p>

In [None]:
df_model = DataFrame('embeddings_models')
df_model

In [None]:
df_token = DataFrame('embeddings_tokenizers')
df_token

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>3. Create the Embeddings</b>

<p style = 'font-size:16px;font-family:Arial;'>Let us take a look at our data.</p>

In [None]:
DF_complaints = DataFrame(in_schema('DEMO_Telco_Complaints_Onnx', 'telco_consumer_complaints'))
DF_complaints

In [None]:
DF_complaints.shape

<hr style="height:2px;border:none;">
<b style = 'font-size:18px;font-family:Arial;'>3.1 Generate Embeddings with ONNXEmbeddings</b>

<p style = 'font-size:16px;font-family:Arial;'>
Now it's time to generate the embeddings using <b>ONNXEmbeddings</b>.<br>We run the ONNXEmbeddings function on a small subset of records. The model is <b>loaded into the cache memory on each node</b>, and Teradata's <b>Massively Parallel Processing (MPP)</b> architecture ensures that embeddings are computed in parallel using <b>ONNX Runtime</b> on each node.<br>Even so, generating embeddings for the entire dataset can be time-consuming on a system with limited resources. In the <b>ClearScape Analytics Experience</b>, only a <b>4 AMP system</b> with constrained RAM and CPU power is available. To ensure smooth execution, we test embedding generation on a small sample and use <b>pre-calculated embeddings</b> for the remainder of the demo. In a real-life scenario you would typically work with several hundred AMPs and much more compute power.<br>The most important input parameters of the <b>ONNXEmbeddings</b> function are:
<ul style = 'font-size:16px;font-family:Arial;'>
<li><b>InputTable</b> (<code>newdata</code>): The source table containing the text to be embedded.</li>
<li><b>ModelTable</b> (<code>modeldata</code>): The table storing the ONNX model.</li>
<li><b>TokenizerTable</b> (<code>tokenizerdata</code>): The table storing the tokenizer JSON file.</li>
<li><b>Accumulate</b> (<code>accumulate</code>): Specifies additional columns to retain in the output.</li>
<li><b>OutputFormat</b> (<code>output_format</code>): Specifies the data format of the output embeddings (here <b>FLOAT32(768)</b>, matching the model's output dimension).</li>
</ul>
<p style = 'font-size:16px;font-family:Arial;'>
Since embedding generation is computationally expensive, we only process <b>10 records for testing</b> and rely on precomputed embeddings for further analysis.  
</p>
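<p style = 'font-size:16px;font-family:Arial;'>As a rough local illustration of the output layout (not Vantage code): with <code>FLOAT32(768)</code> the result is expected to hold one float column per embedding dimension, which is consistent with the column range <code>emb_0:emb_767</code> addressed in section 7.2. The helper <code>to_embedding_row</code> and the toy vector below are hypothetical and only mimic that layout.</p>

```python
# Illustrative only: mimic the wide column layout of the embedding output locally.
NUMBER_DIMENSIONS_OUTPUT = 768  # same value as number_dimensions_output in this notebook


def to_embedding_row(row_id, vector):
    """Spread one embedding vector across columns emb_0 ... emb_{n-1} (hypothetical helper)."""
    if len(vector) != NUMBER_DIMENSIONS_OUTPUT:
        raise ValueError(f"expected {NUMBER_DIMENSIONS_OUTPUT} values, got {len(vector)}")
    row = {"row_id": row_id}
    row.update({f"emb_{i}": float(v) for i, v in enumerate(vector)})
    return row


toy_vector = [0.0] * NUMBER_DIMENSIONS_OUTPUT  # placeholder values, not a real embedding
row = to_embedding_row(1, toy_vector)
print(list(row)[:3], "...", list(row)[-1])
```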


In [None]:
configure.byom_install_location = "mldb"

In [None]:
DF_sample10 = DataFrame.from_query("SELECT TOP 10 t.row_id, t.txt  FROM DEMO_Telco_Complaints_Onnx.telco_consumer_complaints t")

In [None]:
my_model = DataFrame.from_query(f"select * from embeddings_models where model_id = '{model_name}'")
my_tokenizer = DataFrame.from_query(f"select model as tokenizer from embeddings_tokenizers where model_id = '{model_name}'")

In [None]:
DF_embeddings_sample = ONNXEmbeddings(
    newdata = DF_sample10,
    modeldata = my_model, 
    tokenizerdata = my_tokenizer, 
    accumulate = ["row_id", "txt"],
    model_output_tensor = "sentence_embedding",
    output_format = f'FLOAT32({number_dimensions_output})',
    enable_memory_check = False
).result

In [None]:
DF_embeddings_sample.show_query()

In [None]:
DF_embeddings_sample

<p style = 'font-size:16px;font-family:Arial;'>The output shows the embeddings generated for each complaint text in the sample.</p>

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>4. Topic Generation</b>

<p style = 'font-size:16px;font-family:Arial;'>When identifying topics from or for textual data, there are generally two approaches:  
<ol style = 'font-size:16px;font-family:Arial;'><li>
    <b>Domain Knowledge-Driven Approach:</b> Topics are predefined based on expert knowledge or business rules.  </li>
    <li><b>Data-Driven Approach:</b> Topics emerge organically from the data itself using unsupervised learning techniques.</li>
    </ol>
<p style = 'font-size:16px;font-family:Arial;'>
    For this analysis, we adopt the <b>data-driven approach</b>, allowing the structure of the dataset to define the topics rather than imposing predefined categories.  For this, we leverage the <b>semantic similarity</b> between text embeddings to group similar complaints. Instead of manually defining topics, we let a clustering algorithm <b>TD_KMEANS</b> discover natural groupings within the data.  
<br>
To ensure manageability, we limit our analysis to 5 clusters. After applying K-Means clustering to the complaint embeddings, we identify the centroids of these clusters, which represent the most central points of each topic group. To understand the nature of each cluster, we extract the 20 distinct complaints closest to each centroid, as these provide the most representative examples of the topic. Instead of manually assigning labels, we leverage a powerful large language model (LLM) to analyze these representative complaints and generate meaningful topic names.</p>
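<p style = 'font-size:16px;font-family:Arial;'>The idea of clustering and then picking the points nearest each centroid can be sketched locally in plain Python. This is only a toy illustration on 2-D points, not the in-database <code>KMeans</code> / <code>KMeansPredict</code> functions used below. The function names, the naive initialization (first <i>k</i> points) and the toy data are assumptions made for the example.</p>

```python
# Illustrative only: plain-Python K-Means on toy 2-D points, followed by
# selecting the points closest to each centroid as cluster "representatives".
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with a naive init (first k points). Returns (centroids, labels)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


def closest_to_centroid(points, centroids, labels, top_n):
    """Return, per cluster, the indices of the top_n points nearest to the centroid."""
    result = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        members.sort(key=lambda i: math.dist(points[i], centroid))
        result[c] = members[:top_n]
    return result


toy_points = [(0.0, 0.0), (10.0, 10.0), (0.5, 0.0), (0.0, 1.0), (10.0, 11.0), (12.0, 10.0)]
centroids, labels = kmeans(toy_points, k=2)
representatives = closest_to_centroid(toy_points, centroids, labels, top_n=2)
print(labels, representatives)
```

<p style = 'font-size:16px;font-family:Arial;'>In the notebook, the same two steps run inside Vantage on the 768-dimensional embeddings, with 5 clusters and the 20 closest complaints per cluster.</p>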

In [None]:
DF_embeddings = DataFrame(in_schema('DEMO_Telco_Complaints_Onnx', 'telco_consumer_embeddings'))

In [None]:
embedding_column_list = [col for col in DF_embeddings.columns if col not in ["row_id", "txt"]]

num_clusters = 5
kmeans_out = KMeans(
    id_column="row_id",
    data=DF_embeddings,
    target_columns=embedding_column_list,
    output_cluster_assignment=True,
    num_init=10,
    num_clusters=num_clusters,
    iter_max=50,
)

In [None]:
#print(kmeans_out.show_query())

<p style = 'font-size:16px;font-family:Arial;'>The output below shows cluster assignment for each row.</p>

In [None]:
kmeans_out.result

<p style = 'font-size:16px;font-family:Arial;'>Let's check how many data points each cluster has.</p>

In [None]:
kmeans_df = kmeans_out.result
d2 = kmeans_df.groupby('td_clusterid_kmeans').count()
d2

In [None]:
copy_to_sql(kmeans_out.model_data, "complaints_clustermodel", if_exists="replace")

In [None]:
# Get the distance of each message to its cluster centroid, then keep the 20 closest
# messages per cluster (dense_rank can return more than 20 rows when distances tie)
DF_clusterdistance = KMeansPredict(
    data = DF_embeddings,
    object = DataFrame("complaints_clustermodel"),
    output_distance = True   
).result


DF_clusterdistance = DF_clusterdistance.assign(
    rank_distance = DF_clusterdistance.td_distance_kmeans.window(
            partition_columns=DF_clusterdistance.td_clusterid_kmeans,
            order_columns=DF_clusterdistance.td_distance_kmeans
        ).dense_rank()
    )

DF_clusterdistance_top = DF_clusterdistance.loc[DF_clusterdistance.rank_distance<=20]

In [None]:
DF_clusterdistance_top

In [None]:
DF_topmesages = DF_clusterdistance_top.join(
    DF_complaints.select(["row_id","txt"]),
    how = "inner",
    on =  ["row_id = row_id"],
    lsuffix= "a"
).select(["td_clusterid_kmeans", "txt"]).drop_duplicate()
DF_topmesages

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>5. Visualization</b></p>
<p style = 'font-size:18px;font-family:Arial;'><b>5.1 WordCloud Visualization</b></p>

<p style = 'font-size:16px;font-family:Arial;'>Let's visualize each cluster as a word cloud built from its most representative complaints.</p>

In [None]:
# Word Cloud all those clusters
for i in range(0, num_clusters):
    filtered_df = DF_topmesages[DF_topmesages.td_clusterid_kmeans == i]
    df = filtered_df.to_pandas()
    total_rows = len(df)

    sw = ['x','xx','xxx','xxxx','xxxxx','xxxxxx']
    es = list(set(stopwords.words('english')))
    es.extend(sw)

    text_tokens = word_tokenize(' '.join(df['txt']),preserve_line=True)
    l_text_tokens = [item.lower() for item in text_tokens]
    tokens_without_sw = [word for word in l_text_tokens if word not in es]

    all_text = pd.Series(tokens_without_sw)

    vectorizer = TfidfVectorizer(ngram_range=(1,4))
    tfidf_matrix = vectorizer.fit_transform(all_text)

    col_sum = tfidf_matrix.sum(axis=0).A.squeeze()
    k = 5
    top_indices = np.argsort(col_sum)[-k:][::-1]

    dense = tfidf_matrix.todense()
    df = pd.DataFrame(dense, columns=vectorizer.get_feature_names_out())
    feature_names = vectorizer.get_feature_names_out()
    top_terms = feature_names[top_indices]
    cluster_name = '_'.join(top_terms).upper()

    print("Cluster #" + str(i)+": "+cluster_name+"\n # of Rows: "+str(total_rows)+"\n")
    # Generate a word cloud
    wordcloud = WordCloud(
        width=800,
        height=400,
        background_color='white',
        collocations=True,
        max_words=100,
        min_word_length=1,
    ).generate_from_frequencies(df.T.sum(axis=1))

    # Plot the word cloud
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='gaussian')
    plt.axis('off')
    plt.title('Word Cloud for Text Data (TF-IDF)')
    plt.show()

<hr style='height:2px;border:none;'> 

<p style='font-size:18px;font-family:Arial;'><b>5.2 Visualization of Clusters with Complaints</b></p> 

<p style='font-size:16px;font-family:Arial;'>The graph displays the clustering of complaints into distinct groups. The data has been divided into the 5 clusters chosen in section 4, each representing a distinct pattern or category of complaints. For plotting, the embeddings are projected onto two dimensions with t-SNE. This view helps to identify the most prevalent types of complaints, allowing for more targeted investigation and resolution efforts.</p>

In [None]:
clustered_df = DF_embeddings.join(
    other = kmeans_df,
    on = ["row_id"],
    how = "inner",
    lprefix="l",
    rprefix="r"
)

In [None]:
#adding complaint txt
clustered_df_txt = clustered_df.join(
    DF_complaints.select(["row_id","txt"]),
    how = "inner",
    on =  ["l_row_id = row_id"],
    lsuffix= "a"
).drop_duplicate()

In [None]:
# emb = DataFrame('kmeans_features').to_pandas()
clus = clustered_df_txt.to_pandas()

In [None]:
tsne = TSNE(n_components=2, random_state=123)
tsne_result = tsne.fit_transform(clus.iloc[:, 2:-3])

In [None]:
tsne_df = pd.DataFrame(tsne_result, columns=['tsne_1', 'tsne_2'])
tsne_df['cluster_id'] = clus['td_clusterid_kmeans']
tsne_df['complaint_id'] = clus['l_row_id']

In [None]:
import plotly.io as pio
pio.renderers.default = 'notebook' 

In [None]:
# Create a new DataFrame combining t-SNE results with complaint information
tsne_complaint_df = pd.DataFrame(tsne_result, columns=['tsne_1', 'tsne_2'])
tsne_complaint_df['cluster_id'] = clus['td_clusterid_kmeans']
tsne_complaint_df['complaint_id'] = clus['l_row_id']
tsne_complaint_df['complaint'] = clus['txt']

# Truncate text for hover data
max_chars = 50  # Maximum characters to display
tsne_complaint_df['truncated_complaint'] = clus['txt'].apply(lambda x: x[:max_chars] + '...' if len(x) > max_chars else x)

# Plot using Plotly Express
fig = px.scatter(tsne_complaint_df, x='tsne_1', y='tsne_2', color='cluster_id',
                 hover_data=['complaint_id', 'truncated_complaint', 'cluster_id'])

fig.update_traces(marker=dict(size=15))
fig.update_layout(
    title='t-SNE Visualization of Clusters with Complaints',
    xaxis_title='dimension-1',
    yaxis_title='dimension-2',
    xaxis=dict(tickangle=45),
    width=1000,
    height=800,
    hoverlabel=dict(
        bgcolor="white",
        font_size=16,
        font_family="Rockwell"
    ),
    autosize=False,
)

# Customize the hovertemplate
fig.update_traces(hovertemplate="<b>Complaint ID:</b> %{customdata[0]}<br>"
                                 "<b>Complaint:</b> %{customdata[1]}<br>"
                                 "<b>Cluster ID:</b> %{customdata[2]}<br><extra></extra>")

fig.show()

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>6. Get Topic Names by Asking an LLM</b>
<p style = 'font-size:16px;font-family:Arial;'>
To leverage the summarization capabilities of large-scale language models, we use a multi-billion parameter model to generate meaningful topic names based on representative complaints from each cluster. This step requires an OpenAI API key, as the model runs through an external API. If you don't have an OpenAI API key, use the pre-generated topic names below.<br>Also, feel free to play around with the prompt and see how this changes the cluster names.</p>
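<p style = 'font-size:16px;font-family:Arial;'>Before calling any API, it can help to see how one prompt per cluster is assembled from the representative complaints. The sketch below is a simplified, offline illustration: <code>toy_template</code>, <code>toy_messages</code> and <code>build_prompts</code> are hypothetical stand-ins, and no LLM is called.</p>

```python
# Illustrative only: assemble one prompt per cluster from representative messages.
toy_template = "Summarise the common topic of these messages:\n\n{messages}\n\n====\nTopic:"

# Hypothetical stand-in for (cluster id, complaint text) pairs.
toy_messages = [
    (0, "No signal at home since Monday."),
    (0, "Coverage drops whenever I enter the building."),
    (1, "I was billed twice for the same month."),
]


def build_prompts(template, messages):
    """Group messages by cluster id and fill the template once per cluster."""
    grouped = {}
    for cluster_id, text in messages:
        grouped.setdefault(cluster_id, []).append(text)
    return {
        cluster_id: template.format(messages="\n\n".join(texts))
        for cluster_id, texts in sorted(grouped.items())
    }


prompts = build_prompts(toy_template, toy_messages)
print(prompts[0])
```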

In [None]:
# set to True, if you have an OpenAI key
I_Have_an_OpenAI_API_Key = False

In [None]:
if I_Have_an_OpenAI_API_Key:
    import os, getpass
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OPENAI API KEY")

In [None]:
if I_Have_an_OpenAI_API_Key:
    prompt_template = """Your task is to identify the common topic of a set of messages that have similar vector embeddings.
    Your answer should be exactly one sentence, at most 10 words long, summarising the topic. You can skip unnecessary filler words.
    The answer should not start with "The common topic of the messages is", or "the topic is", or "Customers are complaining" etc.
    
    Here are the messages:
    
    {messages}
    
    ====
    Topic:
    """

In [None]:
df_topmessages=DF_topmesages.to_pandas()

In [None]:
if I_Have_an_OpenAI_API_Key:
    from openai import OpenAI
    results =  {}
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    
    for i in range(num_clusters):
        cluster_feedback = '\n\n'.join(df_topmessages[df_topmessages['td_clusterid_kmeans'] == i]['txt'])
        this_prompt = prompt_template.format(messages = cluster_feedback)
        try:
            chat_completion = client.chat.completions.create(
                messages=[
                    {
                        "role": "user",
                        "content": this_prompt,
                    }
                ],
                model="gpt-4o",
                temperature=0,
                max_tokens=4096
            )
            results[i] = chat_completion.choices[0].message.content.strip()
        except Exception as e:
            raise ValueError(f"Failed to call OpenAI API: {str(e)}") from e
    

In [None]:
if not I_Have_an_OpenAI_API_Key:
    # Pre-defined topics
    results = {
        0: 'Network Coverage',
        1: 'Call Drops',
        2: 'Internet Speed',
        3: 'Billing Errors',
        4: 'Overcharging'
    }
results

In [None]:
# Convert dict → DataFrame with 2 columns
df = pd.DataFrame(list(results.items()), columns=["id", "txt"])
copy_to_sql(df,table_name='telco_topics_of_interest', if_exists='replace', index=False)

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>7. Generate Embeddings for Topics and Get Similarity</b>


<p style = 'font-size:16px;font-family:Arial;'>
Now that we have abstracted topics from the data, we need to generate embeddings for them. This step is crucial because, in the next phase, we will calculate the <b>pairwise similarity</b> between complaints and topics, effectively computing a <b>Cartesian product</b> of all complaint-topic pairs.</p>


<hr style="height:2px;border:none;">
<p style = 'font-size:18px;font-family:Arial;'><b>7.1 Generating Embedding for Topics Data</b></p>
<p style = 'font-size:16px;font-family:Arial;'>We generate the embeddings for the topics with the same <b>ONNXEmbeddings</b> call used in section 3.1, reusing the model and tokenizer loaded in section 2.</p>

In [None]:
df_topic = ONNXEmbeddings(
    newdata = DataFrame('telco_topics_of_interest'),
    modeldata = my_model, 
    tokenizerdata = my_tokenizer, 
    accumulate = ["id", "txt"],
    model_output_tensor = "sentence_embedding",
    output_format = f'FLOAT32({number_dimensions_output})',
    enable_memory_check = False
).result

In [None]:
df_topic

<p style = 'font-size:16px;font-family:Arial;'> As we can see from the above, we have generated embeddings for the topic data.</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:18px;font-family:Arial;'><b>7.2 Semantic Similarity</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Next, we compute the semantic similarity between the topic embeddings and the complaint embeddings. Vector distance measures the similarity or dissimilarity between two vectors in a multidimensional space. We use Vantage's <b>TD_VectorDistance</b> function, which accepts a table of target vectors and a table of reference vectors and returns a table containing the distance between each target-reference pair. Here the complaints are the targets and the topics are the references, and the cosine distance is converted into a similarity as 1 - distance.</p>
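<p style = 'font-size:16px;font-family:Arial;'>To make the pairwise computation concrete, here is a minimal local sketch on toy 2-D vectors. It assumes the common definition of cosine distance as 1 minus the cosine of the angle between two vectors, which is consistent with the <code>similarity = 1 - distance</code> step in the next cell. The ids, vectors and the <code>cosine_distance</code> helper are made up for illustration and this is not the in-database implementation.</p>

```python
# Illustrative only: pairwise cosine distance between "targets" and "references",
# returned in a long format with one row per target-reference pair.
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


toy_targets = {101: (1.0, 0.0), 102: (0.0, 2.0)}   # stand-ins for complaint embeddings
toy_references = {0: (1.0, 0.0), 1: (1.0, 1.0)}     # stand-ins for topic embeddings

# Cartesian product of all target-reference pairs.
pairs = [
    {"target_id": t, "reference_id": r, "distance": cosine_distance(tv, rv)}
    for t, tv in toy_targets.items()
    for r, rv in toy_references.items()
]
for row in pairs:
    row["similarity"] = 1.0 - row["distance"]
    print(row)
```

<p style = 'font-size:16px;font-family:Arial;'>With 2 targets and 2 references the result has 4 rows. In the notebook the row count is the number of complaints times the number of topics.</p>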

In [None]:
DF_new_vectdist = VectorDistance(
        target_data = DF_embeddings,
        target_id_column = "row_id",
        reference_data = df_topic,
        ref_id_column = "id",
        distance_measure= "COSINE",    
        target_feature_columns="emb_0:emb_767",
        ref_feature_columns="emb_0:emb_767",
        volatile = True
    ).result.join(
         DF_complaints,
        on = ["target_id = row_id"],
        how = "left").join(
         df_topic.select(["id","txt"]),
         on = ["reference_id = id"],
         how = "left",rsuffix = 'topic')
DF_new_vectdist=DF_new_vectdist.assign(similarity = 1.0-DF_new_vectdist.distance)
DF_new_vectdist

In [None]:
#displaying the top 2 records for each reference_id from the similarity result created
window = DF_new_vectdist.window(partition_columns="reference_id",
                           order_columns="similarity",
                           sort_ascending=False)
df = window.rank()
df[df.col_rank.isin([1,2])].sort(['reference_id','col_rank']).head(10)

<hr style="height:2px;border:none;">
<p style = 'font-size:18px;font-family:Arial;'><b>7.3 Interactive Dashboard for BI Reporting</b></p>
<p style = 'font-size:16px;font-family:Arial;'>As a final step, we build a dashboard designed to serve as a <b>business intelligence (BI) reporting tool</b>, allowing us to analyze how topic prevalence changes over time. This interactive dashboard provides a structured way to explore complaint trends and refine topic detection dynamically.</p>
<p style = 'font-size:16px;font-family:Arial;'>Dashboard requirements:</p>
<ul style = 'font-size:16px;font-family:Arial;'>
    <li><b>Visualizing topic trends:</b> Display the number of complaints per topic per month using a <b>multi-line chart</b>, including only those complaints with a similarity score above a defined threshold (default: <b>0.6</b>).</li>
    <li><b>Dynamic threshold adjustment:</b> Allow users to modify the similarity threshold, automatically updating the visualization in real time.</li>
</ul>
<p style = 'font-size:16px;font-family:Arial;'>
The dashboard logic is encapsulated in the <code>topics_widget.py</code> module.</p>
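<p style = 'font-size:16px;font-family:Arial;'>The aggregation behind such a chart can be sketched in a few lines. This is not the code in <code>topics_widget.py</code>, which is not shown here. It is a minimal illustration, on made-up rows, of counting complaints per month and topic above a similarity threshold. The name <code>topic_counts</code> and the row layout are assumptions.</p>

```python
# Illustrative only: count complaints per (month, topic) above a similarity threshold.
from collections import Counter

toy_rows = [
    # (row_id, topic, similarity, year_month)
    (1, "Billing Errors", 0.72, "2024-01"),
    (2, "Billing Errors", 0.55, "2024-01"),
    (3, "Call Drops", 0.81, "2024-01"),
    (4, "Billing Errors", 0.64, "2024-02"),
]


def topic_counts(rows, threshold=0.6):
    """Count rows per (year_month, topic) whose similarity exceeds the threshold."""
    return Counter(
        (year_month, topic)
        for _, topic, similarity, year_month in rows
        if similarity > threshold
    )


counts = topic_counts(toy_rows)                  # default threshold of 0.6
stricter = topic_counts(toy_rows, threshold=0.7)  # raising the threshold drops weaker matches
print(counts)
print(stricter)
```

<p style = 'font-size:16px;font-family:Arial;'>Raising the threshold keeps only the closest complaint-topic matches, so each line in the chart shows fewer but more clearly on-topic complaints.</p>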

In [None]:
DF_new_vectdist.assign(
             topic_id = DF_new_vectdist.reference_id,
             topic = DF_new_vectdist.txt_topic,
             year_month = col("td_month_begin(complaint_date)")
            ).select(["row_id", "topic_id","topic","similarity", "year_month"  ]).to_sql("consumer_complaint_topic_similarity", if_exists = "replace",
            primary_index = ["year_month", "topic_id"], 
            types= {"year_month":DATE},
            temporary = False)

In [None]:
DF_new_similarity = DataFrame("consumer_complaint_topic_similarity")

In [None]:
%run utils/topics_distance.py
%run utils/topics_widget.py

In [None]:
get_complaints_app()

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>Conclusion</b></p>
<p style = 'font-size:16px;font-family:Arial;'>In this demo we have seen how a <b>fully data-driven approach</b> can help analyze large volumes of text data, automatically identifying topics and tracking their trends over time. Instead of relying on <b>prompt engineering</b> to classify each message, which can be inconsistent, expensive, and hard to scale, we used <b>embeddings, clustering and Vector Distance</b> to get a <b>deterministic and repeatable</b> solution.<br>
By applying <b>K-Means clustering</b> to the complaint embeddings, we discovered topics without predefining them. A <b>large language model (LLM)</b> then generated human-readable names for these clusters, but only once per cluster, keeping costs low while still benefiting from its summarization power. From there, we converted the topic names into embeddings and calculated <b>vector similarities</b>, allowing us to map messages to topics in a <b>scalable and automated</b> way.<br>The final step was building an <b>interactive BI dashboard</b> that lets users explore topic trends over time and adjust similarity thresholds.<br>
This approach combines the flexibility of unsupervised learning, the power of embeddings, and the practicality of interactive reporting, all while keeping things <b>scalable, cost-efficient, and environmentally friendly</b>.
</p> 

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>8. Cleanup</b>

<p style = 'font-size:18px;font-family:Arial;'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Clean up the work tables to prevent errors on the next run.</p>

In [None]:
tables = ['embeddings_models','embeddings_tokenizers','complaints_clustermodel','telco_topics_of_interest','consumer_complaint_topic_similarity']

# Loop through the list of tables and execute the drop table command for each table
for table in tables:
    try:
        db_drop_table(table_name=table)
    except:
        pass


<hr style="height:2px;border:none;">
<p style = 'font-size:18px;font-family:Arial;'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial;'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_Telco_Complaints_Onnx');"        # Takes 10 seconds

In [None]:
remove_context()

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2025. All Rights Reserved
        </div>
    </div>
</footer>