<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Sentiment Analysis for Banking Customer Churn Data
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<hr style="height:2px;border:none">
<p style = 'font-size:18px;font-family:Arial'><b>Import the required libraries</b></p>

<p style = 'font-size:16px;font-family:Arial'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

<div class="alert alert-block alert-warning">
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: </b><i>Please execute the Step1 notebook before executing this notebook.</i></p>
</div>

In [None]:
import warnings
warnings.filterwarnings('ignore')

import os
import pandas as pd
import teradataml as tdml
import getpass

from teradataml import (
    DataFrame,
    in_schema,
    create_context,
    ONNXEmbeddings,
    delete_byom, 
    display,
    execute_sql,
    save_byom,
    configure,
)

In [None]:
tdml.configure.val_install_location = "val"
tdml.configure.byom_install_location = "mldb"

<hr style="height:2px;border:none">
<b style = 'font-size:20px;font-family:Arial'>1. Initiate a connection to Vantage</b>
<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

<hr style="height:1px;border:none">

<b style = 'font-size:18px;font-family:Arial'>1.1 Drop Tables (if they exist)</b>
<p style = 'font-size:16px;font-family:Arial'>Now attempt to drop the <code>complaint_embeddings_store</code> and <code>complaints</code> tables, ignoring errors if they don't exist.</p>
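<p style = 'font-size:16px;font-family:Arial'>As a minimal sketch of the "drop and ignore errors" pattern, the helper below records which tables were dropped and which were skipped instead of silently discarding every error. The names <code>drop_tables</code>, <code>run_sql</code> and <code>fake_run_sql</code> are hypothetical and used only for illustration; in this notebook the executor would be <code>tdml.execute_sql</code>.</p>

```python
def drop_tables(run_sql, table_names):
    """Try to drop each table; return (dropped, skipped) lists of names."""
    dropped, skipped = [], []
    for name in table_names:
        try:
            run_sql(f"DROP TABLE {name};")
            dropped.append(name)
        except Exception as err:  # for example, the table does not exist
            skipped.append(name)
            print(f"Skipped {name}: {err}")
    return dropped, skipped


# Stand-in executor (assumption, for illustration only): pretends that
# only one of the two tables currently exists.
existing = {"complaint_embeddings_store"}


def fake_run_sql(statement):
    name = statement.split()[2].rstrip(";")
    if name not in existing:
        raise RuntimeError(f"table {name} does not exist")
    existing.remove(name)


dropped, skipped = drop_tables(
    fake_run_sql, ["complaint_embeddings_store", "complaints"]
)
print("dropped:", dropped)
print("skipped:", skipped)
```

<p style = 'font-size:16px;font-family:Arial'>Printing the skipped tables keeps the cell tolerant of missing tables while still surfacing unexpected failures such as permission errors.</p>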

In [None]:
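sql_statements = ['DROP TABLE complaint_embeddings_store;', 'DROP TABLE complaints;']

for statement in sql_statements:
    try:
        tdml.execute_sql(statement)
    except Exception:
        # The table may not exist yet; ignore the error and continue.
        pass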

<p style = 'font-size:18px;font-family:Arial'><b>1.2 Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but requires available space. The following cell loads the data into local storage.</p>

In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_BankChurnIVSM_local');"

In [None]:
%run -i ../run_procedure.py "call space_report();"        # Takes 10 seconds

In [None]:
tdf = DataFrame(in_schema('DEMO_BankChurnIVSM', 'Complaints'))
tdf

In [None]:
tdf = tdf.assign(txt=tdf.Customer_Complaint)

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>2. Load HuggingFace Model</b></p>
<p style = 'font-size:16px;font-family:Arial;'>To generate embeddings, we need an ONNX model capable of transforming text into vector representations. We use a pretrained model from <a href = 'https://huggingface.co/Teradata/bge-base-en-v1.5'>Teradata's Hugging Face repository</a>, in this case <code>bge-base-en-v1.5</code>. The model and its tokenizer are downloaded and stored in Vantage tables as BLOBs using the <code>save_byom</code> function.</p>

In [None]:
import os
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"

In [None]:
from huggingface_hub import hf_hub_download

model_name = "bge-base-en-v1.5"
number_dimensions_output = 768
model_file_name = "model.onnx"

In [None]:
# Step 1: Download Model from Teradata HuggingFace Page

hf_hub_download(repo_id=f"Teradata/{model_name}", filename=f"onnx/{model_file_name}", local_dir="./")
hf_hub_download(repo_id=f"Teradata/{model_name}", filename="tokenizer.json", local_dir="./")

<hr style="height:1px;border:none">
<p style = 'font-size:18px;font-family:Arial'><b>2.1 Save the Model</b></p>
<p style = 'font-size:16px;font-family:Arial'>The ONNX model and its tokenizer are now available locally. Next, we drop any earlier copies of the model tables and save both files to Vantage.</p>
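<p style = 'font-size:16px;font-family:Arial'>Before saving, it can help to confirm that the downloaded files are actually present. The sketch below uses only the standard library; the helper name <code>missing_files</code> is hypothetical, and the relative paths <code>onnx/model.onnx</code> and <code>tokenizer.json</code> are the ones used in this notebook. The demonstration runs in a temporary directory so it does not depend on the download having happened.</p>

```python
import tempfile
from pathlib import Path


def missing_files(paths, base_dir="."):
    """Return the relative paths that are missing or empty under base_dir."""
    base = Path(base_dir)
    missing = []
    for rel_path in paths:
        candidate = base / rel_path
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(rel_path)
    return missing


# Demonstration in a throwaway directory that contains only the tokenizer.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "tokenizer.json").write_text("{}")
    demo_missing = missing_files(["onnx/model.onnx", "tokenizer.json"], base_dir=tmp)

print("missing:", demo_missing)
```

<p style = 'font-size:16px;font-family:Arial'>In this notebook, calling <code>missing_files(["onnx/model.onnx", "tokenizer.json"])</code> from the working directory should return an empty list once both downloads have completed.</p>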

In [None]:
# Drop earlier copies of the model and tokenizer tables; ignore errors if they do not exist.
for table_name in ("embeddings_models", "embeddings_tokenizers"):
    try:
        tdml.db_drop_table(table_name)
    except Exception:
        pass

In [None]:
# Step 2: Load Models into Vantage
# a) Embedding model
save_byom(
    model_id = model_name,  # must be unique in the models table
    model_file = f"onnx/{model_file_name}",
    table_name = 'embeddings_models',
)
# b) Tokenizer
save_byom(
    model_id = model_name,  # must be unique in the tokenizers table
    model_file = 'tokenizer.json',
    table_name = 'embeddings_tokenizers',
)

<p style = 'font-size:16px;font-family:Arial;'>Verify that the model and tokenizer were installed.</p>

In [None]:
df_model = DataFrame('embeddings_models')
df_model

In [None]:
df_token = DataFrame('embeddings_tokenizers')
df_token

<p style = 'font-size:16px;font-family:Arial'>Load the model and tokenizer that we saved to the database above by passing the model ID.</p>

In [None]:
my_model = DataFrame.from_query(f"select * from embeddings_models where model_id = '{model_name}'")
my_tokenizer = DataFrame.from_query(f"select model as tokenizer from embeddings_tokenizers where model_id = '{model_name}'")

<hr style="height:2px;border:none">
<b style = 'font-size:20px;font-family:Arial'>3. Generate embeddings for Complaints</b>

<p style = 'font-size:16px;font-family:Arial'>This code generates the embeddings for the first 100 complaints using the <code>ONNXEmbeddings</code> in-database function.</p>

In [None]:
tdml.configure.byom_install_location = "mldb"

In [None]:
DF_embeddings_complaints = ONNXEmbeddings(
    newdata = tdf.iloc[:100],
    modeldata = my_model, 
    tokenizerdata = my_tokenizer, 
    accumulate = ["CustomerId", "Customer_Complaint"],
    model_output_tensor = "sentence_embedding",
    output_format = f'FLOAT32({number_dimensions_output})',
    enable_memory_check = False
).result

<p style = 'font-size:16px;font-family:Arial'>The embeddings are now generated. Let's copy them to the database for further use.</p>
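<p style = 'font-size:16px;font-family:Arial'>With <code>output_format</code> set to <code>FLOAT32(768)</code>, the query later in this notebook addresses the vector as one column per dimension, from <code>emb_0</code> to <code>emb_767</code>. The sketch below derives those column names and the column-range string from the dimension count instead of hard-coding them. The helper names are hypothetical, and the <code>emb_</code> prefix is taken from that query rather than from any documented guarantee.</p>

```python
def embedding_columns(n_dims, prefix="emb_"):
    """Return the per-dimension column names, e.g. emb_0 ... emb_767."""
    if n_dims < 1:
        raise ValueError("n_dims must be a positive integer")
    return [f"{prefix}{i}" for i in range(n_dims)]


def feature_range(n_dims, prefix="emb_"):
    """Return the '[first:last]' column-range string for the feature columns."""
    columns = embedding_columns(n_dims, prefix)
    return f"[{columns[0]}:{columns[-1]}]"


number_dimensions_output = 768  # same value as used earlier in this notebook
columns = embedding_columns(number_dimensions_output)
print(len(columns), columns[0], columns[-1])
print(feature_range(number_dimensions_output))
```

<p style = 'font-size:16px;font-family:Arial'>Deriving the range this way keeps the SQL consistent if a model with a different output dimension is substituted.</p>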

In [None]:
tdml.copy_to_sql(DF_embeddings_complaints,table_name='complaint_embeddings_store', if_exists='replace', index=False)

tdf_complaint_embeddings_store = tdml.DataFrame('complaint_embeddings_store')

In [None]:
tdf_complaint_embeddings_store

<hr style="height:2px;border:none">
<b style = 'font-size:20px;font-family:Arial'>4. Generate embeddings for Sentiments</b>

<p style = 'font-size:16px;font-family:Arial'>For sentiment analysis, we create a small table of sentiment descriptions and then generate embeddings for them.</p>

In [None]:
sent_df = pd.DataFrame({'id': [1,2],
      'txt': ['Positive and Upbeat comment',
              'Negative or Abusive comment',
              ]})

tdml.copy_to_sql(sent_df,table_name='sentiment_topics', if_exists='replace', index=False)

In [None]:
tdf_sent = tdml.DataFrame('sentiment_topics')

In [None]:
tdf_sent

In [None]:
DF_embeddings_sent = ONNXEmbeddings(
    newdata = tdf_sent,
    modeldata = my_model, 
    tokenizerdata = my_tokenizer, 
    accumulate = ["id", "txt"],
    model_output_tensor = "sentence_embedding",
    output_format = f'FLOAT32({number_dimensions_output})',
    enable_memory_check = False
).result

In [None]:
DF_embeddings_sent

In [None]:
try:
    tdml.db_drop_table("sentiment_topics_embeddings_store")
except Exception:
    # The table may not exist yet; ignore the error.
    pass

<p style = 'font-size:16px;font-family:Arial'>The embeddings for the sentiments are now generated. Let's copy them to the database for further use.</p>

In [None]:
tdml.copy_to_sql(DF_embeddings_sent,table_name='sentiment_topics_embeddings_store', if_exists='replace', index=False)
tdf_sentiment_topics_embeddings = tdml.DataFrame('sentiment_topics_embeddings_store')

In [None]:
try:
    tdml.db_drop_table("semantic_search_results")
except Exception:
    # The table may not exist yet; ignore the error.
    pass

<hr style="height:2px;border:none">
<b style = 'font-size:20px;font-family:Arial'>5. Run Semantic Search using Teradata Vantage's in-DB function - VectorDistance</b>

<p style = 'font-size:16px;font-family:Arial'>
The following cell creates the <code>semantic_search_results</code> table by finding the most similar sentiment topic for each complaint, using cosine similarity between their embeddings.
</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The TD_VectorDistance function accepts a table of target vectors and a table of reference vectors and returns a table that contains the distance between target-reference pairs.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>If you provide only one table as input, the function computes the distances between target and reference vectors taken from that same table.</p>
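<p style = 'font-size:16px;font-family:Arial'>To make the matching step concrete, here is a small pure-Python sketch of cosine distance with a top-1 match per target, converting distance to similarity as <code>1 - distance</code> in the same way as the query in this notebook. It illustrates the idea only and is not how <code>TD_VectorDistance</code> is implemented; the function names and the two-dimensional toy vectors are assumptions made for the example.</p>

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def best_match(target, references):
    """Return (reference_id, similarity) for the closest reference vector."""
    ref_id = min(references, key=lambda r: cosine_distance(target, references[r]))
    return ref_id, 1.0 - cosine_distance(target, references[ref_id])


# Toy data: two reference "sentiment" vectors and two target "complaint" vectors.
references = {1: [1.0, 0.0], 2: [0.0, 1.0]}
targets = {101: [0.9, 0.1], 102: [0.2, 0.8]}

results = {target_id: best_match(vec, references) for target_id, vec in targets.items()}
for target_id, (ref_id, similarity) in results.items():
    print(target_id, ref_id, round(similarity, 3))
```

<p style = 'font-size:16px;font-family:Arial'>Returning only the closest reference corresponds to <code>topk(1)</code>; keeping the k smallest distances instead would give several candidate labels per complaint.</p>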

In [None]:
emb_col_names = tdf_sentiment_topics_embeddings.columns[2:]

tdml.execute_sql(f"""
create multiset table semantic_search_results
as (
SELECT 
    dt.target_id,
    dt.reference_id,
    e_tgt.Customer_Complaint as target_txt,
    e_ref.txt as reference_txt,
    (1.0 - dt.distance) as similarity 
FROM
    TD_VECTORDISTANCE (
        ON complaint_embeddings_store  AS TargetTable
        ON sentiment_topics_embeddings_store AS ReferenceTable DIMENSION
        USING
            TargetIDColumn('CustomerId')
            TargetFeatureColumns('[emb_0:emb_767]')
            RefIDColumn('id')
            RefFeatureColumns('[emb_0:emb_767]')
            DistanceMeasure('cosine')
            topk(1) -- Only want the best match per complaint. If you want multi-label/multi-class - you can increase it
    ) AS dt
JOIN complaint_embeddings_store e_tgt on e_tgt.CustomerId = dt.target_id
JOIN sentiment_topics_embeddings_store e_ref on e_ref.id = dt.reference_id
) with data
""")

In [None]:
df = tdml.DataFrame('semantic_search_results')
df[df['reference_txt'] == 'Negative or Abusive comment']

<hr style="height:2px;border:none">
<b style = 'font-size:20px;font-family:Arial'>6. Cleanup</b>
<p style = 'font-size:16px;font-family:Arial'>The following code will remove the context.</p>

In [None]:
tdml.remove_context()

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>Dataset:</b>

- `CustomerId`: Customer ID
- `Customer_Complaint`: Complaint text

<p style = 'font-size:16px;font-family:Arial'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Teradataml Python reference: <a href = 'https://docs.teradata.com/search/all?query=Python+Package+User+Guide&content-lang=en-US'>here</a></li>
</ul>

<footer style="padding-bottom:35px; border-bottom:3px solid">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2025. All Rights Reserved
        </div>
    </div>
</footer>