<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       IVSM Banking Customer Churn Embeddings Setup
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<hr style="height:2px;border:none">
<p style = 'font-size:18px;font-family:Arial'><b>Import the required libraries</b></p>

<p style = 'font-size:16px;font-family:Arial'>Here, we import the required libraries, set environment variables and environment paths (if required).</p>

In [None]:
import warnings
warnings.filterwarnings('ignore')

import os
import pandas as pd
import teradataml as tdml
import getpass

from teradataml import (
    DataFrame,
    in_schema,
    create_context
)

In [None]:
tdml.configure.val_install_location = "val"

<hr style="height:2px;border:none">
<b style = 'font-size:20px;font-family:Arial'>1. Initiate a connection to Vantage</b>
<p style = 'font-size:16px;font-family:Arial'>You will be prompted to provide the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

<hr style="height:2px;border:none">
<p style = 'font-size:20px;font-family:Arial'><b>2. Confirmation for functions</b></p>
<p style = 'font-size:16px;font-family:Arial'>Before starting, let us confirm that the required functions are installed.</p>

In [None]:
from IPython.display import display, Markdown

df_check = DataFrame.from_query('''select count(*) as cnt from dbc.tablesV where databasename = 'ivsm';''')
if df_check.get_values()[0][0] >= 10:
    print('Functions are installed, please continue.')
else:
    print('Functions are not installed. Please run the Initialization notebook before proceeding.')
    display(Markdown("[Initialization Notebook](./IVSM_Banking_Customer_Churn_Model_Install.ipynb)"))

<b style = 'font-size:18px;font-family:Arial'>2.1 Drop Tables (if exist)</b>
<p style = 'font-size:16px;font-family:Arial'>Attempts to drop <code>complaint_embeddings_store</code> and <code>complaints</code> tables, ignoring errors if they don't exist.</p>

In [None]:
SQL = ['''DROP TABLE complaint_embeddings_store;''','''DROP TABLE complaints;''']

for i in SQL:
    try:
        tdml.execute_sql(i)
    except Exception:
        pass  # ignore the error, e.g. when the table does not exist

<p style = 'font-size:18px;font-family:Arial'><b>2.2 Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. You can either access it through foreign tables, which uses no storage in your environment, or download it to local storage, which may run faster but consumes available space. The following cell loads the local copy (<code>DEMO_BankChurnIVSM_local</code>).</p>

In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_BankChurnIVSM_local');"

In [None]:
%run -i ../run_procedure.py "call space_report();"        # Takes 10 seconds

In [None]:
tdf = DataFrame(in_schema('DEMO_BankChurnIVSM', 'Complaints'))
tdf

<hr style="height:1px;border:none">
<b style = 'font-size:18px;font-family:Arial'>3. Creation of the view with tokenized original texts</b>

<p style = 'font-size:16px;font-family:Arial'>This code creates a view named <code>v_pdf_tokenized_for_embeddings</code> that contains tokenized customer complaint data ready for embedding. It selects the <code>id</code>, <code>txt</code> (complaint text), <code>input_ids</code> (token IDs), and <code>attention_mask</code> columns returned by the tokenization function <code>ivsm.tokenizer_encode</code>, and limits the result to 1,000 rows (<code>top 1000</code>).</p>

In [None]:
tdml.execute_sql("""

Replace view v_pdf_tokenized_for_embeddings as (
    select
        top 1000 id,
        txt,
        IDS as input_ids,
        attention_mask
    from ivsm.tokenizer_encode(
        on (select CustomerId as id,
        Customer_Complaint as txt from DEMO_BankChurnIVSM.Complaints)
        on (select model as tokenizer 
            from embeddings_tokenizers where model_id = 'bge-small-en-v1.5')
            DIMENSION
        USING
            ColumnsToPreserve('id', 'txt')
            OutputFields('IDS', 'ATTENTION_MASK')
            MaxLength(1024)
            PadToMaxLength('True')
            TokenDataType('INT64')
    ) a
)
""")

In [None]:
tdml.DataFrame('v_pdf_tokenized_for_embeddings').head()
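
<p style = 'font-size:16px;font-family:Arial'>To make the <code>input_ids</code> and <code>attention_mask</code> columns more concrete, here is a minimal local sketch of padding to a fixed length. It is a toy illustration only: the whitespace splitting, the <code>toy_vocab</code> dictionary, the <code>PAD_ID</code>/<code>UNK_ID</code> values, and the <code>encode</code> helper are all hypothetical and do not reproduce the actual <code>bge-small-en-v1.5</code> tokenizer or the internals of <code>ivsm.tokenizer_encode</code>.</p>

```python
# Toy illustration only: a whitespace "tokenizer" with a made-up vocabulary.
# The real view uses the bge-small-en-v1.5 tokenizer, which works differently.
PAD_ID = 0  # hypothetical ID used for padding positions
UNK_ID = 1  # hypothetical ID used for words missing from the vocabulary


def encode(text, vocab, max_length):
    """Return (input_ids, attention_mask), truncated or padded to max_length."""
    ids = [vocab.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:max_length]            # truncate inputs that are too long
    mask = [1] * len(ids)             # 1 marks a real token
    padding = max_length - len(ids)
    ids += [PAD_ID] * padding         # pad short inputs up to max_length
    mask += [0] * padding             # 0 marks padding the model should ignore
    return ids, mask


toy_vocab = {"my": 2, "card": 3, "was": 4, "declined": 5}
input_ids, attention_mask = encode("My card was declined twice", toy_vocab, max_length=8)
print(input_ids)
print(attention_mask)
```

<p style = 'font-size:16px;font-family:Arial'>Every row ends up with vectors of the same length, which is the effect of <code>PadToMaxLength('True')</code> in the view above; the mask tells the model which positions hold real tokens.</p>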

<hr style="height:1px;border:none">
<b style = 'font-size:18px;font-family:Arial'>3.1 Creation of the view with calculated binary embeddings</b>

<p style = 'font-size:16px;font-family:Arial'>This code creates a view named <code>complaint_embeddings</code> that computes the embeddings (vector representations) of the customer complaint texts. The embeddings are generated by the <code>ivsm.IVSM_score</code> function, which scores the tokenized input with the specified ONNX model and returns each embedding as a binary column, <code>sentence_embedding</code>.</p>

In [None]:
tdml.execute_sql("""
Replace view complaint_embeddings as (
    select 
            *
    from ivsm.IVSM_score(
            on v_pdf_tokenized_for_embeddings  -- table with data to be scored
            on (select * from embeddings_models where model_id = 'bge-small-en-v1.5') dimension
            using
                ColumnsToPreserve('id', 'txt') -- columns to be copied from input table
                ModelType('ONNX') -- model format
                BinaryInputFields('input_ids', 'attention_mask') -- enables binary input vectors
                BinaryOutputFields('sentence_embedding')
                Caching('inquery') -- turn on model caching within the query
        ) a 
)

""")

In [None]:
tdml.DataFrame('complaint_embeddings').head(2)

<hr style="height:1px;border:none">
<p style = 'font-size:18px;font-family:Arial'><b>3.2 Creating Final Embeddings table</b></p>
<p style = 'font-size:16px;font-family:Arial'>In this step, we create the final embeddings table by expanding each binary embedding vector into one column per dimension (<code>emb_0</code> through <code>emb_383</code>).</p>

In [None]:
tdml.execute_sql("""
create table complaint_embeddings_store as (
    select 
            *
    from ivsm.vector_to_columns(
            on complaint_embeddings
            using
                ColumnsToPreserve('id', 'txt') 
                VectorDataType('FLOAT32')
                VectorLength(384)
                OutputColumnPrefix('emb_')
                InputColumnName('sentence_embedding')
        ) a 
) with data

""")

In [None]:
tdml.DataFrame('complaint_embeddings_store').head()
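
<p style = 'font-size:16px;font-family:Arial'>The idea behind converting a binary vector into separate columns can be sketched locally with the Python standard library. This is an illustration under stated assumptions, not the implementation of <code>ivsm.vector_to_columns</code>: the <code>unpack_embedding</code> helper is hypothetical, the little-endian byte order is assumed, and the example uses a 4-value vector instead of the 384 values produced in this demo.</p>

```python
import struct


def unpack_embedding(blob, length, prefix="emb_"):
    """Unpack a packed float32 vector into a {column_name: value} dict."""
    # Assumption: values are stored as consecutive little-endian 32-bit floats.
    values = struct.unpack(f"<{length}f", blob)
    return {f"{prefix}{i}": value for i, value in enumerate(values)}


# Hypothetical 4-dimensional embedding packed into bytes.
blob = struct.pack("<4f", 0.5, -0.25, 1.0, 0.0)

# Preserve the id and txt columns, then add one column per vector element.
row = {"id": 1, "txt": "example complaint"}
row.update(unpack_embedding(blob, length=4))
print(row)
```

<p style = 'font-size:16px;font-family:Arial'>Having one numeric column per dimension is what lets later steps refer to the embedding as a column range such as <code>[emb_0:emb_383]</code>.</p>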

In [None]:
sent_df = pd.DataFrame({'id': [1,2],
      'txt': ['Positive and Upbeat comment',
              'Negative or Abusive comment',
              ]})

tdml.copy_to_sql(sent_df,table_name='sentiment_topics', if_exists='replace', index=False)

In [None]:
tdml.DataFrame('sentiment_topics').head()

<hr style="height:1px;border:none">
<b style = 'font-size:18px;font-family:Arial'>3.3 Create Tokenized View</b>

<p style = 'font-size:16px;font-family:Arial'>Creates a view <code>v_sentiment_tokenized_for_embeddings</code> by applying a tokenizer to the <code>sentiment_topics</code> table using the specified model.</p>

In [None]:
tdml.execute_sql("""
replace view v_sentiment_tokenized_for_embeddings as (
    select
        id,
        txt,
        IDS as input_ids,
        attention_mask
    from ivsm.tokenizer_encode(
        on (select * from sentiment_topics)
        on (select model as tokenizer from embeddings_tokenizers where model_id = 'bge-small-en-v1.5') DIMENSION
        USING
            ColumnsToPreserve('id', 'txt')
            OutputFields('IDS', 'ATTENTION_MASK')
            MaxLength(1024)
            PadToMaxLength('True')
            TokenDataType('INT64')
    ) a
)
""")

In [None]:
tdml.DataFrame('v_sentiment_tokenized_for_embeddings').head()

<p style = 'font-size:16px;font-family:Arial'>Defines <code>sentiment_topics_embeddings</code> view by generating sentence embeddings using the <code>IVSM_score</code> function and a specified ONNX model.</p>

In [None]:
tdml.execute_sql("""
replace view sentiment_topics_embeddings as (
    select 
            *
    from ivsm.IVSM_score(
            on v_sentiment_tokenized_for_embeddings  -- table with data to be scored
            on (select * from embeddings_models where model_id = 'bge-small-en-v1.5') dimension
            using
                ColumnsToPreserve('id', 'txt') -- columns to be copied from input table
                ModelType('ONNX') -- model format
                BinaryInputFields('input_ids', 'attention_mask') -- enables binary input vectors
                BinaryOutputFields('sentence_embedding')
                Caching('inquery') -- turn on model caching within the query
        ) a 
)
""")

In [None]:
tdml.DataFrame('sentiment_topics_embeddings').head()

In [None]:
try:
    tdml.db_drop_table("sentiment_topics_embeddings_store")
except Exception:
    pass  # ignore the error, e.g. when the table does not exist

<hr style="height:1px;border:none">
<b style = 'font-size:18px;font-family:Arial'>3.4 Store Embeddings as Columns</b>

<p style = 'font-size:16px;font-family:Arial'>
Creates a table <code>sentiment_topics_embeddings_store</code> by converting the sentence embeddings into individual float columns using <code>vector_to_columns</code>.
</p>

In [None]:
tdml.execute_sql("""
create table sentiment_topics_embeddings_store as (
    select 
            *
    from ivsm.vector_to_columns(
            on sentiment_topics_embeddings
            using
                ColumnsToPreserve('id', 'txt') 
                VectorDataType('FLOAT32')
                VectorLength(384)
                OutputColumnPrefix('emb_')
                InputColumnName('sentence_embedding')
        ) a 
) with data
""")

In [None]:
tdml.DataFrame('sentiment_topics_embeddings_store').head()

In [None]:
try:
    tdml.db_drop_table("semantic_search_results")
except Exception:
    pass  # ignore the error, e.g. when the table does not exist

<hr style="height:1px;border:none">
<b style = 'font-size:18px;font-family:Arial'>3.5 Semantic Search Results Table</b>

<p style = 'font-size:16px;font-family:Arial'>
Creates <code>semantic_search_results</code> table by finding the most similar sentiment topic for each complaint using cosine similarity on embeddings.
</p>


In [None]:
tdml.execute_sql("""
create multiset table semantic_search_results
as (
SELECT 
    dt.target_id,
    dt.reference_id,
    e_tgt.txt as target_txt,
    e_ref.txt as reference_txt,
    (1.0 - dt.distance) as similarity 
FROM
    TD_VECTORDISTANCE (
        ON complaint_embeddings_store  AS TargetTable
        ON sentiment_topics_embeddings_store AS ReferenceTable DIMENSION
        USING
            TargetIDColumn('id')
            TargetFeatureColumns('[emb_0:emb_383]')
            RefIDColumn('id')
            RefFeatureColumns('[emb_0:emb_383]')
            DistanceMeasure('cosine')
            topk(1) -- keep only the best match per complaint; increase it for multi-label or multi-class matching
    ) AS dt
JOIN complaint_embeddings_store e_tgt on e_tgt.id = dt.target_id
JOIN sentiment_topics_embeddings_store e_ref on e_ref.id = dt.reference_id
) with data
""")

In [None]:
tdml.DataFrame('semantic_search_results').head()
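
<p style = 'font-size:16px;font-family:Arial'>The matching logic can be illustrated locally with a small pure-Python sketch. It mirrors the query above only in outline: cosine distance is computed between each complaint vector and each reference vector, the closest reference is kept (top-k of 1), and similarity is reported as <code>1 - distance</code>. The 2-dimensional vectors, the IDs, and the <code>cosine_distance</code> and <code>best_match</code> helpers are hypothetical; the demo itself runs <code>TD_VECTORDISTANCE</code> in-database on 384-dimensional embeddings.</p>

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def best_match(target, references):
    """Return (reference_id, similarity) for the closest reference (top-k = 1)."""
    ref_id, distance = min(
        ((rid, cosine_distance(target, vec)) for rid, vec in references.items()),
        key=lambda pair: pair[1],
    )
    return ref_id, 1.0 - distance


# Hypothetical toy vectors standing in for the stored embeddings.
references = {1: [1.0, 0.0], 2: [0.0, 1.0]}        # e.g. two sentiment topics
complaints = {101: [0.9, 0.1], 102: [0.2, 0.8]}    # e.g. two complaints

for complaint_id, vector in complaints.items():
    reference_id, similarity = best_match(vector, references)
    print(complaint_id, reference_id, round(similarity, 3))
```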

In [None]:
df = tdml.DataFrame('semantic_search_results')
df[df['reference_txt'] == 'Negative or Abusive comment']

In [None]:
df[df['reference_txt'] == 'Positive and Upbeat comment']

<hr style="height:2px;border:none">
<b style = 'font-size:20px;font-family:Arial'>4. Cleanup</b>
<p style = 'font-size:16px;font-family:Arial'>The following code removes the context, closing the connection to Vantage.</p>

In [None]:
tdml.remove_context()

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial'>Dataset:</b>

- `CustomerId`: Customer ID
- `Customer_Complaint`: Complaint text

<p style = 'font-size:16px;font-family:Arial'><b>Links:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li>Teradataml Python reference: <a href = 'https://docs.teradata.com/search/all?query=Python+Package+User+Guide&content-lang=en-US'>here</a></li>
</ul>

<footer style="padding-bottom:35px; border-bottom:3px solid">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2025. All Rights Reserved
        </div>
    </div>
</footer>