<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Semantic Similarity using Open Source Language Models in Database
  <br>
              <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial'>
Semantic similarity is the degree to which two pieces of text, words, or concepts share the same meaning. It measures how closely two items are related by meaning rather than by surface form or literal wording. Similar items can be synonyms (e.g. car and automobile), related concepts (e.g. doctor and nurse), or phrases (e.g. "she enjoys reading books" and "she loves to read").
</p>

<p style = 'font-size:18px;font-family:Arial'><b>Applications of Semantic Similarity:</b></p>
<p style = 'font-size:16px;font-family:Arial'>Semantic similarity is used in a variety of applications, for example:</p>
<ul style = 'font-size:16px;font-family:Arial'>
            <li>Natural Language Processing (NLP): Used in tasks like text summarization, question answering, and machine translation.</li>
            <li>Information Retrieval: Helps search engines return results that are conceptually related to the user's query.</li>
            <li>Recommendation Systems: Suggests similar items based on their semantic meaning.</li>
</ul>

<p style = 'font-size:18px;font-family:Arial'><b>Why Vantage?</b></p>  
<p style = 'font-size:16px;font-family:Arial'>Teradata integrates with LLM services such as Amazon Bedrock, and the emerging Open Analytics Framework in the Cloud Lake can host a language model. For many on-prem customers, however, it is not practical to move large volumes of NLP data (such as complaints or emails) out of Teradata, score it, and load it back, even when Hugging Face models are available outside the database. Moving large volumes of historical data out of Vantage for NLP models to transform offers little advantage because the latency is high. Moreover, on-prem customers may not have access to cloud LLMs or to the Open Analytics Framework, and so cannot get started with AI today. Bringing language models into Vantage bridges this gap and lets on-prem customers run NLP models in the database.</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>1. Connect to Vantage</b></p>

<p style = 'font-size:16px;font-family:Arial'>Let's start by importing the libraries needed.</p>

In [None]:
# Standard libraries
import getpass
import warnings
import time

# Teradata libraries
from teradataml import *
display.max_rows = 5

# Other libraries (pandas is used to build the topics DataFrame in section 7)
import pandas as pd
from IPython.display import display, Markdown

# Suppress warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

<p style = 'font-size:16px;font-family:Arial'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell. Begin running steps with Shift + Enter keys.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql("SET query_band='DEMO=Language_Model_Semantic_Similarity_Python.ipynb;' UPDATE FOR SESSION;")

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>2. Confirmation for functions</b></p>
<p style = 'font-size:16px;font-family:Arial'>Before starting let us confirm that the required functions are installed.</p>

In [None]:
df_check= DataFrame.from_query('''select count(*) as cnt from dbc.tablesV where databasename = 'ivsm';''')
if df_check.get_values()[0][0] >= 10:
    print('Functions are installed, please continue.')
else:
    print('Functions are not installed, please execute the Initialization_and_Model_Load notebook before proceeding further.  When completed, please return and run this cell again.')
    display(Markdown("[Initialization Notebook](./Initialization_and_Model_Load.ipynb)"))

<hr style="height:2px;border:none;">
<p style = 'font-size:18px;font-family:Arial'> <b>3. Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. Because this demo uses embeddings stored in Vantage, we will only use the option of creating the tables locally.</p>

In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_ComplaintAnalysis_local');"
# takes about 30 seconds, estimated space: 3 MB

<p style = 'font-size:16px;font-family:Arial'>Optional step – We should execute the below step only if we want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>4. Confirmation for Models Loaded in Vantage</b></p>

<p style = 'font-size:16px;font-family:Arial'>The tokenizer.json and model.onnx files are created from a Hugging Face embedding model and must already have been uploaded with the "save_byom" function in the Initialization_and_Model_Load notebook.</p>


In [None]:
df_token = DataFrame('embeddings_tokenizers')
df_token

In [None]:
df_model = DataFrame("embeddings_models")
df_model

<p style = 'font-size:16px;font-family:Arial'>The tables above, which store the model and the tokenizer, are replicated across all the AMPs in the database, so embedding creation runs in parallel.</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>5. Creating Embeddings on Source Data</b></p>

<p style = 'font-size:16px;font-family:Arial'>The data comes from consumer complaints published on the <a href = 'https://www.consumerfinance.gov'>CFPB website</a>, which we have loaded into a table for this demo. Let us see what the data looks like.</p>

In [None]:
df = DataFrame('"DEMO_ComplaintAnalysis"."Consumer_Complaints"')
df

<p style = 'font-size:16px;font-family:Arial'>For the tokenizer function to run, the underlying table needs only two columns, named <b>id</b> and <b>txt</b>.<br>If the table does not have those columns, we can either rename existing columns or create a view that exposes at least id and txt. <b>id</b> holds the unique id of the row and <b>txt</b> holds the key text field on which we will create the embeddings and run the semantic search. Ideally, we create a two-column dataset and, after the embeddings run, join back to the original dataset using id to minimize IO and memory overhead.<br>For our use case we will rename complaint_id to id and consumer_complaint_narrative to txt in the view when we create the embeddings.</p>

<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>5.1 Creating Tokens</b></p>
<p style = 'font-size:16px;font-family:Arial'>In this first step we will create tokens for the txt column for which we are generating embeddings. We do this by creating a view that calls tokenizer_encode() on the Consumer_Complaints table, using the tokenizer.json stored in the embeddings_tokenizers table. Because our system is small (2 nodes, 4 AMPs), we take only 1000 records for this demo.</p>

In [None]:
qry = ('''
replace view v_complaints_tokenized_for_embeddings as (
    select
        id,
        txt,
        IDS as input_ids,
        attention_mask
    from ivsm.tokenizer_encode(
        on (select top 1000 complaint_id as id, consumer_complaint_narrative as txt 
            from DEMO_ComplaintAnalysis.Consumer_Complaints)
        on (select model as tokenizer from embeddings_tokenizers 
            where model_id = 'bge-small-en-v1.5') DIMENSION
        USING
            ColumnsToPreserve('id', 'txt')
            OutputFields('IDS', 'ATTENTION_MASK')
            MaxLength(1024)
            PadToMaxLength('True')
            TokenDataType('INT64')
    ) a
)
''')
try:
    execute_sql(qry)
    print('View Created')
except Exception as e:
    print('View creation failed')
    print(f"Error: {e}")
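
<p style = 'font-size:16px;font-family:Arial'>As a rough illustration of what the <b>input_ids</b> and <b>attention_mask</b> columns represent when MaxLength and PadToMaxLength are set, the sketch below pads a list of token ids to a fixed length in plain Python. It is not the implementation of tokenizer_encode(): the helper name, the pad id of 0, the right-side padding, and the sample token ids are all assumptions made for illustration; the real values depend on the tokenizer.json that is loaded.</p>

```python
def pad_to_max_length(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token ids and build the matching attention mask.

    pad_id=0 and right-side padding are assumptions for this sketch only.
    """
    kept = list(token_ids[:max_length])      # truncate anything beyond max_length
    attention_mask = [1] * len(kept)         # 1 marks a real token
    padding = max_length - len(kept)
    return kept + [pad_id] * padding, attention_mask + [0] * padding  # 0 marks padding


# Hypothetical token ids for a short sentence (not produced by a real tokenizer)
input_ids, attention_mask = pad_to_max_length([101, 2023, 2003, 102], max_length=8)
print(input_ids)       # [101, 2023, 2003, 102, 0, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

<p style = 'font-size:16px;font-family:Arial'>Padding every row to the same length gives the model fixed-size inputs, and the attention mask tells it which positions to ignore.</p>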


<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial'><b>5.2 Creating Embeddings</b></p>
<p style = 'font-size:16px;font-family:Arial'>In this next step we will create embeddings in binary form from the tokens produced by the view created in step 5.1.</p>

In [None]:
qry = ('''
replace view complaints_embeddings as (
    select 
            *
    from ivsm.IVSM_score(
            on v_complaints_tokenized_for_embeddings  -- table with data to be scored
            on (select * from embeddings_models where model_id = 'bge-small-en-v1.5') dimension
            using
                ColumnsToPreserve('id', 'txt') -- columns to be copied from input table
                ModelType('ONNX') -- model format
                BinaryInputFields('input_ids', 'attention_mask') -- enables binary input vectors
                BinaryOutputFields('sentence_embedding')
                Caching('inquery') -- turn on model caching within the query
        ) a 
)
''')
try:
    execute_sql(qry)
    print('View Created')
except Exception as e:
    print('View creation failed')
    print(f"Error: {e}")

<p style = 'font-size:18px;font-family:Arial'><b>5.3 Creating Final Embeddings table</b></p>
<p style = 'font-size:16px;font-family:Arial'>In this last step we will create the embeddings table with one column per embedding dimension, essentially converting the binary array into separate columns.</p>
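
<p style = 'font-size:16px;font-family:Arial'>To make the array-to-columns conversion concrete, here is a minimal sketch in plain Python that unpacks a packed FLOAT32 vector into prefixed columns, mirroring the VectorDataType, VectorLength, and OutputColumnPrefix settings used in this step. The helper name <b>unpack_embedding</b> is hypothetical, and the little-endian byte order is an assumption of this sketch, not a documented property of the in-database function.</p>

```python
import struct


def unpack_embedding(blob, vector_length, prefix="emb_"):
    """Unpack a packed float32 vector into a dict of prefixed columns.

    Little-endian byte order ("<") is an assumption made for this sketch.
    """
    expected = 4 * vector_length  # a float32 value occupies 4 bytes
    if len(blob) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(blob)}")
    values = struct.unpack(f"<{vector_length}f", blob)
    return {f"{prefix}{i}": value for i, value in enumerate(values)}


# Toy 3-dimensional vector; the demo model produces 384 dimensions
blob = struct.pack("<3f", 0.5, -1.0, 2.0)
print(unpack_embedding(blob, vector_length=3))
# {'emb_0': 0.5, 'emb_1': -1.0, 'emb_2': 2.0}
```

<p style = 'font-size:16px;font-family:Arial'>With a vector length of 384, the same idea yields the columns emb_0 to emb_383.</p>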

<p style = 'font-size:18px;font-family:Arial'><b> Do you want to generate the embeddings?</b></p>    
<p style = 'font-size:16px;font-family:Arial'>Generating embeddings will take around <b>35-40 minutes.</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have already generated embeddings for the Consumer_Complaints and stored them in <b>Vantage</b> table.</p>
 
<center><img src="images/decision_emb_gen_1.svg" alt="embeddings_decision" width=300 height=400/></center>
 
<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial'><i><b>Note: If you would like to skip the embedding generation step to save time and move quickly to the next step, please enter "No" in the next prompt.</b></i></p>
</div>
 
<p style = 'font-size:16px;font-family:Arial'>To save time, you can move to the already generated embeddings section. However, if you would like to see how we generate the embeddings, or if you need to generate the embeddings for a different dataset, then continue to the following section.</p>

In [None]:
# Request user's input
generate = input("Do you want to generate embeddings? ('yes'/'no'): ")

# Check the user's input
if generate.lower() == 'yes':
    print("\nGreat! We'll start by generating embeddings.")

    print("\nGenerating embeddings and Saving to the database, please wait...")
    # start = time.time()
    qry=''' create multiset table complaints_embeddings_store as (
            select 
            *
            from ivsm.vector_to_columns(
            on complaints_embeddings
            using
                ColumnsToPreserve('id', 'txt') 
                VectorDataType('FLOAT32')
                VectorLength(384)
                OutputColumnPrefix('emb_')
                InputColumnName('sentence_embedding')
             ) a 
             ) with data primary index(id);
        '''

    # Drop any table left over from a previous run so that CREATE TABLE can succeed
    try:
        db_drop_table('complaints_embeddings_store')
    except Exception:
        pass  # nothing to drop: the table does not exist yet

    print("Embedding process started at", time.ctime())
    start = time.time()
    execute_sql(qry)
    end = time.time()
    print('Table Created')
    print("Total time to run tokenization+embeddings took = ", (end-start)/60, " min on 2nodes 4Amp VM")
    df_emb = DataFrame('complaints_embeddings_store')

    print("\nEmbeddings generated and saved successfully!")

elif generate.lower() == 'no':
    print("\nLoading embeddings from the Vantage table")
    df_emb = DataFrame('"DEMO_ComplaintAnalysis"."Complaints_Embeddings_Store"')
    
else:
    print("\nInvalid input. Please enter 'yes' or 'no' to proceed.")

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>6. Embeddings Data</b></p>
<p style = 'font-size:16px;font-family:Arial'>Let us review the Embeddings table we created on the Consumer Complaints dataset earlier.</p>

In [None]:
if generate.lower() == 'yes':
    df_emb = DataFrame('complaints_embeddings_store')
elif generate.lower() == 'no':
    df_emb = DataFrame('"DEMO_ComplaintAnalysis"."Complaints_Embeddings_Store"')
    
else:
    print("\nEmbeddings not created, please run section 5 first")

In [None]:
df_emb

<p style = 'font-size:16px;font-family:Arial'>As we can see from the above, each txt has a 384-dimensional embedding, stored in 384 columns (emb_0 to emb_383).</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>7. Topics Data</b></p>


<p style = 'font-size:16px;font-family:Arial'> Now let us create a list of topics for which we will do our search.</p>

In [None]:
df = pd.DataFrame({'id': [1,2,3,4,5,6],
      'txt': ['Fraudulent activity with Debit Cards at Wells Fargo',
              'Identity theft issues at Citibank',
              'Multiple account openings without authorization',
              'Irresponsible behavior by customer support',
              'App issues when transacting with bank',
              'Cant get money out of ATM',
              ]})

copy_to_sql(df,table_name='topics_of_interest', if_exists='replace', index=False)

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>8. Generating Embedding for Topics Data</b></p>
<p style = 'font-size:16px;font-family:Arial'>We will generate the embeddings for the Topics data in the same 3 steps explained earlier in section 5.</p>

In [None]:
qry = ('''
replace view v_topics_tokenized_for_embeddings as (
    select
        id,
        txt,
        IDS as input_ids,
        attention_mask
    from ivsm.tokenizer_encode(
        on (select * from topics_of_interest)
        on (select model as tokenizer from embeddings_tokenizers 
            where model_id = 'bge-small-en-v1.5') DIMENSION
        USING
            ColumnsToPreserve('id', 'txt')
            OutputFields('IDS', 'ATTENTION_MASK')
            MaxLength(1024)
            PadToMaxLength('True')
            TokenDataType('INT64')
    ) a
)
''')
try:
    execute_sql(qry)
    print('View Created')
except Exception as e:
    print('View creation failed')
    print(f"Error: {e}")

In [None]:
qry = ('''
replace view topics_embeddings as (
    select 
            *
    from ivsm.IVSM_score(
            on v_topics_tokenized_for_embeddings  -- table with data to be scored
            on (select * from embeddings_models where model_id = 'bge-small-en-v1.5') dimension
            using
                ColumnsToPreserve('id', 'txt') -- columns to be copied from input table
                ModelType('ONNX') -- model format
                BinaryInputFields('input_ids', 'attention_mask') -- enables binary input vectors
                BinaryOutputFields('sentence_embedding')
                Caching('inquery') -- turn on model caching within the query
        ) a 
)
''')
try:
    execute_sql(qry)
    print('View Created')
except Exception as e:
    print('View creation failed')
    print(f"Error: {e}")

In [None]:
qry = ('''
create table topics_embeddings_store as (
    select 
            *
    from ivsm.vector_to_columns(
            on topics_embeddings
            using
                ColumnsToPreserve('id', 'txt') 
                VectorDataType('FLOAT32')
                VectorLength(384)
                OutputColumnPrefix('emb_')
                InputColumnName('sentence_embedding')
        ) a 
) with data
''')
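# Drop any table left over from a previous run so that CREATE TABLE can succeed
try:
    db_drop_table('topics_embeddings_store')
except Exception:
    pass  # nothing to drop: the table does not exist yet
execute_sql(qry)
print('Table Created')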


In [None]:
df_topic = DataFrame('topics_embeddings_store')
df_topic

<p style = 'font-size:16px;font-family:Arial'> As we can see from the above, we have generated embeddings for the topic data.</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>9. Semantic Similarity</b></p>
<p style = 'font-size:16px;font-family:Arial'>Now we will run Semantic Similarity of the Topics Embeddings against the Complaints Embeddings table. Vector Distance is a measure of the similarity or dissimilarity between two vectors in multidimensional space. We will use Vantage's TD_VectorDistance function. The <b>TD_VectorDistance</b> function accepts a table of target vectors and a table of reference vectors and returns a table that contains the distance between target-reference pairs. </p>

In [None]:
# Two variants of the search query: qry1 reads the embeddings generated in section 5, qry2 the pre-generated ones
qry1= '''
create multiset table semantic_search_results
as (
SELECT 
    dt.target_id,
    dt.reference_id,
    e_tgt.consumer_complaint_narrative as target_txt,
    e_ref.txt as reference_txt,
    (1.0 - dt.distance) as similarity 
FROM
    TD_VECTORDISTANCE (
        ON (select * from complaints_embeddings_store a) AS TargetTable
        ON topics_embeddings_store AS ReferenceTable DIMENSION
        USING
            TargetIDColumn('id')
            TargetFeatureColumns('[emb_0:emb_383]')
            RefIDColumn('id')
            RefFeatureColumns('[emb_0:emb_383]')
            DistanceMeasure('cosine')
            topk(1) -- Only want the best match per complaint. If you want multi-label/multi-class - you can increase it
    ) AS dt
JOIN DEMO_ComplaintAnalysis.Consumer_Complaints e_tgt on e_tgt.complaint_id = dt.target_id
JOIN topics_embeddings_store e_ref on e_ref.id = dt.reference_id
WHERE dt.distance < 0.3 -- Cosine Similarity greater than 0.7
) with data;
'''
qry2= '''
create multiset table semantic_search_results
as (
SELECT 
    dt.target_id,
    dt.reference_id,
    e_tgt.consumer_complaint_narrative as target_txt,
    e_ref.txt as reference_txt,
    (1.0 - dt.distance) as similarity 
FROM
    TD_VECTORDISTANCE (
        ON (select * from DEMO_ComplaintAnalysis.Complaints_Embeddings_Store a) AS TargetTable
        ON topics_embeddings_store AS ReferenceTable DIMENSION
        USING
            TargetIDColumn('id')
            TargetFeatureColumns('[emb_0:emb_383]')
            RefIDColumn('id')
            RefFeatureColumns('[emb_0:emb_383]')
            DistanceMeasure('cosine')
            topk(1) -- Only want the best match per complaint. If you want multi-label/multi-class - you can increase it
    ) AS dt
JOIN DEMO_ComplaintAnalysis.Consumer_Complaints e_tgt on e_tgt.complaint_id = dt.target_id
JOIN topics_embeddings_store e_ref on e_ref.id = dt.reference_id
WHERE dt.distance < 0.3 -- Cosine Similarity greater than 0.7
) with data;
'''

# Pick the query that matches the choice made in section 5
if generate.lower() == 'yes':
    qry = qry1
elif generate.lower() == 'no':
    qry = qry2
else:
    qry = None
    print("\nError creating the Semantic Search Results: please rerun section 5 and answer 'yes' or 'no'.")

if qry is not None:
    # Drop any table left over from a previous run so that CREATE TABLE can succeed
    try:
        db_drop_table('semantic_search_results')
    except Exception:
        pass  # nothing to drop: the table does not exist yet
    execute_sql(qry)
    print("Semantic Search Results table created")
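
<p style = 'font-size:16px;font-family:Arial'>The relationship between cosine distance, the similarity column, and the 0.3 threshold used in the query above can be sketched in plain Python. This is only an illustration of the metric on toy vectors, not of how TD_VectorDistance is implemented; the helper names <b>cosine_distance</b> and <b>best_match</b> and the sample vectors are hypothetical.</p>

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity; assumes non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def best_match(target, references, max_distance=0.3):
    """Return (reference_id, similarity) for the closest reference (top-1),
    or None when even the closest one is not within max_distance."""
    distances = {ref_id: cosine_distance(target, vec) for ref_id, vec in references.items()}
    ref_id = min(distances, key=distances.get)
    if distances[ref_id] >= max_distance:
        return None
    return ref_id, 1.0 - distances[ref_id]


# Toy 3-dimensional "topic" vectors; real embeddings have 384 dimensions
references = {1: [1.0, 0.0, 0.0], 2: [0.0, 1.0, 0.0]}
print(best_match([0.9, 0.1, 0.0], references))  # close to topic 1
print(best_match([0.0, 0.0, 1.0], references))  # orthogonal to both topics, so no match
```

<p style = 'font-size:16px;font-family:Arial'>As in the query, only the single best topic per complaint is considered, and it is kept only when its distance is below 0.3, that is, when the similarity is above 0.7.</p>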


<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>10. Check Matches</b></p>
<p style = 'font-size:16px;font-family:Arial'>Let us look at the matches, starting with the most similar ones.</p>

In [None]:
df_results = DataFrame('semantic_search_results')

# Display the results with the highest similarity first
df_results.sort(['similarity'],ascending= False)

In [None]:
#displaying the top 2 records for each reference_id from the similarity result created
window = df_results.window(partition_columns="reference_id",
                           order_columns="similarity",
                           sort_ascending=False)
df = window.rank()
df[df.col_rank.isin([1,2])].sort(['reference_id','col_rank']).head(10)
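
<p style = 'font-size:16px;font-family:Arial'>For readers less familiar with window functions, the "top 2 per reference_id" selection above can be approximated in plain Python as follows. The helper name <b>top_n_per_reference</b> and the sample rows are hypothetical. The sketch also numbers rows 1, 2, 3 within each group, so, unlike a SQL-style rank, tied similarities do not share a rank.</p>

```python
from collections import defaultdict


def top_n_per_reference(rows, n=2):
    """Keep the n most similar rows for each reference_id and number them from 1."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["reference_id"]].append(row)

    top = []
    for ref_id in sorted(groups):
        # Highest similarity first within each reference_id
        ranked = sorted(groups[ref_id], key=lambda r: r["similarity"], reverse=True)
        for position, row in enumerate(ranked[:n], start=1):
            top.append({**row, "col_rank": position})
    return top


# Hypothetical result rows, for illustration only
sample_rows = [
    {"reference_id": 1, "target_id": 10, "similarity": 0.80},
    {"reference_id": 1, "target_id": 11, "similarity": 0.90},
    {"reference_id": 1, "target_id": 12, "similarity": 0.75},
    {"reference_id": 2, "target_id": 13, "similarity": 0.71},
]
for row in top_n_per_reference(sample_rows):
    print(row)
```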

<p style = 'font-size:20px;font-family:Arial'><b>Conclusion</b></p>
<p style = 'font-size:16px;font-family:Arial'>In this demo we have seen how to take a Hugging Face embedding model (BAAI/bge-small-en-v1.5) in ONNX format and run it in parallel inside the database to create embeddings. We then used the TD_VectorDistance function with cosine distance to match each complaint to its most similar topic.</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>11. Cleanup</b></p>

<p style = 'font-size:18px;font-family:Arial'> <b>Work Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables created above.</p>

In [None]:
tables = ['complaints_embeddings_store', 'topics_embeddings_store','semantic_search_results','topics_of_interest']

# Loop through the list of tables and execute the drop table command for each table
for table in tables:
    try:
        db_drop_table(table_name=table)
    except:
        pass  
    
views = ['v_complaints_tokenized_for_embeddings','complaints_embeddings','v_topics_tokenized_for_embeddings',
         'topics_embeddings']   

for view in views:
    try:
        db_drop_view(view_name=view)
    except:
        pass 

<p style = 'font-size:18px;font-family:Arial'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_ComplaintAnalysis');"        # Takes 10 seconds

In [None]:
remove_context()

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2024. All Rights Reserved
        </div>
    </div>
</footer>