<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Semantic Similarity using Open Source Language Models in Database
  <br>
              <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial'>
Semantic similarity refers to the degree to which two pieces of text, words, or concepts have similar meanings. It measures how closely two entities are related based on their meanings rather than their surface forms or literal text. The similarity can hold between synonyms (e.g., car and automobile), between related concepts (e.g., doctor and nurse), or between phrases (e.g., "she enjoys reading books" and "she loves to read").
</p>
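
<p style = 'font-size:16px;font-family:Arial'>To illustrate why surface forms alone are not enough, the following minimal sketch scores the example pairs above with a purely lexical measure (Jaccard overlap of lowercase tokens). The helper name <code>jaccard_token_overlap</code> is hypothetical and is not part of this demo; it only shows that pairs with closely related meanings can share few or no words.</p>

```python
def jaccard_token_overlap(a: str, b: str) -> float:
    """Share of distinct lowercase tokens that two texts have in common."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Example pairs taken from the introduction above
pairs = [
    ("car", "automobile"),
    ("doctor", "nurse"),
    ("she enjoys reading books", "she loves to read"),
]

scores = {pair: jaccard_token_overlap(*pair) for pair in pairs}
for (left, right), score in scores.items():
    print(f"{left!r} vs {right!r}: lexical overlap = {score:.3f}")
```

<p style = 'font-size:16px;font-family:Arial'>All three pairs score low on word overlap even though their meanings are close, which is the gap that embedding-based similarity is meant to fill.</p>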

<p style = 'font-size:18px;font-family:Arial'><b>Applications of Semantic Similarity:</b></p>
<p style = 'font-size:16px;font-family:Arial'>Semantic similarity is used in a variety of applications, for example:</p>
<ul style = 'font-size:16px;font-family:Arial'>
            <li>Natural Language Processing (NLP): Used in tasks like text summarization, question answering, and machine translation.</li>
            <li>Information Retrieval: Helps search engines return results that are conceptually related to the user's query.</li>
            <li>Recommendation Systems: Suggests similar items based on their semantic meaning.</li>
</ul>

<p style = 'font-size:18px;font-family:Arial'><b>Why Vantage?</b></p>  
<p style = 'font-size:16px;font-family:Arial'>Teradata Vantage integrates with LLM services such as Amazon Bedrock, and the emerging Open Analytics Framework in the cloud lake can host a language model. For many on-premises customers, however, it is not practical to move large volumes of NLP data (such as complaints or emails) out of Teradata, score it, and load it back, even when Hugging Face models are available outside the database. Moving huge volumes of historical data out of Vantage for an NLP model to transform offers little advantage because the latency is high. Moreover, on-premises customers may not have access to cloud-hosted LLMs or to the Open Analytics Framework, and therefore cannot get any AI workload going today. Bringing the language models into Vantage bridges this gap and enables on-premises customers to run NLP models in the database.</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>1. Connect to Vantage</b></p>

<p style = 'font-size:16px;font-family:Arial'>Let's start by importing the libraries needed.</p>

In [None]:
# Standard libraries
import getpass
import warnings
import time

# Teradata libraries
from teradataml import *
display.max_rows = 5

# Other libraries
import pandas as pd  # used later to build the topics DataFrame
from IPython.display import display, Markdown

configure.byom_install_location = "mldb"

# Suppress warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

<p style = 'font-size:16px;font-family:Arial'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell. Begin running steps with Shift + Enter keys.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql("SET query_band='DEMO=Language_Model_Semantic_Similarity_Python.ipynb;' UPDATE FOR SESSION;")

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>2. Confirmation for Model</b></p>
<p style = 'font-size:16px;font-family:Arial'>Before starting let us confirm that the required model is installed.</p>

In [None]:
model_name = "bge-small-en-v1.5"

In [None]:
from IPython.display import display, Markdown

df_check= DataFrame.from_query(f'''select (select 1 as cnt from embeddings_models where model_id = '{model_name}') +
(select 1 as cnt from embeddings_tokenizers where model_id =  '{model_name}') as cnt''')
if df_check.get_values()[0][0] == 2:
    print('Model is installed, please continue.')
else:
    print('Model is not installed, please go to the Initialization notebook before proceeding further.')
    display(Markdown("[Initialization Notebook](./Initialization_and_Model_Load.ipynb)"))

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'> <b>3. Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial'>We have provided data for this demo on cloud storage. Since we are using embeddings stored in Vantage for this demo we will only use the option of creating table locally.</p>

In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_ComplaintAnalysis_local');"
# takes about 30 seconds, estimated space: 3 MB

<p style = 'font-size:16px;font-family:Arial'>Optional step – We should execute the below step only if we want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>4. Creating Embeddings on Source Data</b></p>

<p style = 'font-size:16px;font-family:Arial'>The data consists of consumer complaints from the <a href = 'https://www.consumerfinance.gov'>CFPB website</a>, which we have loaded into a table for our demo. Let us see what the data looks like.</p>

In [None]:
tdf = DataFrame('"DEMO_ComplaintAnalysis"."Consumer_Complaints"')
tdf

<b style = 'font-size:18px;font-family:Arial;'>4.1 Generate Embeddings with ONNXEmbeddings</b>

<p style = 'font-size:16px;font-family:Arial;'>
Now it's time to generate the embeddings using <b>ONNXEmbeddings</b>.<br>We run the ONNXEmbeddings function to generate embeddings for a small subset of records. The model is <b>loaded into the cache memory on each node</b>, and Teradata's <b>Massively Parallel Processing (MPP)</b> architecture ensures that embeddings are computed in parallel using <b>ONNX Runtime</b> on each node.<br>That said, generating embeddings for the entire training set can be time-consuming, especially on a system with limited resources. In the <b>ClearScape Analytics experience</b>, only a <b>4 AMP system</b> with constrained RAM and CPU power is available. To ensure smooth execution, we test embedding generation on a small sample and use <b>pre-calculated embeddings</b> for the remainder of the demo. In a real-life scenario you would typically encounter several hundred AMPs with much more compute power!<br>The most important input parameters of the <b>ONNXEmbeddings</b> function are listed below.
</p>
<ul style = 'font-size:16px;font-family:Arial;'>
<li><b>InputTable</b>: The source table containing the text to be embedded. </li>
<li><b>ModelTable</b>: The table storing the ONNX model.                    </li>
<li><b>TokenizerTable</b>: The table storing the tokenizer JSON file.       </li>
<li><b>Accumulate</b>: Specifies additional columns to retain in the output.</li>
<li><b>OutputFormat</b>: Specifies the data format of the output embeddings (<b>FLOAT32(384)</b>, matching the model's output dimension).</li>
</ul>
<p style = 'font-size:16px;font-family:Arial;'>
Since embedding generation is computationally expensive, we only process <b>100 records for testing</b> and rely on precomputed embeddings for further analysis.  
</p>
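
<p style = 'font-size:16px;font-family:Arial;'>
As a minimal sketch of how the output dimension ties the steps together: this notebook sets the dimension to 384, passes it to <b>OutputFormat</b> as <code>FLOAT32(384)</code>, and later refers to the resulting embedding columns with the range <code>[emb_0:emb_383]</code>. The helper <code>feature_column_range</code> below is hypothetical (it is not part of teradataml) and assumes the <code>emb_</code> column prefix seen in the precomputed embeddings used later in this notebook.
</p>

```python
# Output dimension of bge-small-en-v1.5, as used in this notebook
number_dimensions_output = 384


def feature_column_range(n_dims: int, prefix: str = "emb_") -> str:
    """Build the column-range string covering n_dims embedding columns.

    Hypothetical helper: it assumes zero-based column names such as
    emb_0 ... emb_383, as referenced later in this notebook.
    """
    if n_dims < 1:
        raise ValueError("n_dims must be at least 1")
    return f"[{prefix}0:{prefix}{n_dims - 1}]"


# Both strings are derived from the same number, so they cannot drift apart
output_format = f"FLOAT32({number_dimensions_output})"
feature_columns = feature_column_range(number_dimensions_output)

print(output_format)
print(feature_columns)
```

<p style = 'font-size:16px;font-family:Arial;'>
Deriving both strings from a single variable helps avoid a mismatch between the declared output format and the columns referenced downstream.
</p>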

In [None]:
tdf_sample = tdf.iloc[:100, :]
tdf_sample=tdf_sample.assign(drop_columns = True,
                             id = tdf_sample.complaint_id,
                             txt= tdf_sample.consumer_complaint_narrative)
tdf_sample

In [None]:
my_model = DataFrame.from_query(f"select * from embeddings_models where model_id = '{model_name}'")
my_tokenizer = DataFrame.from_query(f"select model as tokenizer from embeddings_tokenizers where model_id = '{model_name}'")

In [None]:
number_dimensions_output = 384

In [None]:
DF_embeddings_sample=ONNXEmbeddings(
    newdata = tdf_sample,
    modeldata = my_model, 
    tokenizerdata = my_tokenizer, 
    accumulate = ["id", "txt"],
    model_output_tensor = "sentence_embedding",
    output_format = f'FLOAT32({number_dimensions_output})',
    enable_memory_check = False
).result

In [None]:
# DF_embeddings_sample already holds the .result DataFrame from the previous cell
DF_embeddings_sample

<p style = 'font-size:16px;font-family:Arial;'>Here we can see the embeddings generated for the consumer_complaint_narrative column. For further analysis we will use the precomputed embeddings.</p>

In [None]:
tdf_embeddings_store = DataFrame(in_schema('DEMO_ComplaintAnalysis', 'Complaints_Embeddings_Store'))

In [None]:
tdf_embeddings_store.shape

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>5. Topics Data</b></p>


<p style = 'font-size:16px;font-family:Arial'> Now let us create a list of topics for which we will do our search.</p>

In [None]:
df_topic = pd.DataFrame({'id': [1,2,3,4,5,6],
      'txt': ['Fraudulent activity with Debit Cards at Wells Fargo',
              'Identity theft issues at Citibank',
              'Multiple account openings without authorization',
              'Irresponsible behavior by customer support',
              'App issues when transacting with bank',
              'Cant get money out of ATM',
              ]})

copy_to_sql(df_topic,table_name='topics_of_interest', if_exists='replace', index=False)

In [None]:
df_topic

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>6. Generating Embedding for Topics Data</b></p>
<p style = 'font-size:16px;font-family:Arial'>We will generate the embeddings for the topics data in the same way as we did for the source data in section 4.</p>

In [None]:
DF_topic_embeddings = ONNXEmbeddings(
    newdata = DataFrame('topics_of_interest'),
    modeldata = my_model, 
    tokenizerdata = my_tokenizer, 
    accumulate = ["id", "txt"],
    model_output_tensor = "sentence_embedding",
    output_format = f'FLOAT32({number_dimensions_output})',
    enable_memory_check = False
).result

In [None]:
copy_to_sql(DF_topic_embeddings,table_name='topics_embeddings_store', if_exists='replace')

In [None]:
df_topic = DataFrame('topics_embeddings_store')
df_topic

<p style = 'font-size:16px;font-family:Arial'> As we can see from the above, we have generated embeddings for the topic data.</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>9. Semantic Similarity</b></p>
<p style = 'font-size:16px;font-family:Arial'>Now we will run Semantic Similarity of the Topics Embeddings against the Complaints Embeddings table. Vector Distance is a measure of the similarity or dissimilarity between two vectors in multidimensional space. We will use Vantage's TD_VectorDistance function. The <b>TD_VectorDistance</b> function accepts a table of target vectors and a table of reference vectors and returns a table that contains the distance between target-reference pairs. </p>
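
<p style = 'font-size:16px;font-family:Arial'>The query that follows uses the cosine distance measure and converts it to a similarity as <code>1.0 - distance</code>, keeping only pairs with a distance below 0.3. The sketch below reproduces that arithmetic in plain Python on made-up 3-dimensional vectors (real embeddings in this demo have 384 dimensions). The function names are illustrative and unrelated to the TD_VectorDistance implementation.</p>

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance, defined as 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


# Toy vectors (made-up values, for illustration only)
complaint = [1.0, 2.0, 0.0]
topic_close = [2.0, 4.0, 0.0]  # same direction as the complaint vector
topic_far = [0.0, 0.0, 3.0]    # orthogonal to the complaint vector

d_close = cosine_distance(complaint, topic_close)
d_far = cosine_distance(complaint, topic_far)

threshold = 0.3  # same cut-off as the WHERE clause in the query
for name, dist in [("topic_close", d_close), ("topic_far", d_far)]:
    kept = dist < threshold
    print(f"{name}: distance={dist:.3f} similarity={1.0 - dist:.3f} kept={kept}")
```

<p style = 'font-size:16px;font-family:Arial'>Vectors pointing in the same direction have a distance near 0 (similarity near 1), while orthogonal vectors have a distance of 1 (similarity 0) and are filtered out by the threshold.</p>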

In [None]:
qry= '''
create multiset table semantic_search_results
as (
SELECT 
    dt.target_id,
    dt.reference_id,
    e_tgt.consumer_complaint_narrative as target_txt,
    e_ref.txt as reference_txt,
    (1.0 - dt.distance) as similarity 
FROM
    TD_VECTORDISTANCE (
        ON (select * from DEMO_ComplaintAnalysis.Complaints_Embeddings_Store a) AS TargetTable
        ON topics_embeddings_store AS ReferenceTable DIMENSION
        USING
            TargetIDColumn('id')
            TargetFeatureColumns('[emb_0:emb_383]')
            RefIDColumn('id')
            RefFeatureColumns('[emb_0:emb_383]')
            DistanceMeasure('cosine')
            topk(1) -- Only want the best match per complaint. If you want multi-label/multi-class - you can increase it
    ) AS dt
JOIN DEMO_ComplaintAnalysis.Consumer_Complaints e_tgt on e_tgt.complaint_id = dt.target_id
JOIN topics_embeddings_store e_ref on e_ref.id = dt.reference_id
WHERE dt.distance < 0.3 -- Cosine Similarity of 0.7 or greater
) with data;
'''

try:
    execute_sql(qry)
except Exception:
    # The table most likely exists from an earlier run: drop it and retry
    db_drop_table('semantic_search_results')
    execute_sql(qry)
print("Semantic Search Results table created")
    


<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial'><b>10. Check Matches</b></p>
<p style = 'font-size:16px;font-family:Arial'>Let us review the matches stored in the results table, starting with the most similar ones.</p>

In [None]:
df_results = DataFrame('semantic_search_results')

# Display the results with the highest similarity first
df_results.sort(['similarity'],ascending= False)

In [None]:
#displaying the top 2 records for each reference_id from the similarity result created
window = df_results.window(partition_columns="reference_id",
                           order_columns="similarity",
                           sort_ascending=False)
df = window.rank()
df[df.col_rank.isin([1,2])].sort(['reference_id','col_rank']).head(10)
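
<p style = 'font-size:16px;font-family:Arial'>For readers less familiar with window functions, the following plain-Python sketch shows the same idea as the cell above: group the rows by <code>reference_id</code>, order each group by descending similarity, and keep the top two. The rows are made-up values, and the helper <code>top_n_per_group</code> is hypothetical. Note one simplification: it numbers rows sequentially within each group, whereas a rank function assigns equal ranks to tied values.</p>

```python
from collections import defaultdict

# Made-up result rows, for illustration only
rows = [
    {"reference_id": 1, "target_id": 101, "similarity": 0.91},
    {"reference_id": 1, "target_id": 102, "similarity": 0.75},
    {"reference_id": 1, "target_id": 103, "similarity": 0.83},
    {"reference_id": 2, "target_id": 201, "similarity": 0.72},
    {"reference_id": 2, "target_id": 202, "similarity": 0.88},
]


def top_n_per_group(rows, group_key, score_key, n=2):
    """Return the n highest-scoring rows of each group with a col_rank field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    result = []
    for key in sorted(groups):
        # Highest score first within the group
        ranked = sorted(groups[key], key=lambda r: r[score_key], reverse=True)
        for position, row in enumerate(ranked[:n], start=1):
            result.append({**row, "col_rank": position})
    return result


top_matches = top_n_per_group(rows, "reference_id", "similarity", n=2)
for match in top_matches:
    print(match)
```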

<p style = 'font-size:20px;font-family:Arial'><b>Conclusion</b></p>
<p style = 'font-size:16px;font-family:Arial'>In this demo we have seen how to take a Hugging Face embedding model (BAAI/bge-small-en-v1.5) in ONNX format and run it in the database, in parallel, to create embeddings. We then used the TD_VectorDistance function with the cosine measure to find the topics most similar to each complaint.</p>

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial'>11. Cleanup</b>

<p style = 'font-size:18px;font-family:Arial'> <b>Work Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables created above.</p>

In [None]:
tables = ['topics_embeddings_store','semantic_search_results']

# Loop through the list of tables and execute the drop table command for each table
for table in tables:
    try:
        db_drop_table(table_name=table)
    except Exception:
        # Ignore tables that do not exist or were already dropped
        pass
    


<p style = 'font-size:18px;font-family:Arial'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_ComplaintAnalysis');"        # Takes 10 seconds

In [None]:
remove_context()

<footer style="padding-bottom:35px; border-bottom:3px solid #91A0Ab">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2024. All Rights Reserved
        </div>
    </div>
</footer>