<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Entity Resolution with In-Database Embeddings and Analytics
 <br>       
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 150px; height: auto; margin-top: 20pt;">
  <br>
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>Introduction:</b></p>
<p style = 'font-size:16px;font-family:Arial'>
In most organizations, data originates from multiple systems — CRM, ERP, marketing platforms, and external sources — each storing information in different formats. This leads to duplicates, inconsistencies, and incomplete records across entities such as customers, suppliers, products, or locations.<br>
    <b>Entity Resolution (ER)</b>, also known as <b>Record Linking</b>, is the process of identifying and merging records that refer to the same real-world entity. It forms the foundation of Master Data Management (MDM), enabling organizations to achieve trusted, unified, and analytics-ready data.<br>
However, traditional ER approaches based on deterministic or rule-based matching often struggle with accuracy, scalability, and multi-lingual data, leading to high false positives and negatives. As data volumes and complexity grow, organizations need a smarter, faster, and more adaptive approach.<br>
<br>    <b>Solution Overview</b><br>    Using <b>Teradata ClearScape Analytics</b>, organizations can modernize Entity Resolution with a combination of string similarity metrics, vector embeddings, and machine learning techniques — all executed directly within the database.<br>
The approach enhances conventional text matching (e.g., Levenshtein, Jaro-Winkler) by incorporating semantic embeddings derived from pre-trained language models. These embeddings capture contextual meaning, allowing the system to identify similar entities even when data includes abbreviations, spelling errors, or multilingual variations.<br>
A machine learning model (trained using Teradata Vantage, H2O, or Scikit-Learn) uses both similarity scores and embedding distances as input features to classify potential matches. Once deployed in-database, the model performs high-speed matching across millions of records — improving accuracy by 5–15% compared to traditional methods, while maintaining deterministic, repeatable results.
    <ul style = 'font-size:16px;font-family:Arial'>Typical applications include:
        <li>Creating unified customer, vendor, or product master records</li>
        <li>Linking similar entities across multiple internal and external data sources</li>
        <li>Standardizing reference data for reporting, analytics, and AI initiatives</li>
        <li>Reducing redundancy and improving overall data governance</li>
        </ul>
    <p style = 'font-size:16px;font-family:Arial'><b>Why Teradata Vantage</b><br>
Teradata Vantage provides a scalable, enterprise-grade environment for advanced Entity Resolution, with these key advantages:</p>
<ul style = 'font-size:16px;font-family:Arial'>
    <li><b>In-Database Analytics:</b> Run feature engineering, similarity computation, and ML inference directly in Vantage — no data movement required.</li>
    <li><b>Bring Your Own Model (BYOM):</b> Seamlessly import and operationalize models trained in external frameworks like H2O or Scikit-Learn.</li>
    <li><b>Rich Analytical Ecosystem:</b> Native support for text similarity, NLP-based vector embeddings, and machine learning functions.</li>
    <li><b>Parallel CPU Processing:</b> Delivers high performance without the need for GPUs, even on large-scale datasets.</li>
    <li><b>End-to-End Governance:</b> Integrated security, scalability, and audit capabilities ensure trusted data management at enterprise scale.</li>
</ul>

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>1. Initiate a connection to Vantage</b>
<p style = 'font-size:16px;font-family:Arial;'>Let's start by importing the required libraries and connecting to the Vantage database. You will be prompted for the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
import warnings
warnings.filterwarnings('ignore')

from teradataml import *
import getpass
import time

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Entity_Resolution_Python.ipynb;' UPDATE FOR SESSION; ''')

<p style = 'font-size:16px;font-family:Arial;'>Begin running steps with Shift + Enter keys. </p>

<p style = 'font-size:18px;font-family:Arial;'><b>Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;'>We have provided data for this demo on cloud storage. You can either run the demo using foreign tables, which access the data without using any storage in your environment, or download the data to local storage, which may yield faster execution but requires available storage space. The following cell contains two statements, one of which is commented out. To switch modes, move the comment marker to the other statement.</p>

In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_Entity_Resolution_local');" # Takes 2 minutes
#%run -i ../run_procedure.py "call get_data('DEMO_Entity_Resolution_cloud');" # Takes 1 minute

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>2. Entity Datasets</b>
<p style = 'font-size:16px;font-family:Arial;'>Let's look at the two entity datasets that we have and how the entities in these tables match with each other.</p>


In [None]:
df_abt = DataFrame(in_schema('DEMO_Entity_Resolution', 'Item_About'))
df_abt

In [None]:
df_buy = DataFrame(in_schema('DEMO_Entity_Resolution', 'Item_Buy'))
df_buy

<p style = 'font-size:16px;font-family:Arial;'>We also have a reference matching table that tells us which id in the Item_About table corresponds to which id in the Item_Buy table.</p>

In [None]:
df_match = DataFrame(in_schema('DEMO_Entity_Resolution', 'Item_Abt_Buy_Match'))
df_match

In [None]:
df_m = DataFrame.from_query('''select a.idAbt, a.name as name_Abt, b.idBuy, b.name as name_Buy
from DEMO_Entity_Resolution.Item_Abt_Buy_Match m
inner join DEMO_Entity_Resolution.Item_About a
on m.idAbt = a.idAbt
inner join DEMO_Entity_Resolution.Item_Buy b
on m.idBuy = b.idBuy''')
df_m

<p style = 'font-size:16px;font-family:Arial;'>The results above show which ids in the Item_About table correspond to which ids in the Item_Buy table.</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>3. Create Embeddings for the Datasets</b>
<hr style="height:1px;border:none;">
<p style = 'font-size:18px;font-family:Arial;'><b>3.1 Load HuggingFace Model</b>
<p style = 'font-size:16px;font-family:Arial;'>To generate embeddings, we need an ONNX model capable of transforming text into vector representations. We use the pretrained <b>bge-small-en-v1.5</b> model from <a href="https://huggingface.co/Teradata/bge-small-en-v1.5">Teradata's Hugging Face repository</a>. The model and its tokenizer are downloaded and then stored in Vantage tables as BLOBs using the save_byom function.</p>

In [None]:
import os
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"

In [None]:
from huggingface_hub import hf_hub_download

model_name = "bge-small-en-v1.5"
number_dimensions_output = 384
model_file_name = "model.onnx" 

In [None]:
# Download Model from Teradata HuggingFace Page

hf_hub_download(repo_id=f"Teradata/{model_name}", filename=f"onnx/{model_file_name}", local_dir="./")
hf_hub_download(repo_id=f"Teradata/{model_name}", filename="tokenizer.json", local_dir="./")

In [None]:
try:
    db_drop_table("embeddings_models")
except:
    pass
try:
    db_drop_table("embeddings_tokenizers")
except:
    pass

In [None]:
# Load Models into Vantage
# a) Embedding model
save_byom(model_id = model_name, # must be unique in the models table
               model_file = f"onnx/{model_file_name}",
               table_name = 'embeddings_models' )
# b) Tokenizer
save_byom(model_id = model_name, # must be unique in the models table
              model_file = 'tokenizer.json',
              table_name = 'embeddings_tokenizers') 

In [None]:
df_model = DataFrame('embeddings_models')
df_model

In [None]:
df_token = DataFrame('embeddings_tokenizers')
df_token

<hr style="height:1px;border:none;">
<b style = 'font-size:18px;font-family:Arial;'>3.2 Generate Embeddings with ONNXEmbeddings</b>

<p style = 'font-size:16px;font-family:Arial;'>
Now it's time to generate the embeddings using <b>ONNXEmbeddings</b>.<br>We run the ONNXEmbeddings function to generate embeddings for a small subset of records. The model is <b>loaded into the cache memory on each node</b>, and Teradata's <b>Massively Parallel Processing (MPP)</b> architecture ensures that embeddings are computed in parallel using <b>ONNX Runtime</b> on each node.  <br>That said, generating embeddings for the entire training set can be time-consuming, especially on a system with limited resources. In the <b>ClearScape Analytics experience</b>, only a <b>4 AMP system</b> with constrained RAM and CPU power is available. To ensure smooth execution, we test embedding generation on a small sample and use <b>pre-calculated embeddings</b> for the remainder of the demo. In a real-life scenario you would typically work with several hundred AMPs and much more compute power.<br>The most important input parameters of the <b>ONNXEmbeddings</b> function are listed below.
<ul style = 'font-size:16px;font-family:Arial;'>
<li><b>InputTable</b>: The source table containing the text to be embedded. </li>
<li><b>ModelTable</b>: The table storing the ONNX model.                    </li>
<li><b>TokenizerTable</b>: The table storing the tokenizer JSON file.       </li>
<li><b>Accumulate</b>: Specifies additional columns to retain in the output.</li>
<li><b>OutputFormat</b>: Specifies the data format of the output embeddings (<b>FLOAT32(384)</b>, matching the model's output dimension).</li>
</ul>
<p style = 'font-size:16px;font-family:Arial;'>
Since embedding generation is computationally expensive, we only process <b>10 records for testing</b> and rely on precomputed embeddings for further analysis.  
</p>

In [None]:
configure.byom_install_location = "mldb"

In [None]:
DF_sample10 = DataFrame.from_query("SELECT t.idAbt, t.name as txt FROM DEMO_Entity_Resolution.Item_About t sample 10")

In [None]:
my_model = DataFrame.from_query(f"select * from embeddings_models where model_id = '{model_name}'")
my_tokenizer = DataFrame.from_query(f"select model as tokenizer from embeddings_tokenizers where model_id = '{model_name}'")

In [None]:
DF_embeddings_sample = ONNXEmbeddings(
    newdata = DF_sample10,
    modeldata = my_model, 
    tokenizerdata = my_tokenizer, 
    accumulate = ["idAbt"],
    model_output_tensor = "sentence_embedding",
    output_format = f'FLOAT32({number_dimensions_output})',
    enable_memory_check = False
).result

In [None]:
DF_embeddings_sample

<p style = 'font-size:16px;font-family:Arial;'>Here we can see the embeddings generated for the name text of the sampled Item_About rows. Pre-calculated embeddings for both the Item_About and Item_Buy tables are provided with the demo data and are used in the following steps.</p>

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>4. Feature Engineering</b>
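<p style = 'font-size:16px;font-family:Arial;'>Next, we create the features for the classifier. First, we use the bi-encoder embedding distances from TD_VECTORDISTANCE to select, for each Item_About row, its top-N nearest Item_Buy rows. Pairs that agree with the ground-truth match table become positive examples (match = 1); all other pairs become negative examples (match = 0).</p>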
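<p style = 'font-size:16px;font-family:Arial;'>To make the three embedding features concrete, here is a minimal plain-Python sketch of the same measures. It assumes, as the SQL in this section implies by computing 1.0 - distance, that the cosine measure is returned as a distance and converted to a similarity. The function names and the tiny 3-dimensional vectors are hypothetical stand-ins for the 384-dimensional embeddings; this is an illustration, not how TD_VECTORDISTANCE is implemented.</p>

```python
import math

def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan_distance(u, v):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))

def embedding_features(u, v):
    """Return one wide feature row for a pair, mirroring the pivoted SQL columns."""
    return {
        "emb_cosine": 1.0 - cosine_distance(u, v),  # stored as a similarity
        "emb_euclidean": euclidean_distance(u, v),
        "emb_manhattan": manhattan_distance(u, v),
    }

# Hypothetical vectors; real embeddings have 384 dimensions.
abt_vec = [1.0, 0.0, 0.0]
buy_vec = [1.0, 1.0, 0.0]
features = embedding_features(abt_vec, buy_vec)
print(features)
```

<p style = 'font-size:16px;font-family:Arial;'>A higher emb_cosine means the two names are closer in meaning, while higher Euclidean and Manhattan values mean they are further apart, so the classifier receives both a similarity and two distances for every candidate pair.</p>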

In [None]:
sql = """
create volatile table entity_match_results_temp
as (
SELECT 
    dt.target_id as idAbt,
    dt.reference_id as idBuy,
    x.idBuy ground_truth_buy_id,
    case when x.idBuy = dt.reference_id then 1 else 0 end as match,
    e_tgt.name as target_txt,
    e_ref.name as reference_txt,
    sum(case when dt.distancetype = 'cosine' then 1.0 - dt.distance else 0 end) as emb_cosine,
    sum(case when dt.distancetype = 'euclidean' then dt.distance else 0 end) as emb_euclidean,
    sum(case when dt.distancetype = 'manhattan' then dt.distance else 0 end) as emb_manhattan
FROM
    TD_VECTORDISTANCE (
        ON DEMO_Entity_Resolution.Item_About_Embeddings  AS TargetTable
        ON DEMO_Entity_Resolution.Item_Buy_embeddings AS ReferenceTable DIMENSION
        USING
            TargetIDColumn('idAbt')
            TargetFeatureColumns('[a_emb_0:a_emb_383]')
            RefIDColumn('idBuy')
            RefFeatureColumns('[b_emb_0:b_emb_383]')
            DistanceMeasure('cosine', 'euclidean','manhattan')
            topk(100) -- max allowed
    ) AS dt
JOIN DEMO_Entity_Resolution.Item_About e_tgt on e_tgt.idAbt = dt.target_id
JOIN DEMO_Entity_Resolution.Item_Buy e_ref on e_ref.idBuy = dt.reference_id
JOIN DEMO_Entity_Resolution.Item_Abt_Buy_Match x on x.idAbt = e_tgt.idAbt
GROUP BY 1,2,3,4,5,6
) with data on commit preserve rows
"""

execute_sql(sql)


In [None]:
tdf = DataFrame("entity_match_results_temp")
tdf

<p style = 'font-size:16px;font-family:Arial;'>Now we will create q-grams (character n-grams), which we will then use to build features with StringSimilarity.</p>
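<p style = 'font-size:16px;font-family:Arial;'>As a rough illustration of what a q-gram string looks like, the sketch below slides a window of length q over a name, trims each piece, and joins the pieces with spaces. The function name, the max_positions parameter (mirroring a sequence table of 200 start positions), and the sample name are assumptions made for this example only.</p>

```python
def qgrams(text, q, max_positions=200):
    """Return the overlapping length-q substrings of text, space-separated.

    max_positions caps the number of start positions considered.
    """
    last_start = len(text) - q + 1          # number of valid start positions
    grams = []
    for n in range(max(0, min(last_start, max_positions))):
        grams.append(text[n:n + q].strip())  # trim spaces inside each gram
    return " ".join(grams).rstrip()          # drop any trailing separator

# Hypothetical product name
print(qgrams("sony tv", 2))
print(qgrams("sony tv", 3))
```

<p style = 'font-size:16px;font-family:Arial;'>A name shorter than q produces no q-grams at all, which is worth keeping in mind when such tables are later joined back to the candidate pairs.</p>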

In [None]:
# Drop work tables from a previous run, if they exist
try:
    db_drop_table("entity_seq")
except Exception:
    pass  # table does not exist yet

for i in range(2,8):
    for x in ['About','Buy']:
        try:
            db_drop_table("{}_QGrams{}".format(x,i))
        except Exception:
            pass  # table does not exist yet

In [None]:
execute_sql("""
    CREATE VOLATILE TABLE entity_seq AS(
        SELECT (calendar_date - DATE '1900-01-01' + 1) AS n
        FROM sys_calendar.calendar
        WHERE calendar_date BETWEEN DATE '1900-01-01' AND DATE '1900-01-01' + 199
    ) WITH DATA PRIMARY INDEX(n) ON COMMIT PRESERVE ROWS;
""")

for i in range(2,8):
    for x in ['About']:
        execute_sql("""
            CREATE VOLATILE TABLE {}_QGrams{} AS (
                SELECT id as id, TRIM(TRAILING FROM XMLAGG(TRIM(QGrams) || ' ' ORDER BY n) (VARCHAR(10000))) AS QGrams
                    FROM 
                        ( SELECT idAbt as id, n, SUBSTRING(name FROM n FOR {}) AS QGrams
                            FROM DEMO_Entity_Resolution.item_{}, entity_seq
                            WHERE n <= CHAR_LENGTH(name) - {} + 1
                        ) as x
                GROUP BY id
            ) WITH DATA PRIMARY INDEX (id) ON COMMIT PRESERVE ROWS
        """.format(x,i,i,x,i))

for i in range(2,8):
    for x in ['Buy']:
        execute_sql("""
            CREATE VOLATILE TABLE {}_QGrams{} AS (
                SELECT id, TRIM(TRAILING FROM XMLAGG(TRIM(QGrams) || ' ' ORDER BY n) (VARCHAR(10000))) AS QGrams
                    FROM 
                        ( SELECT idBuy as id , n, SUBSTRING(name FROM n FOR {}) AS QGrams
                            FROM DEMO_Entity_Resolution.item_{}, entity_seq
                            WHERE n <= CHAR_LENGTH(name) - {} + 1
                        ) as x
                GROUP BY id
            ) WITH DATA PRIMARY INDEX (id) ON COMMIT PRESERVE ROWS
        """.format(x,i,i,x,i))        

<p style = 'font-size:16px;font-family:Arial;'>Let's take a look at what these q-grams look like.</p>

In [None]:
tdf2= DataFrame("About_Qgrams2")
tdf2

<p style = 'font-size:16px;font-family:Arial;'>Now we will calculate string similarity measures (for example Jaro, Jaro-Winkler, and Levenshtein) on the names and on their q-grams to create additional features.</p>
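<p style = 'font-size:16px;font-family:Arial;'>For intuition about what such measures capture, here is a small plain-Python sketch of two of them: Levenshtein edit distance turned into a 0 to 1 similarity, and Jaccard similarity over word tokens. The normalization shown (one minus distance divided by the longer length) and the whitespace tokenization are assumptions for illustration; the in-database StringSimilarity function may normalize or tokenize differently. The product names are hypothetical.</p>

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def levenshtein_similarity(a, b, case_sensitive=False):
    """Scale edit distance to [0, 1]; 1.0 means identical strings."""
    if not case_sensitive:
        a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest

def jaccard_similarity(a, b):
    """Shared word tokens divided by all distinct word tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Hypothetical pair of product names
name_abt = "Sony Bravia TV"
name_buy = "sony tv"
print(levenshtein_similarity(name_abt, name_buy))
print(jaccard_similarity(name_abt, name_buy))
```

<p style = 'font-size:16px;font-family:Arial;'>Each measure reacts to a different kind of difference (typos, missing words, reordered tokens), which is why the notebook feeds many of them to the classifier rather than relying on one.</p>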

In [None]:
#query takes about 5min to execute
execute_sql("""
create table entity_match_results
as
(
    SELECT * FROM StringSimilarity (
      ON (select
              e.idAbt,
              e.idBuy,
              e.match,
              e.emb_cosine,
              e.emb_euclidean,
              e.emb_manhattan,
              cast(e.target_txt as varchar(200)) as target_txt,
              cast(e.reference_txt as varchar(200)) as reference_txt,
              cast(a2.QGrams as varchar(200)) as a2_txt, cast(b2.QGrams as varchar(200)) as b2_txt,
              cast(a3.QGrams as varchar(200)) as a3_txt, cast(b3.QGrams as varchar(200)) as b3_txt,
              cast(a4.QGrams as varchar(200)) as a4_txt, cast(b4.QGrams as varchar(200)) as b4_txt,
              cast(a5.QGrams as varchar(200)) as a5_txt, cast(b5.QGrams as varchar(200)) as b5_txt,
              cast(a6.QGrams as varchar(200)) as a6_txt, cast(b6.QGrams as varchar(200)) as b6_txt,
              cast(a7.QGrams as varchar(200)) as a7_txt, cast(b7.QGrams as varchar(200)) as b7_txt
          from 
             entity_match_results_temp e,
             About_Qgrams2 a2, Buy_Qgrams2 b2,
             About_Qgrams3 a3, Buy_Qgrams3 b3,
             About_Qgrams4 a4, Buy_Qgrams4 b4,
             About_Qgrams5 a5, Buy_Qgrams5 b5,
             About_Qgrams6 a6, Buy_Qgrams6 b6,
             About_Qgrams7 a7, Buy_Qgrams7 b7
          where 
             e.idAbt = a2.id and e.idBuy = b2.id and
             e.idAbt = a3.id and e.idBuy = b3.id and
             e.idAbt = a4.id and e.idBuy = b4.id and
             e.idAbt = a5.id and e.idBuy = b5.id and
             e.idAbt = a6.id and e.idBuy = b6.id and
             e.idAbt = a7.id and e.idBuy = b7.id
            ) PARTITION BY ANY
      USING
      ComparisonColumnPairs ('jaro (target_txt, reference_txt) AS jaro',
                             'jaro_winkler (target_txt, reference_txt) as jaro_winkler',
                             'n_gram (target_txt, reference_txt, 1) AS ngram1',
                             'n_gram (target_txt, reference_txt, 2) AS ngram2',
                             'n_gram (target_txt, reference_txt, 3) AS ngram3',
                             'n_gram (target_txt, reference_txt, 4) AS ngram4',
                             'LD (target_txt, reference_txt) AS ld',
                             'LDWS (target_txt, reference_txt) AS ldws',
                             'OSA (target_txt, reference_txt) AS osa',
                             'DL (target_txt, reference_txt) AS dl',
                             'hamming (target_txt, reference_txt) AS hamming',
                             'LCS (target_txt, reference_txt) AS lcs',
                             'jaccard (target_txt, reference_txt) AS jaccard',
                             'cosine (target_txt, reference_txt) AS term_cosine',
                             'n_gram (a2_txt, b2_txt,1) as qgrams2_sim',
                             'n_gram (a3_txt, b3_txt,1) as qgrams3_sim',
                             'n_gram (a4_txt, b4_txt,1) as qgrams4_sim',
                             'n_gram (a5_txt, b5_txt,1) as qgrams5_sim',
                             'n_gram (a6_txt, b6_txt,1) as qgrams6_sim',
                             'n_gram (a7_txt, b7_txt,1) as qgrams7_sim',
                             'soundexcode (target_txt, reference_txt) AS soundexcode'
      )
      CaseSensitive ('false')
      Accumulate ('idAbt', 'idBuy','emb_cosine','emb_euclidean','emb_manhattan','match','target_txt', 'reference_txt')
    ) AS dt 
) with data
""")

<p style = 'font-size:16px;font-family:Arial;'>Now we combine the embeddings with the additional features we have calculated to create the final entity dataset.</p>

In [None]:
sql = """
create multiset table Entities_Final
as
(
  select 
       em.match,
       em.emb_cosine,
       em.emb_euclidean,
       em.emb_manhattan,
       em.jaro,
       em.jaro_winkler,
       em.ngram1,
       em.ngram2,
       em.ngram3,
       em.ngram4,
       em.ld,
       em.ldws,
       em.osa,
       em.dl,
       em.hamming,
       em.lcs,
       em.jaccard,
       em.term_cosine,
       em.qgrams2_sim,
       em.qgrams3_sim,
       em.qgrams4_sim,
       em.qgrams5_sim,
       em.qgrams6_sim,
       em.qgrams7_sim,
       em.soundexcode,
       abt.*,
       buy.*
  from 
       entity_match_results em,
       DEMO_Entity_Resolution.Item_About_Embeddings abt,
       DEMO_Entity_Resolution.Item_Buy_Embeddings buy
  where
       em.idAbt = abt.idAbt and
       em.idBuy = buy.idBuy 
) with data
"""

execute_sql(sql)


In [None]:
data = DataFrame('Entities_Final')
data

<p style = 'font-size:16px;font-family:Arial'><b>Create train and test data</b></p><p style = 'font-size:16px;font-family:Arial'>Now that the data has been transformed and is ready for machine learning, we split the whole dataset into train and test sets for model training and scoring. We use the <b>TrainTestSplit</b> function for this task.</p>
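<p style = 'font-size:16px;font-family:Arial'>Stratifying on the label keeps the share of matches roughly the same in both sets, which matters when matches are rare. The sketch below shows the idea in plain Python on hypothetical rows; it illustrates the concept only and is not how the in-database TrainTestSplit function is implemented.</p>

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key, train_size=0.8, seed=21):
    """Split rows into train/test so each label keeps about the same proportion."""
    rng = random.Random(seed)            # fixed seed gives a repeatable split
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        cut = round(len(group) * train_size)  # rows of this label sent to train
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Hypothetical candidate pairs: 10 matches and 90 non-matches
rows = [{"pair_id": i, "match": 1 if i < 10 else 0} for i in range(100)]
train_rows, test_rows = stratified_split(rows, "match")
print(len(train_rows), sum(r["match"] for r in train_rows))
print(len(test_rows), sum(r["match"] for r in test_rows))
```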

In [None]:
TrainTestSplit_out = TrainTestSplit(
                                    data = data,
                                    id_column = "idAbt",
                                    train_size = 0.80,
                                    test_size = 0.20,
                                    seed = 21,
                                    stratify_column = "match"
)

In [None]:
# Split into 2 virtual dataframes
df_train = TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 1].drop(['TD_IsTrainRow'], axis = 1)
df_test = TrainTestSplit_out.result[TrainTestSplit_out.result['TD_IsTrainRow'] == 0].drop(['TD_IsTrainRow'], axis = 1)

<p style = 'font-size:16px;font-family:Arial'>Save the training and test datasets</p>

In [None]:
copy_to_sql(df_train,
             table_name="Entities_Train_Final",
             if_exists='replace')

In [None]:
copy_to_sql(df_test,
             table_name="Entities_Test_Final",
             if_exists='replace')

<p style = 'font-size:16px;font-family:Arial'>Let us check the positive and negative matches in each of the train and test sets.</p>

In [None]:
print("Match = 1",df_train[df_train['match'] == 1].shape)
print("Match = 0",df_train[df_train['match'] == 0].shape)

Match = 1 (891, 797)
Match = 0 (93781, 797)

In [None]:
print("Match = 1",df_test[df_test['match'] == 1].shape)
print("Match = 0",df_test[df_test['match'] == 0].shape)

Match = 1 (214, 797)
Match = 0 (23231, 797)
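<p style = 'font-size:16px;font-family:Arial'>The counts printed above show a strongly imbalanced dataset. A quick calculation from those counts, sketched below, confirms that matches make up a little under 1% of the rows in both the train and the test set, so the stratified split preserved the class ratio. The dictionary and function names are only for this illustration.</p>

```python
# Row counts copied from the outputs printed above
train_counts = {"match": 891, "non_match": 93781}
test_counts = {"match": 214, "non_match": 23231}

def positive_rate(counts):
    """Share of rows that are true matches."""
    return counts["match"] / (counts["match"] + counts["non_match"])

print(f"train positive rate: {positive_rate(train_counts):.4f}")
print(f"test positive rate:  {positive_rate(test_counts):.4f}")
```

<p style = 'font-size:16px;font-family:Arial'>With so few positives, plain accuracy is not very informative, and a probability cut-off well below 0.5 can be a reasonable choice for the classifier.</p>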

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>5. Create and load H2O Classification Model</b>
<p style = 'font-size:16px;font-family:Arial;'>
</p>

<div class="alert alert-block alert-info">
    <p style = 'font-size:16px;font-family:Arial'><i><b>Note:</b> The H2O model creation is explained in a separate notebook named <b>"Entity_Resolution_Classification_Model_Training.ipynb"</b>.<br> Because of the limited size of this JupyterLab environment and the amount of data being processed, training the model here would not complete and could make JupyterLab unresponsive. This notebook therefore uses a pre-trained model.</i></p>
</div>

In [None]:
current_path = os.path.abspath(os.getcwd())
model_path = os.path.join(current_path, "artifacts/XGBoost_1_AutoML_1_pretrained.zip")

In [None]:
model_path

<p style = 'font-size:16px;font-family:Arial;'>If you have created your own model, please update the model_path accordingly.</p>

In [None]:
try:
    db_drop_table("h2o_models")
except:
    True

save_byom(model_id="automl_model", model_file=model_path, table_name="h2o_models")

<p style = 'font-size:16px;font-family:Arial;'><b>BYOM Vantage Scoring on Test Data</b><br>
    Let us take a look at our test dataset.</p>

In [None]:
try:
    db_drop_table("entity_scoring_table")
except:
    True
  

execute_sql("""
create table entity_scoring_table
as (
  select ROW_NUMBER() OVER (ORDER BY x.idAbt, x.idBuy) AS row_id,
         x.*
  from       
       Entities_Test_Final x
) with data
""")

In [None]:
df_scoring_input = DataFrame("entity_scoring_table")
df_scoring_input

In [None]:
df_scoring_input.shape

In [None]:
modeldata = retrieve_byom("automl_model", table_name="h2o_models")

configure.byom_install_location = "MLDB"

In [None]:
#taking only 1000 records to save time, running on whole test dataset takes approx 30min
df_scoring =  df_scoring_input[df_scoring_input['row_id'] <= 1000]

In [None]:
result = H2OPredict(newdata=df_scoring,
                    newdata_partition_column='row_id',
                    newdata_order_column='row_id',
                    modeldata=modeldata,
                    modeldata_order_column='model_id',
                    model_output_fields=['classProbabilities'],
                    #model_output_fields=['prob_0','prob_1'],
                    accumulate=['row_id','idAbt','idBuy'],
                    overwrite_cached_models='*',
                    enable_options=['contributions','stageProbabilities'],
                    model_type='OpenSource'
                    )

In [None]:
df_predict = result.result
df_predict.to_sql(table_name='entity_classification_predictions_temp', if_exists='replace')

In [None]:
try:
    db_drop_table("entity_classification_predictions")
except:
    True

execute_sql("""
create table entity_classification_predictions
as(
WITH ranked_rows AS(
   select 
      row_id,
      idAbt,
      idBuy,
      CAST(NEW JSON(classprobabilities).JSONExtractValue('$.0') as FLOAT) AS p0,
      CAST(NEW JSON(classprobabilities).JSONExtractValue('$.1') as FLOAT) AS p1,
      ROW_NUMBER() OVER (PARTITION BY idAbt ORDER BY CAST(NEW JSON(classprobabilities).JSONExtractValue('$.1') as FLOAT) desc) as rnk
   from
      entity_classification_predictions_temp
) 
SELECT * FROM ranked_rows where rnk = 1
) with data
"""); 

In [None]:
try:
    db_drop_table("entity_classification_similarity")
except:
    True
  

execute_sql("""
create table entity_classification_similarity
as (
  select 
         x.idAbt,
         x.idBuy,
         y.p1, 
         -- Cut off Threshold 0.05058710706080187
         CASE WHEN y.p1 >= 0.05058710706080187 THEN 1 else 0 END as prediction,
         x.match,
         x.target_txt,
         x.reference_txt
  from
       entity_match_results x 
       inner join
       entity_classification_predictions y 
           on (x.idAbt = y.idAbt and x.idBuy = y.idBuy)
) with data
""")

In [None]:
final_df = DataFrame("entity_classification_similarity")
final_df
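<p style = 'font-size:16px;font-family:Arial;'>The post-processing done in SQL above (extract the class-1 probability from the JSON output, keep the best-scoring Item_Buy candidate per Item_About row, then apply the cut-off) can be sketched in plain Python as follows. The JSON layout with keys "0" and "1" is assumed from the JSONExtractValue paths used above, and the sample rows are hypothetical.</p>

```python
import json

THRESHOLD = 0.05058710706080187  # cut-off used in the SQL above

def best_candidates(predictions, threshold=THRESHOLD):
    """Keep the highest-p1 candidate per idAbt and flag it as match or non-match."""
    best = {}
    for row in predictions:
        p1 = float(json.loads(row["classprobabilities"])["1"])
        current = best.get(row["idAbt"])
        if current is None or p1 > current["p1"]:
            best[row["idAbt"]] = {"idAbt": row["idAbt"], "idBuy": row["idBuy"], "p1": p1}
    for row in best.values():
        row["prediction"] = 1 if row["p1"] >= threshold else 0
    return sorted(best.values(), key=lambda r: r["idAbt"])

# Hypothetical scored pairs
predictions = [
    {"idAbt": 1, "idBuy": 10, "classprobabilities": '{"0": 0.10, "1": 0.90}'},
    {"idAbt": 1, "idBuy": 11, "classprobabilities": '{"0": 0.98, "1": 0.02}'},
    {"idAbt": 2, "idBuy": 20, "classprobabilities": '{"0": 0.99, "1": 0.01}'},
]
ranked = best_candidates(predictions)
for row in ranked:
    print(row)
```

<p style = 'font-size:16px;font-family:Arial;'>Keeping only the top-ranked candidate enforces at most one predicted match per Item_About row; the threshold then decides whether even that best candidate is confident enough to count as a match.</p>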

In [None]:
import pandas as pd  # used below to label the confusion matrix
from sklearn.metrics import confusion_matrix
final_df = final_df.to_pandas()
final_df['match'] = final_df['match'].astype('str')
final_df['prediction'] = final_df['prediction'].astype('str')

actual = final_df['match']
predicted = final_df['prediction']

In [None]:
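# Fix the label order so the matrix is always 2x2 (negatives first), even if one class is absent
cm = confusion_matrix(actual, predicted, labels=['0', '1'])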

In [None]:
cm_df = pd.DataFrame(cm, index=['Actual Negative', 'Actual Positive'], columns=['Predicted Negative', 'Predicted Positive'])
print(cm_df)
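<p style = 'font-size:16px;font-family:Arial;'>Because matches are rare, precision, recall, and F1 for the positive class are usually more telling than overall accuracy. The sketch below derives them from a 2x2 confusion matrix laid out as above (rows are actual, columns are predicted, negatives first). The counts in example_cm are hypothetical and are not results from this notebook.</p>

```python
def classification_metrics(cm):
    """Compute positive-class metrics from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts, for illustration only
example_cm = [[90, 5], [2, 3]]
metrics = classification_metrics(example_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

<p style = 'font-size:16px;font-family:Arial;'>To apply this to the matrix computed above, pass cm.tolist() to the function.</p>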

<hr style="height:2px;border:none;">
<p style = 'font-size:20px;font-family:Arial;'><b>Conclusion</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Entity Resolution powered by <b>Teradata ClearScape Analytics</b> enables organizations to unify and standardize their data with exceptional accuracy, scalability, and efficiency. By combining advanced string similarity techniques, semantic vector embeddings, and in-database machine learning, it effectively reduces duplicates, enhances data quality, and improves trust across multiple systems. This integrated approach builds a strong foundation for analytics, AI, and data governance initiatives, while accelerating insights and reducing manual reconciliation efforts.<br><b>Key Takeaways</b><ul  style = 'font-size:16px;font-family:Arial;'>  
    <li><b>TD_VectorDistance</b> bi-encoder distance measures (cosine, Euclidean, and Manhattan) enable fast selection of non-match examples for training.</li>
    <li>Maintaining a <b>ground truth “Match Table”</b> is essential; even LLMs can assist in generating this labeled data.</li>
    <li><b>Embeddings generated via BYOM in Vantage</b>, combined with ClearScape StringSimilarity functions, provide high-speed, scalable matching.</li>
    <li>A <b>cross-encoder binary classification model</b> helps evaluate performance through measurable metrics on holdout sets.</li>
    <li>Models can be <b>deployed and used for inference directly in Vantage</b> or exposed through <b>ModelOps endpoints</b> for real-time scoring.</li>
    <li>While this example demonstrates <b>H2O AutoML</b>, it can easily be adapted for <b>Scikit-learn, Vantage AutoML, or ClearScape Vantage GLM training</b>, allowing flexibility in production environments.</li>
    <li>Achieved a <b>5–15% accuracy improvement</b> over legacy methods.</li>
    <li><b>Multilingual vector embeddings</b> enable matching across languages, enhancing global applicability.</li>
    <li>Using a <b>classical AI/ML model head</b> ensures high-speed inference, explainability, and flexibility to experiment with newer embedding models like ModernBERT for continuous improvement.</li>
</ul>
<p  style = 'font-size:16px;font-family:Arial;'>Together, these capabilities position Teradata ClearScape Analytics as a comprehensive, production-ready platform for accurate, explainable, and scalable Entity Resolution.</p>

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>6. Cleanup</b>

<p style = 'font-size:18px;font-family:Arial;'><b>Work Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;'>Clean up the work tables to prevent errors the next time the notebook is run.</p>

In [None]:
tables = ['embeddings_models','embeddings_tokenizers','entity_seq','entity_match_results','Entities_Final','Entities_Train_Final','Entities_Test_Final','h2o_models','entity_scoring_table','entity_classification_predictions_temp','entity_classification_predictions','entity_classification_similarity']

# Loop through the list of tables and execute the drop table command for each table
for table in tables:
    try:
        db_drop_table(table_name=table)
    except:
        pass

In [None]:
for i in range(2,8):
    for x in ['About', 'Buy']:
        try:
            db_drop_table(table_name=f"{x}_QGrams{i}")
        except:
            pass


<hr style="height:2px;border:none;">
<p style = 'font-size:18px;font-family:Arial;'> <b>Databases and Tables </b></p>
<p style = 'font-size:16px;font-family:Arial;'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_Entity_Resolution');"        # Takes 10 seconds

In [None]:
remove_context()

<footer style="padding-bottom:35px; border-bottom:3px solid">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2025. All Rights Reserved
        </div>
    </div>
</footer>