<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Entity Resolution with Teradata In-Database Embeddings and Analytics: Model Creation
 <br>       
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 150px; height: auto; margin-top: 20pt;">
  <br>
    </p>
</header>

<p style = 'font-size:20px;font-family:Arial'><b>Introduction:</b></p>
<ul style = 'font-size:16px;font-family:Arial'>
<li>We build a binary classification model on 80% of the feature-engineered data using open-source H2O AutoML, with "match" as the target column. The data is imbalanced 1:100, so we enable class-balance sampling during training. The training dataset <b>Entities_Train_Final</b> is created in the notebook <b>Entity_Resolution_Python.ipynb</b>. Please run that notebook first to create the dataset.</li>
    <li>Allow about an hour for this notebook to run, as it trains and evaluates many models, runs grid searches, and builds a final stacked ensemble.</li>
    <li>The final step is to predict on the hold-out set and check the confusion matrix (CM) for overfitting.</li>
    <li>We also choose the threshold cutoff by looking at the CM, so that we can use it for match/no-match decisions at inference time.</li>
    <li>The final H2O model is saved using save_byom() for inferencing on Vantage later (in notebook Entity_Resolution_Python.ipynb).</li>
</ul>

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>1. Initiate a connection to Vantage</b>
<p style = 'font-size:16px;font-family:Arial;'>Let's start by importing the required libraries and connecting to the Vantage database. You will be prompted for the password. Enter your password, press the Enter key, and then use the down arrow to go to the next cell.</p>

In [None]:
import warnings
warnings.filterwarnings('ignore')

import teradataml as tdml
from teradataml import *

import getpass
import time

import os
import logging
import sys
from jdk4py import JAVA, JAVA_HOME, JAVA_VERSION

import plotly.express as px
import plotly.figure_factory as ff
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import classification_report

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Entity_Resolution_Classification_Model_Creation.ipynb;' UPDATE FOR SESSION; ''')

In [None]:
os.environ['JAVA_HOME'] = '/home/jovyan/.jdk/jdk-17.0.9+9'

In [None]:
os.environ['PATH'] = os.environ['PATH'] + os.pathsep + str(JAVA_HOME)
os.environ['PATH'] = os.environ['PATH'] + os.pathsep + str(JAVA)[:-5]

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>2. Build a Model using H2O AUTOML.</b>
<p style = 'font-size:16px;font-family:Arial;'>Train a model, score it on the test data, and print metrics.</p>

In [None]:
import h2o
from h2o.automl import H2OAutoML
from h2o.estimators import H2OGradientBoostingEstimator, H2OGeneralizedLinearEstimator
from h2o.grid.grid_search import H2OGridSearch
from h2o.frame import H2OFrame

In [None]:
# Shut down any H2O cluster left over from a previous run, then start a fresh one
try:
    h2o.cluster().shutdown()
except Exception:
    pass  # no running cluster to shut down

time.sleep(5)
h2o.init(nthreads=10, verbose=True)


<div class="alert alert-block alert-info">
    <p style = 'font-size:16px;font-family:Arial'><i><b>Note:</b> The train and test datasets are created in the notebook <b>Entity_Resolution_Python.ipynb</b>.<br> Please run that notebook through section 5 to create the datasets.</i></p>
</div>

In [None]:
train_df = DataFrame("Entities_Train_Final")
test_df = DataFrame("Entities_Test_Final")

train_df = train_df.drop(['idAbt', 'idBuy'], axis = 1)
test_df = test_df.drop(['idAbt', 'idBuy'], axis = 1)

In [None]:
train_df

In [None]:
test_df

In [None]:
train_data = h2o.H2OFrame(train_df.to_pandas())
test_data = h2o.H2OFrame(test_df.to_pandas())

In [None]:
response = 'match'
predictors = train_data.columns
predictors.remove(response)

In [None]:
train_data[response] = train_data[response].asfactor()
test_data[response] = test_data[response].asfactor()

In [None]:
aml = H2OAutoML(max_runtime_secs=3600, 
                nfolds=4, 
                project_name="automl_project",
                max_models=10,
                verbosity = 'info',
                balance_classes=True,
                max_after_balance_size = 5.0,
                stopping_metric="auc",
                sort_metric="auc")

aml.train(x=predictors, y=response, training_frame=train_data,)
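
<p style = 'font-size:16px;font-family:Arial;'>As a rough intuition for what class balancing does with a 1:100 imbalance, the sketch below randomly oversamples the minority class on synthetic labels using NumPy. It is not H2O's internal implementation: the function name <code>oversample_minority</code> and the <code>max_growth</code> cap (loosely modeled on <code>max_after_balance_size</code>) are assumptions made for illustration only.</p>

```python
import numpy as np

def oversample_minority(labels, max_growth=5.0, seed=0):
    """Return row indices that balance a binary label array by random oversampling.

    max_growth caps the result at max_growth * len(labels) rows (illustrative only).
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    idx_pos = np.flatnonzero(labels == 1)
    idx_neg = np.flatnonzero(labels == 0)
    if len(idx_pos) <= len(idx_neg):
        minority, majority = idx_pos, idx_neg
    else:
        minority, majority = idx_neg, idx_pos
    # Aim for as many minority rows as majority rows, subject to the size cap
    budget = int(max_growth * len(labels)) - len(majority)
    target = max(len(minority), min(len(majority), budget))
    resampled = rng.choice(minority, size=target, replace=True)
    return np.concatenate([majority, resampled])

# Synthetic 1:100 imbalanced labels (not the notebook's data)
y = np.array([1] * 10 + [0] * 1000)
balanced_idx = oversample_minority(y)
print(np.bincount(y[balanced_idx]))
```

<p style = 'font-size:16px;font-family:Arial;'>Oversampling repeats minority rows rather than creating new information, which is one reason to check the hold-out confusion matrix for overfitting afterwards.</p>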

In [None]:
print(aml.leaderboard)

In [None]:
leaderboard_df = aml.leaderboard.as_data_frame()

# Filter out the stacked ensemble models
non_ensemble_models = leaderboard_df[~leaderboard_df['model_id'].str.contains("StackedEnsemble")]

# Get the best non-ensemble model
best_non_ensemble_model_id = non_ensemble_models.iloc[0]['model_id']
best_non_ensemble_model = h2o.get_model(best_non_ensemble_model_id)

# Get feature importance
feature_importance = best_non_ensemble_model.varimp(use_pandas=True)

print(f"Feature importance for the best non-ensemble model ({best_non_ensemble_model_id}):")
print(feature_importance)

In [None]:
best_model = aml.leader

# Make predictions on the test data
predictions = best_model.predict(test_data)

# Evaluate the model performance
performance = best_model.model_performance(test_data)

# Print the confusion matrix
print("Confusion Matrix:")
print(performance.confusion_matrix())

In [None]:
thresholds = [0.05, 0.06, 0.1]
performance.confusion_matrix(thresholds = thresholds)
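
<p style = 'font-size:16px;font-family:Arial;'>To make the threshold comparison concrete, here is a minimal, library-independent sketch that recomputes the confusion matrix, precision, and recall at several cutoffs using scikit-learn. The labels and scores are hand-made placeholders, not output from the model above, and the helper name <code>sweep_thresholds</code> is hypothetical.</p>

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def sweep_thresholds(y_true, p_match, thresholds):
    """Return (threshold, tn, fp, fn, tp, precision, recall) for each cutoff."""
    rows = []
    scores = np.asarray(p_match)
    for t in thresholds:
        # Predict "match" when the score reaches the cutoff
        y_pred = (scores >= t).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        rows.append((t, int(tn), int(fp), int(fn), int(tp),
                     float(precision), float(recall)))
    return rows

# Hypothetical labels and match probabilities, for illustration only
y_true = [0, 0, 0, 0, 1, 1]
p_match = [0.01, 0.04, 0.07, 0.20, 0.06, 0.90]
for row in sweep_thresholds(y_true, p_match, [0.05, 0.06, 0.1]):
    print(row)
```

<p style = 'font-size:16px;font-family:Arial;'>Lowering the cutoff typically trades precision for recall, so the choice depends on whether missed matches or false matches are more costly.</p>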

In [None]:
explanations = h2o.explain(aml.leader, 
                           test_data,
                           figsize=(8,3),
                           columns=['emb_euclidean','emb_cosine','emb_manhattan',\
                                    'jaro','jaro_winkler','ngram1','ngram2','ngram3','ngram4',\
                                    'ld','ldws','osa','dl','hamming','lcs','jaccard','term_cosine',\
                                    'qgrams2_sim','qgrams3_sim','qgrams4_sim','qgrams5_sim',\
                                    'qgrams6_sim','qgrams7_sim','soundexcode'])


In [None]:
# Print other evaluation metrics
# The threshold was picked from the confusion matrix report above
threshold = 0.05058710706080187
print("Accuracy:", performance.accuracy(thresholds=[threshold]))
print("AUC:", performance.auc())
print("F1 Score:", performance.F1())  # no threshold given: H2O uses the max-F1 threshold
print("Precision:", performance.precision(thresholds=[threshold]))
print("Recall:", performance.recall(thresholds=[threshold]))

In [None]:
model_path = h2o.save_model(model=best_model, path="artifacts/", force=True)

In [None]:
print(model_path)

In [None]:
artifacts_path = "artifacts/"
mojo = best_model.download_mojo(path=artifacts_path, get_genmodel_jar=False)

In [None]:
os.listdir(artifacts_path)

<p style = 'font-size:16px;font-family:Arial;'>Here we can see that the model is saved as a zip file. We can use this model in the notebook Entity_Resolution_Python.ipynb in place of the pretrained model provided.</p>
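
<p style = 'font-size:16px;font-family:Arial;'>Before handing the file to another notebook, it can be worth confirming that the export is a readable zip archive. The sketch below uses only the Python standard library; the helper name <code>find_zip_archives</code> is hypothetical, and the demonstration runs on a temporary directory with a stand-in archive rather than a real MOJO.</p>

```python
import os
import tempfile
import zipfile

def find_zip_archives(directory):
    """Return sorted names of files in `directory` that are valid zip archives."""
    return sorted(
        name for name in os.listdir(directory)
        if zipfile.is_zipfile(os.path.join(directory, name))
    )

# Demonstration on a temporary directory with a stand-in archive (not a real MOJO)
with tempfile.TemporaryDirectory() as tmp:
    with zipfile.ZipFile(os.path.join(tmp, "example_model.zip"), "w") as zf:
        zf.writestr("placeholder.txt", "stand-in content\n")
    with open(os.path.join(tmp, "notes.txt"), "w") as fh:
        fh.write("not an archive")
    found = find_zip_archives(tmp)
    print(found)
```

<p style = 'font-size:16px;font-family:Arial;'>In this notebook the same helper could be pointed at <code>artifacts_path</code> once the export cell has run.</p>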

<hr style="height:2px;border:none;">
<b style = 'font-size:20px;font-family:Arial;'>3. Cleanup</b>

In [None]:
remove_context()

In [None]:
h2o.cluster().shutdown()

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2025. All Rights Reserved
        </div>
    </div>
</footer>