<header>
   <p  style='font-size:36px;font-family:Arial;color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Streamlining Analytics with Hyper-Segmented Models in Teradata Vantage
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Introduction</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Hyper-segmented models are advanced analytical frameworks that divide data into extremely detailed and specific segments, enabling highly customized insights and predictions. The need for hyper-segmented models arises from the growing demand for personalization and precision in decision-making across industries, particularly in customer-centric sectors like retail, finance, and healthcare. Hyper-segmentation allows businesses to extract more granular insights, enabling tailored marketing, personalized product recommendations, and optimized resource allocation. By addressing unique customer behaviors and preferences within specific micro-segments, companies can enhance customer satisfaction, boost operational efficiency, and improve profitability. The ability to deploy and manage hyper-segmented models at scale has become essential for staying competitive in fast-evolving markets.</p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Why Vantage?</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Teradata's Vantage platform offers a streamlined solution by integrating hyper-segmented model management with its parallel processing capabilities. Traditional features like group-by aggregation lay the groundwork for this approach, which is further enhanced by advanced SQL functions, the Unbounded Array Framework (UAF), and Python integration. These capabilities enable businesses to train hundreds of models in parallel with a single Python command, simplifying processes that previously required specialized skills, manual data handling, and extensive coding.<br>
In practical terms, this solution allows organizations to deploy and score multiple models across segmented datasets efficiently. By leveraging Teradata’s Script Table Operator (STO) and external Python libraries, users can handle large-scale model training and scoring, while the interface ensures traceability, security, and flexibility in adjusting hyperparameters without redeployment. Furthermore, data scientists benefit from streamlined workflows, enabling faster iteration and enhanced control over model versioning, lineage, and parameter tracking.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This is a functional demo that shows how easy it is to <b>deploy and run a Python scikit-learn pipeline in Vantage</b>. The dataset used is a generated sample dataset.</p>


<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>1. Connect to Vantage.</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this section, we import the required libraries and set environment variables and environment paths (if required).</p>

In [None]:
%%capture
!pip install --upgrade tdstone2

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: </b><i>The above install statement is only needed if you run the notebook on a platform other than ClearScape Analytics Experience that does not have the library installed. If you run the install, be sure to restart the kernel afterward to bring the installed library into memory. The simplest way to restart the kernel is by typing zero zero: <b> 0 0</b></i></p>
</div>

In [None]:
import json
import getpass
import warnings
import pandas as pd
import datetime
from teradataml import *

from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM
from sklearn.pipeline import Pipeline

from tdstone2.tdshypermodel import HyperModel
from tdstone2.tdstone import TDStone
from tdstone2.utils import cleanup

configure.byom_install_location = "mldb"

display.max_rows = 5
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell. Begin running steps with Shift + Enter keys.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=Hypersegmented_Model.ipynb;' UPDATE FOR SESSION; ''')

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>2. Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. We have the option of either running the demo using foreign tables to access the data without using any storage on our environment or downloading the data to local storage, which may yield somewhat faster execution. However, we need to consider available storage. There are two statements in the following cell, and one is commented out. We may switch which mode we choose by changing the comment string.</p>   


In [None]:
%run -i ../run_procedure.py "call get_data('DEMO_HyperModel_cloud');"
 # Takes about 20 seconds
#%run -i ../run_procedure.py "call get_data('DEMO_HyperModel_local');"
 # Takes about 50 secs

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Optional step – We should execute the below step only if we want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>3. Set Up the Framework</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
External Python scripts (.py files) can run within Vantage using the Script Table Operator (STO). Setting up Vantage to run a Python script in-database involves a few steps, and the Python package tdstone2 streamlines them. In the steps below, we configure and prepare our TDStone instance. Cleanup is a precautionary step that removes any existing objects in the specified schema to ensure a fresh environment. We then create our instance and call its setup method. The last step checks that the setup works correctly.</p>

In [None]:
cleanup(schema='demo_user')            

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The code below creates a TDStone instance in the specified schema and sets the database search path that Vantage will use for SCRIPT execution.</p>

In [None]:
sto = TDStone(schema_name = 'demo_user', SEARCHUIFDBPATH = 'demo_user')
sto.setup()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This optional step checks that the paths are set up correctly by calling the PushFile method with no file.</p>

In [None]:
sto.PushFile()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
Now that the framework is ready, let us look at the dataset and then create the pipeline that we will deploy.</p>

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. The Hyper-Segmented Dataset</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let us take a look at the sample dataset we are using.</p>

In [None]:
dataset = DataFrame(in_schema('DEMO_HyperModel', 'Dataset'))
dataset

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let us group the dataset by the 'Partition_ID' column and count the number of rows in each partition.</p>

In [None]:
summary = dataset.groupby('Partition_ID').count()
summary

In [None]:
summary.shape

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As we can see above, the data is spread across 4 partitions, each with 1000 records.</p>

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>5. Hyper-segmented model deployment</b></p>


<hr style="height:1px;border:none;background-color:#00233C;">

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.1 Engineering of the scikit-learn anomaly detection pipeline</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For our scikit-learn pipeline, we will use <code>StandardScaler</code> for feature scaling and <code>OneClassSVM</code> for anomaly detection.</p>

In [None]:
steps_anomaly_detection = [
    ('scaler', StandardScaler()),
    ('one_class_svm', OneClassSVM(
        kernel='rbf',  # Radial Basis Function Kernel
        nu=0.05,       # An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors.
        gamma='auto'   # Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. If ‘auto’, 1/n_features will be used.
    ))
]
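
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To get a feel for what these two steps do before deploying them, the following minimal sketch fits the same pipeline locally on synthetic data. The random data, the three-feature shape, and the two test points are assumptions made only for illustration; they are not part of the demo dataset.</p>

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Synthetic stand-in for one partition: 200 "normal" rows with 3 features.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 3))

# The same two steps as above, assembled into a scikit-learn Pipeline.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('one_class_svm', OneClassSVM(kernel='rbf', nu=0.05, gamma='auto')),
])
pipeline.fit(X_train)

# predict() returns +1 for inliers and -1 for outliers;
# decision_function() returns a signed score (negative means outlier).
X_new = np.array([
    [0.0, 0.0, 0.0],        # close to the training cloud
    [100.0, 100.0, 100.0],  # far away from the training cloud
])
labels = pipeline.predict(X_new)
scores = pipeline.decision_function(X_new)
print(labels)
print(scores)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The point far from the training data receives the label -1 and a negative score, which is the behavior the deployed models rely on to flag anomalies.</p>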

<hr style="height:1px;border:none;background-color:#00233C;">

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.2 Deployment of the scikit-learn pipeline</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
We define a dictionary named <b>model_parameters</b> that specifies the categorical columns and the feature columns to be used in the model.</p>

In [None]:
model_parameters = {
    "column_categorical": ['Flag'],
    "column_names_X": ['X1','X2','X3','X4','X5','X6','X7','X8','X9','Flag']
}

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
    The code below creates an instance of <b>HyperModel</b> from the TDStone instance, metadata, the pipeline steps, the model parameters, the dataset, and the column names that identify rows, partitions, and folds. It also specifies which fold is used for training and that each model should be converted to ONNX format.</p>

In [None]:
%%time
model = HyperModel(tdstone            = sto,
                   metadata           = {'project': 'test'},
                   skl_pipeline_steps = steps_anomaly_detection,
                   model_parameters   = model_parameters,
                   dataset            = in_schema('DEMO_HyperModel', 'Dataset'),
                   id_row             = 'ID',
                   id_partition       = 'Partition_ID',
                   id_fold            = 'FOLD',
                   fold_training      = 'train',
                   convert_to_onnx    = True
                  )
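
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Conceptually, the arguments <b>id_partition</b>, <b>id_fold</b>, and <b>fold_training</b> describe a "one model per partition, trained on the training fold" scheme. The sketch below reproduces that idea locally with pandas and scikit-learn on made-up data. It is only an analogy under stated assumptions (two partitions, two features, a hypothetical <b>make_pipeline</b> helper); it is not how tdstone2 runs the training inside Vantage.</p>

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Hypothetical in-memory data: 2 partitions, each with train and test rows.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    'ID': np.arange(n),
    'Partition_ID': np.repeat([1, 2], n // 2),
    'FOLD': np.tile(['train', 'train', 'train', 'test'], n // 4),
    'X1': rng.normal(size=n),
    'X2': rng.normal(size=n),
})
features = ['X1', 'X2']

def make_pipeline():
    # Build a fresh, unfitted pipeline so partitions never share state.
    return Pipeline([
        ('scaler', StandardScaler()),
        ('one_class_svm', OneClassSVM(kernel='rbf', nu=0.05, gamma='auto')),
    ])

# Train one model per partition, using only the rows of the training fold.
models = {}
for partition_id, part in df.groupby('Partition_ID'):
    train_rows = part[part['FOLD'] == 'train']
    models[partition_id] = make_pipeline().fit(train_rows[features])

# Score every row with the model that belongs to its own partition.
scored = []
for partition_id, part in df.groupby('Partition_ID'):
    out = part[['ID', 'Partition_ID']].copy()
    out['prediction'] = models[partition_id].predict(part[features])
    scored.append(out)
predictions = pd.concat(scored, ignore_index=True)
print(predictions.groupby('Partition_ID')['prediction'].value_counts())
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The local loop processes partitions one after another; the appeal of running this in Vantage is that the partitions can be processed in parallel in the database.</p>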

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
Let us list all the hyper models available in the TDStone instance.</p>

In [None]:
sto.list_hyper_models()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
The code cell below retrieves the ID of the most recently created hyper model by sorting the list of hyper models by creation date in descending order and taking the first row.</p>

In [None]:
id = sto.list_hyper_models()[['ID','CREATION_DATE']].to_pandas().reset_index().sort_values('CREATION_DATE', ascending=False).iloc[0,0]
id
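
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The same "latest entry" lookup can be written so that it reads the ID by column name rather than by position. The sketch below shows this on a small pandas table; the IDs and dates are made up, and the table is only a hypothetical stand-in for the output of <b>list_hyper_models()</b> after conversion to pandas.</p>

```python
import pandas as pd

# Hypothetical stand-in for the hyper model list after to_pandas().
hyper_models = pd.DataFrame({
    'ID': ['a1', 'b2', 'c3'],
    'CREATION_DATE': pd.to_datetime([
        '2024-01-10 09:00:00',
        '2024-03-05 14:30:00',
        '2024-02-20 08:15:00',
    ]),
})

# Locate the row with the latest creation date and read its ID by name,
# so the result does not depend on the order of the columns.
latest_id = hyper_models.loc[hyper_models['CREATION_DATE'].idxmax(), 'ID']
print(latest_id)  # b2
```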

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
Let us download the model using the ID retrieved in the previous cell.</p>

In [None]:
existing_model = HyperModel(tdstone=sto)
existing_model.download(id=id)

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
The code below retrieves the code and arguments of the downloaded model, together with its data (with_data=True), and measures the time taken for this operation.</p>

In [None]:
%%time
code_and_data = existing_model.retrieve_code_and_data(with_data=True)

<hr style="height:1px;border:none;background-color:#00233C;">

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>5.3 Local Execution for validation/debugging</b></p>


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
We downloaded the model in the step above. The code cell below executes the downloaded model code, which defines the MyModel class used in the next cell.</p>

In [None]:
exec(code_and_data['code'])

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
The code cell below creates an instance of MyModel using the arguments retrieved from the downloaded model's code and data.</p>

In [None]:
local_model = MyModel(**code_and_data['arguments'])

<p style = 'font-size:16px;font-family:Arial;color:#00233C'> Convert the 'Flag' column in the local data to a categorical type.</p>

In [None]:
df_local = code_and_data['data']
df_local['Flag'] = df_local['Flag'].astype('category')
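
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a small illustration of what this conversion changes, the sketch below applies the same <b>astype('category')</b> call to a made-up three-row table. The column values are assumptions for illustration only.</p>

```python
import pandas as pd

# Made-up example table with a text column named like the demo's 'Flag'.
df_example = pd.DataFrame({'X1': [0.1, 0.5, 0.9], 'Flag': ['A', 'B', 'A']})

# Convert the column to the pandas categorical dtype.
df_example['Flag'] = df_example['Flag'].astype('category')

# A categorical column stores the distinct values once (the categories)
# and represents each row by an integer code pointing into them.
categories = list(df_example['Flag'].cat.categories)
codes = df_example['Flag'].cat.codes.tolist()
print(df_example.dtypes)
print(categories, codes)
```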

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
Let us train the downloaded model and measure the time taken for this operation.</p>

In [None]:
%%time
local_model.fit(code_and_data['data'][code_and_data['data']['FOLD'] == 'train'])

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
The code cell below scores the local model on the entire dataset and measures the time taken for this operation.</p>

In [None]:
%%time
local_model.score(code_and_data['data'])

In [None]:
local_model.get_description()

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>6. Execution of the deployed HyperModel</b></p> 

<hr style="height:1px;border:none;background-color:#00233C;">

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>6.1 Models Training</b></p>


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
The code below trains the hyper model and measures the time taken for this operation.</p>

In [None]:
%%time
model.train()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
The code below retrieves the trained models from the model object, groups them by TD_TIMECODE and MODEL_TYPE, counts the number of models in each group, and sorts the results in descending order of TD_TIMECODE and MODEL_TYPE.
</p>



In [None]:
model.get_trained_models().groupby(['TD_TIMECODE','MODEL_TYPE']).count().sort(['TD_TIMECODE','MODEL_TYPE'],ascending=False)

<hr style="height:1px;border:none;background-color:#00233C;">

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>6.2 Model Scoring</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
Let us score the model and measure the time taken. </p>


In [None]:
%%time
model.score()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
The code below retrieves the model predictions from the model object, groups them by TD_TIMECODE, and counts the number of predictions in each group.
</p>

In [None]:
model.get_model_predictions().groupby('TD_TIMECODE').count()

In [None]:
model.get_model_predictions()

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>7. Model Lineage</b></p>

<hr style="height:1px;border:none;background-color:#00233C;">

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>7.1 Access to the list of deployed codes</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
The code below lists all the codes available in the sto object.
</p>

In [None]:
sto.list_codes()

<hr style="height:1px;border:none;background-color:#00233C;">

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>7.2 List of deployed models (code + parameters)</b></p>


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
The code below lists all the models available in the sto object.
</p>


In [None]:
sto.list_models()

<hr style="height:1px;border:none;background-color:#00233C;">

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>7.3 List of available mappers (mapping between partitions and models or trained models)</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
The code below lists all the mappers available in the sto object.
</p>

In [None]:
sto.list_mappers()

<hr style="height:1px;border:none;background-color:#00233C;">

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>7.4 List of hyper models (mapping of models and mappers)</b></p>
 

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
The code below lists all the hyper models available in the sto object.
</p>


In [None]:
sto.list_hyper_models()

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>8. Use BYOM with ONNX models</b></p>


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
The code below retrieves the BYOM (Bring Your Own Model) catalog from the model object and stores it in the onnx_catalog variable.</p>

In [None]:
onnx_catalog = model.get_byom_catalog()
onnx_catalog

In [None]:
onnx_catalog_local = onnx_catalog[['model_id','Partition_ID']].to_pandas(num_rows=1)
onnx_catalog_local

In [None]:
Partition_ID     = onnx_catalog_local.Partition_ID.values[0]
model_id         = onnx_catalog_local.model_id.values[0]

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>
Let us create an instance of ONNXPredict to generate predictions using the ONNX model. </p>

In [None]:
predictions_ = ONNXPredict(
    modeldata               = onnx_catalog[onnx_catalog.Partition_ID == Partition_ID],
    newdata                 = dataset[dataset.Partition_ID == Partition_ID],
    accumulate              = ['Partition_ID', 'ID'],
    overwrite_cached_models = '*',
)

In [None]:
predictions_.result

In [None]:
# Extract the scores field from the JSON report
query = f"""
SELECT Partition_ID, ID, CAST("json_report" as JSON).scores[0][0] as score 
from {predictions_.result._table_name}
"""
print(query)
DataFrame.from_query(query)
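
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The path <b>scores[0][0]</b> in the query reads the first element of a nested list in the JSON report. The sketch below performs the same extraction in plain Python on a hypothetical report string; the key names other than <b>scores</b> and all values are assumptions, since the actual payload depends on the outputs of the ONNX model.</p>

```python
import json

# Hypothetical json_report value for a single scored row.
json_report = '{"label": [[-1]], "scores": [[-0.1375]]}'

report = json.loads(json_report)

# Same path as .scores[0][0] in the SQL query: first row, first value.
score = report['scores'][0][0]
print(score)  # -0.1375
```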

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Conclusion</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this demo, we have seen the integration of Teradata's machine learning capabilities with <code>scikit-learn</code>, the use of ONNX for model deployment, and the comprehensive management of models and predictions within the Teradata environment. This approach leverages both Teradata's database strengths and the flexibility of Python's machine learning libraries.</p>

<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>9. Cleanup</b></p>


<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Work Tables/Views</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We need to clean up our work tables/views to prevent errors next time.</p>

In [None]:
# Loop through the list of views and execute the drop view command for each view
for view in db_list_tables(object_name='%TDS%', object_type='view')['TableName'].tolist():
    try:
        db_drop_view(view_name=view, schema_name="demo_user")
        #print(view)
    except Exception:
        # Skip views that cannot be dropped (for example, already removed)
        pass

In [None]:
# Loop through the list of tables and execute the drop table command for each table
for table in db_list_tables(object_name='%TDS%', object_type='table')['TableName'].tolist():
    try:
        db_drop_table(table_name=table, schema_name="demo_user")
        #print(table)
    except Exception:
        # Skip tables that cannot be dropped (for example, already removed)
        pass

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Databases and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The following code will clean up tables and databases created above.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_HyperModel');" 
#Takes 40 seconds

In [None]:
remove_context()

<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
<div style="float:left;margin-top:14px">ClearScape Analytics™</div>
<div style="float:right;">
<div style="float:left; margin-top:14px">
            Copyright © Teradata Corporation - 2024. All Rights Reserved
</div>
</div>
</footer>