# TechBytes: Using Python with Teradata Vantage
## Part 3: Modeling with Vantage Analytic Functions - Model Cataloging

The contents of this file are Teradata Public Content and have been released to the Public Domain.
Please see _license.txt_ file in the package for more information.

Alexander Kolovos and Tim Miller - May 2021 - v.2.0 \
Copyright (c) 2021 by Teradata \
Licensed under BSD

This TechByte demonstrates how to
* invoke and use Vantage analytic functions through their teradataml Python wrapper functions.
* use options to display the actual SQL query submitted by teradataml to the Database.
* persist analytical results in teradataml DataFrames as Database tables.
* train and score models in-Database with Vantage analytic functions. A use case is shown with XGBoost and Decision Forest analyses, where we employ Vantage Machine Learning (ML) Engine analytic functions to predict the propensity of bank customers to open a new credit card account. The example further demonstrates a comparison of the 2 models via confusion matrix analysis.
* save, inspect, retrieve, and reuse models created with Vantage analytic functions by means of the teradataml Model Cataloging feature.

_Note_: To use Model Cataloging on your target Advanced SQL Engine, first visit the teradataml page on downloads.teradata.com, then ask your Database administrator to install and enable this feature on your Vantage system.

Contributions by:
- Alexander Kolovos, Sr Staff Software Architect, Teradata Product Engineering / Vantage Cloud and Applications.
- Tim Miller, Principal Software Architect, Teradata Product Management / Advanced Analytics.

### Initial Steps: Load libraries and create a Vantage connection

In [1]:
# Load teradataml and dependency packages.
#
import os
import getpass as gp

from teradataml import create_context, remove_context, get_context
from teradataml import DataFrame, copy_to_sql, in_schema, db_drop_table
from teradataml.options.display import display

from teradataml import XGBoost, XGBoostPredict, ConfusionMatrix
from teradataml import DecisionForest, DecisionForestEvaluator, DecisionForestPredict

from teradataml import save_model, list_models, describe_model, retrieve_model
from teradataml import publish_model, delete_model

import pandas as pd
import numpy as np

In [2]:
# Specify a Teradata Vantage server to connect to. In the following statement,
# replace the placeholder argument values with strings as follows:
# <HOST>         : Specify your target Vantage system hostname (or IP address).
# <UID>          : Specify your Database username.
# <PWD>          : Specify your password. You can also use encrypted passwords via
#                  the Stored Password Protection feature.
# <DB_Name>      : Specify the default database for the session.
# <Temp_DB_Name> : Specify the database where teradataml creates temporary objects.
#con = create_context(host = <HOST>, username = <UID>, password = <PWD>,
#                     database = <DB_Name>, temp_database_name = <Temp_DB_Name>)
#
con = create_context(host = "tdprd.td.teradata.com", username = "ak186064",
                            password = gp.getpass(prompt='Password:'), 
                            logmech = "LDAP", database = "TRNG_TECHBYTES",
                            temp_database_name = "ak186064")

Password: ··············
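
Hard-coding the host and user name, as in the cell above, ties the notebook to one system. The following is a minimal sketch of one alternative: it collects the same `create_context()` keyword arguments from environment variables. The variable names (`VANTAGE_HOST` and so on) and the helper `context_kwargs_from_env` are hypothetical and chosen only for illustration; the keyword names are the ones used in the cell above.

```python
import os

# Hypothetical environment variable names; adapt them to your own setup.
ENV_TO_KWARG = {
    "VANTAGE_HOST": "host",
    "VANTAGE_USER": "username",
    "VANTAGE_LOGMECH": "logmech",
    "VANTAGE_DATABASE": "database",
    "VANTAGE_TEMP_DATABASE": "temp_database_name",
}

def context_kwargs_from_env(environ=None):
    """Build a dict of create_context() keyword arguments.

    Variables that are unset or empty are skipped, so the caller can
    decide how to handle missing settings.
    """
    environ = os.environ if environ is None else environ
    return {kwarg: environ[name]
            for name, kwarg in ENV_TO_KWARG.items()
            if environ.get(name)}

# Example with an explicit mapping instead of the real environment.
example = context_kwargs_from_env({
    "VANTAGE_HOST": "myhost.example.com",
    "VANTAGE_USER": "myuser",
    "VANTAGE_LOGMECH": "",          # empty values are ignored
})
print(example)
```

The password is left out on purpose. One option is to keep prompting for it with `getpass` and pass it alongside the collected arguments, for example `create_context(password = gp.getpass(), **example)`.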


In [3]:
# Create a teradataml DataFrame from the ADS we need, and take a glimpse at it.
#
td_ADS_Py = DataFrame("ak_TBv2_ADS_Py")
td_ADS_Py.to_pandas().head(5)

Unnamed: 0,cust_id,income,age,tot_cust_years,tot_children,female_ind,single_ind,married_ind,separated_ind,state_code,...,ck_avg_bal,sv_avg_bal,cc_avg_bal,ck_avg_tran_amt,sv_avg_tran_amt,cc_avg_tran_amt,q1_trans_cnt,q2_trans_cnt,q3_trans_cnt,q4_trans_cnt
0,28617939,5724.2,70,10,1,1,0,1,0,CA,...,291.728969,291.728969,291.728969,2.879832,2.879832,2.879832,384,57,69,72
1,27252140,48900.0,44,6,1,0,0,1,0,OTHER,...,4875.089945,0.0,4875.089945,14.125061,14.125061,14.125061,220,28,30,48
2,23166920,31020.6,73,9,0,0,0,1,0,CA,...,6059.145,0.0,0.0,-28.884351,-28.884351,0.0,15,14,17,11
3,19077268,6445.6,66,6,1,1,0,1,0,NY,...,4281.467,0.0,0.0,-5.011717,-5.011717,0.0,54,45,0,0
4,23168535,1625.2,39,5,0,1,0,1,0,TX,...,1430.087142,1430.087142,0.0,41.337208,41.337208,0.0,80,100,58,22


In [4]:
# Split the ADS into 2 samples that hold 60% and 40% of the total rows,
# respectively. Use the 60% sample to train, and the 40% sample to test/score.
# Persist the samples as tables in the Database, and create DataFrames.
#
td_Train_Test_ADS = td_ADS_Py.sample(frac = [0.6, 0.4])

Train_ADS = td_Train_Test_ADS[td_Train_Test_ADS.sampleid == "1"]
copy_to_sql(Train_ADS, table_name="ak_TBv2_Train_ADS_Py", if_exists="replace")
td_Train_ADS = DataFrame("ak_TBv2_Train_ADS_Py")

Test_ADS = td_Train_Test_ADS[td_Train_Test_ADS.sampleid == "2"]
copy_to_sql(Test_ADS, table_name="ak_TBv2_Test_ADS_Py", if_exists="replace")
td_Test_ADS = DataFrame("ak_TBv2_Test_ADS_Py")
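
To make the idea behind the split concrete, here is a small client-side sketch in plain Python. It is not how Vantage samples rows in-Database; it only illustrates that every row lands in exactly one of two disjoint samples whose sizes follow the requested fractions. The helper `split_ids`, its `seed` argument, and the toy ids are assumptions made for this illustration; the sample labels 1 and 2 mirror the `sampleid` values filtered above.

```python
import random

def split_ids(ids, fractions=(0.6, 0.4), seed=42):
    """Assign each id to one of two samples.

    Assumes the two fractions sum to 1; the first fraction fixes the
    size of sample 1, and the remaining ids form sample 2.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    cut = round(len(shuffled) * fractions[0])
    return {1: shuffled[:cut], 2: shuffled[cut:]}

samples = split_ids(range(1, 11))
print(len(samples[1]), len(samples[2]))  # 6 4
```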

### 1. Using the ML Engine analytic functions

Assume the use case of predicting credit card account ownership based on independent variables of interest. We will be training models, scoring the test data with them, comparing models and storing them for retrieval.

In [5]:
# Use the teradataml option to print the SQL code of calls to Advanced SQL
# Engine or ML Engine analytic functions.
#
display.print_sqlmr_query = True

#### 1.1. Model training and scoring with XGBoost

In [6]:
# First, construct a formula to predict Credit Card account ownership based on
# the following independent variables of interest:
#
formula = "cc_acct_ind ~ income + age + tot_cust_years + tot_children + female_ind + single_ind " \
          "+ married_ind + separated_ind + ca_resident_ind + ny_resident_ind + tx_resident_ind " \
          "+ il_resident_ind + az_resident_ind + oh_resident_ind + ck_acct_ind + sv_acct_ind " \
          "+ ck_avg_bal + sv_avg_bal + ck_avg_tran_amt + sv_avg_tran_amt"

# Then, train an XGBoost model to predict Credit Card account ownership on the
# basis of the above formula.
#
td_xgboost_model = XGBoost(data = td_Train_ADS,
                           id_column = 'cust_id',
                           formula = formula,
                           num_boosted_trees = 4,
                           loss_function = 'binomial',
                           prediction_type = 'classification',
                           reg_lambda = 1.0,
                           shrinkage_factor = 0.1,
                           iter_num = 10,
                           min_node_size = 1,
                           max_depth = 6
                           )
#print(td_xgboost_model)
print("Training complete.")

SELECT * FROM XGBoost(
	ON "ak_TBv2_Train_ADS_Py" AS InputTable
	OUT TABLE OutputTable(ak186064.ml__td_xgboost0_162077915941196)
	USING
	IdColumn('cust_id')
	NumBoostedTrees('4')
	LossFunction('binomial')
	PredictionType('classification')
	MaxDepth('6')
	ResponseColumn('cc_acct_ind')
	NumericInputs('income','age','tot_cust_years','tot_children','female_ind','single_ind','married_ind','separated_ind','ca_resident_ind','ny_resident_ind','tx_resident_ind','il_resident_ind','az_resident_ind','oh_resident_ind','ck_acct_ind','sv_acct_ind','ck_avg_bal','sv_avg_bal','ck_avg_tran_amt','sv_avg_tran_amt')
) as sqlmr
Training complete.
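
The generated SQL above shows that the left-hand side of the formula ends up in `ResponseColumn` and the right-hand side terms in `NumericInputs`. The sketch below mimics that mapping for simple additive formulas. It is not teradataml's own formula parser; `split_formula` is a hypothetical helper written only to illustrate the correspondence.

```python
def split_formula(formula):
    """Split 'response ~ x1 + x2 + ...' into (response, [x1, x2, ...])."""
    response, _, predictors = formula.partition("~")
    if not response.strip() or not predictors.strip():
        raise ValueError("formula must look like 'y ~ x1 + x2'")
    return response.strip(), [term.strip() for term in predictors.split("+")]

response, inputs = split_formula("cc_acct_ind ~ income + age + tot_cust_years")
print(response)  # cc_acct_ind
print(inputs)    # ['income', 'age', 'tot_cust_years']
```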


In [7]:
# Score the XGBoost model against the holdout and compare actuals to predicted.
#
td_xgboost_predict = XGBoostPredict(td_xgboost_model,
                                    newdata = td_Test_ADS,
                                    object_order_column = ['tree_id','iter','class_num'],
                                    id_column = 'cust_id',
                                    terms = 'cc_acct_ind',
                                    num_boosted_trees = 4
                                    )

# Persist the XGBoostPredict output
#
try:
    db_drop_table("ak_TBv2_Py_XGBoost_Scores")
except Exception:
    # The table may not exist yet; the write below replaces it in any case.
    pass

td_xgboost_predict.result.to_sql(if_exists = "replace", table_name = "ak_TBv2_Py_XGBoost_Scores")
td_XGBoost_Scores = DataFrame("ak_TBv2_Py_XGBoost_Scores")
td_XGBoost_Scores.head(5)

SELECT * FROM XGBoostPredict(
	ON "ak_TBv2_Test_ADS_Py" AS "input"
	PARTITION BY ANY 
	ON "ak186064"."ml__td_xgboost0_162077915941196" AS ModelTable
	DIMENSION
	ORDER BY "tree_id","iter","class_num"
	USING
	IdColumn('cust_id')
	Accumulate('cc_acct_ind')
	NumBoostedTrees('4')
) as sqlmr


    cust_id  cc_acct_ind prediction  confidence_lower  confidence_upper
0  13624960            0          0               1.0               1.0
1  13625020            0          1               1.0               1.0
2  13624980            0          1               1.0               1.0
3  13624860            0          1               1.0               1.0
4  13624840            1          1               1.0               1.0

#### 1.2. Model training and scoring with Decision Forests

In [8]:
# In a different approach, train a Decision Forest model to predict the same
# target, so we can compare and see which algorithm fits the data best.
#
td_decisionforest_model = DecisionForest(formula = formula,
                                         data = td_Train_ADS,
                                         tree_type = "classification",
                                         ntree = 500,
                                         nodesize = 1,
                                         variance = 0.0,
                                         max_depth = 12,
                                         mtry = 5,
                                         mtry_seed = 100,
                                         seed = 100
                                         )
#print(td_decisionforest_model)
print("Training complete.")

SELECT * FROM DecisionForest(
	ON "ak_TBv2_Train_ADS_Py" AS InputTable
	OUT TABLE OutputTable(ak186064.ml__td_decisionforest0_162078106630228)
	OUT TABLE MonitorTable(ak186064.ml__td_decisionforest1_162077867853955)
	USING
	TreeType('classification')
	NumTrees('500')
	MaxNumCategoricalValues('1000')
	Mtry('5')
	MtrySeed('100')
	Seed('100')
	ResponseColumn('cc_acct_ind')
	NumericInputs('income','age','tot_cust_years','tot_children','female_ind','single_ind','married_ind','separated_ind','ca_resident_ind','ny_resident_ind','tx_resident_ind','il_resident_ind','az_resident_ind','oh_resident_ind','ck_acct_ind','sv_acct_ind','ck_avg_bal','sv_avg_bal','ck_avg_tran_amt','sv_avg_tran_amt')
) as sqlmr
Training complete.


In [9]:
# Call the DecisionForestEvaluator() function to determine the most important
# variables in the Decision Forest model.
#
td_decisionforest_model_evaluator = DecisionForestEvaluator(object = td_decisionforest_model,
                                                            num_levels = 5)

# In the following, the describe() method provides summary statistics of the
# importance values across trees, grouped by variable. One can consider the
# mean importance across all trees as the importance for each variable.
#
td_variable_importance = td_decisionforest_model_evaluator.result.select(["variable_col", "importance"]).groupby("variable_col").describe()

print(td_variable_importance)
#print("Variable importance analysis complete.")

SELECT * FROM DecisionForestEvaluator(
	ON "ak186064"."ml__td_decisionforest0_162078106630228" AS "input"
	PARTITION BY ANY
) as sqlmr
                      importance
variable_col    func            
age             25%         .089
                50%         .255
                75%         .459
                count        360
                max        1.123
                mean        .295
                min         .006
                std         .228
az_resident_ind 25%         .018
                50%         .031
                75%         .046
                count         35
                max         .245
                mean        .041
                min         .003
                std         .044
ca_resident_ind 25%          .02
                50%         .081
                75%         .161
                count        127
                max         .665
                mean        .127
                min         .001
                std         .146
ck_acct
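
As the cell comment suggests, the mean importance across trees can serve as a single score per variable. The sketch below ranks variables by that score. The three mean values are copied from the truncated output above, so the ranking covers only those variables; `rank_by_importance` is a hypothetical helper.

```python
# Mean importance values copied from the (truncated) output above.
mean_importance = {
    "age": 0.295,
    "az_resident_ind": 0.041,
    "ca_resident_ind": 0.127,
}

def rank_by_importance(means):
    """Return (variable, mean importance) pairs, most important first."""
    return sorted(means.items(), key=lambda item: item[1], reverse=True)

ranking = rank_by_importance(mean_importance)
for variable, importance in ranking:
    print(f"{variable:<16} {importance:.3f}")
```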

In [10]:
# Score the Decision Forest model
#
td_decisionforest_predict = DecisionForestPredict(td_decisionforest_model,
                                                  newdata = td_Test_ADS,
                                                  id_column = "cust_id",
                                                  detailed = False,
                                                  terms = ["cc_acct_ind"]
                                                  )

# Persist the DecisionForestPredict output
#
try:
    db_drop_table("ak_TBv2_Py_DecisionForest_Scores")
except Exception:
    # The table may not exist yet; the write below replaces it in any case.
    pass

copy_to_sql(td_decisionforest_predict.result, if_exists = "replace", 
            table_name="ak_TBv2_Py_DecisionForest_Scores")
td_DecisionForest_Scores = DataFrame("ak_TBv2_Py_DecisionForest_Scores")
td_DecisionForest_Scores.head(5)

SELECT * FROM DecisionForestPredict(
	ON "ak_TBv2_Test_ADS_Py" AS "input"
	PARTITION BY ANY 
	ON "ak186064"."ml__td_decisionforest0_162078106630228" AS ModelTable
	DIMENSION
	USING
	IdColumn('cust_id')
	Accumulate('cc_acct_ind')
) as sqlmr


   cc_acct_ind   cust_id prediction  confidence_lower  confidence_upper
0            0  13629760          1          0.800412          0.800412
1            0  14997136          1          0.674897          0.674897
2            0  16355184          1          0.814815          0.814815
3            0  14994595          1          0.790123          0.790123
4            0  24539364          0          0.510288          0.510288

#### 1.3. Inspect the 2 modeling approaches through their Confusion Matrix

In [11]:
# Look at the confusion matrix for the XGBoost model.
#
confusion_matrix_XGB = ConfusionMatrix(data = td_XGBoost_Scores,
                                       reference = "cc_acct_ind",
                                       prediction = "prediction"
                                      )
print(confusion_matrix_XGB)

SELECT * FROM ConfusionMatrix(
	ON "ak_TBv2_Py_XGBoost_Scores" AS "input"
	PARTITION BY 1
	OUT TABLE CountTable(ak186064.ml__td_confusionmatrix0_162078240563546)
	OUT TABLE StatTable(ak186064.ml__td_confusionmatrix1_162077463074568)
	OUT TABLE AccuracyTable(ak186064.ml__td_confusionmatrix2_162078667440282)
	USING
	ObservationColumn('cc_acct_ind')
	PredictColumn('prediction')
) as sqlmr
############ STDOUT Output ############

Empty DataFrame
Columns: []
Index: [The result has been outputted to output tables, Success !]


############ counttable Output ############

  observation    0     1
0           1  165  2445
1           0  510   598


############ stattable Output ############

                    key             value
0       Null Error Rate             0.298
1                 Kappa            0.4474
2  Mcnemar Test P-Value                 0
3   P-Value [Acc > NIR]                 0
4                95% CI  (0.7814, 0.8077)
5              Accuracy            0.7948


###########

In [12]:
# Look at the confusion matrix for the Decision Forest model.
#
confusion_matrix_DF = ConfusionMatrix(data = td_DecisionForest_Scores,
                                      reference = "cc_acct_ind",
                                      prediction = "prediction"
                                     )
print(confusion_matrix_DF)

SELECT * FROM ConfusionMatrix(
	ON "ak_TBv2_Py_DecisionForest_Scores" AS "input"
	PARTITION BY 1
	OUT TABLE CountTable(ak186064.ml__td_confusionmatrix0_162083681916967)
	OUT TABLE StatTable(ak186064.ml__td_confusionmatrix1_162077210897868)
	OUT TABLE AccuracyTable(ak186064.ml__td_confusionmatrix2_162077772381189)
	USING
	ObservationColumn('cc_acct_ind')
	PredictColumn('prediction')
) as sqlmr
############ STDOUT Output ############

Empty DataFrame
Columns: []
Index: [The result has been outputted to output tables, Success !]


############ counttable Output ############

  observation    0     1
0           1   31  2579
1           0  326   782


############ stattable Output ############

                    key             value
0       Null Error Rate             0.298
1                 Kappa            0.3508
2  Mcnemar Test P-Value                 0
3   P-Value [Acc > NIR]                 0
4                95% CI  (0.7677, 0.7945)
5              Accuracy            0.7813


####
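
The Accuracy and Kappa figures in the two stattable outputs can be cross-checked from the counttable outputs alone. The sketch below does this in plain Python, assuming that the counttable rows are the observed classes and the columns are the predicted classes. Both metrics are unchanged if the table is transposed, so that assumption does not affect the result. The helper name `accuracy_and_kappa` is hypothetical.

```python
def accuracy_and_kappa(tp, fn, fp, tn):
    """Return (accuracy, Cohen's kappa) for a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    observed = (tp + tn) / total
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Counts copied from the counttable outputs above.
xgb = accuracy_and_kappa(tp=2445, fn=165, fp=598, tn=510)
dforest = accuracy_and_kappa(tp=2579, fn=31, fp=782, tn=326)

print("XGBoost         accuracy=%.4f kappa=%.4f" % xgb)
print("Decision Forest accuracy=%.4f kappa=%.4f" % dforest)
```

Rounded to four decimals, the results agree with the reported values: 0.7948 and 0.4474 for XGBoost, and 0.7813 and 0.3508 for Decision Forest.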

### 2. Model Cataloging
The teradataml Model Cataloging feature provides tools to save, inspect, retrieve, and reuse models created either in the Advanced SQL Engine or the ML Engine.

In [13]:
# Save the XGBoost and Decision Forest models.
#
save_model(model = td_xgboost_model, name = "ak_TBv2_Py_CC_XGB_model", 
           description = "TechBytes (Python): XGBoost for credit card analysis")
save_model(model = td_decisionforest_model, name = "ak_TBv2_Py_CC_DF_model", 
           description = "TechBytes (Python): DF for credit card analysis")

Persisting model information.
Persisted table: "ak186064"."ml__td_xgboost0_162077915941196"
Persisted table: "ak186064"."ml__td_sqlmr_out__162076768057945"
Successfully persisted model.
Persisting model information.
Persisted table: "ak186064"."ml__td_decisionforest0_162078106630228"
Persisted table: "ak186064"."ml__td_decisionforest1_162077867853955"
Persisted table: "ak186064"."ml__td_sqlmr_out__162080888675025"
Successfully persisted model.


In [14]:
# Inspect presently saved models.
#
list_models()

                 ModelName  ModelAlgorithm ModelGeneratingEngine ModelGeneratingClient CreatedBy                 CreatedDate
0   ak_TBv2_Py_CC_DF_model  DecisionForest             ML Engine            teradataml  AK186064  2021-05-03 17:09:27.880000
1  ak_TBv2_Py_CC_XGB_model         XGBoost             ML Engine            teradataml  AK186064  2021-05-03 17:09:11.100000


In [15]:
# Print details about a specific model.
#
describe_model(name = "ak_TBv2_Py_CC_DF_model")



*** 'ak_TBv2_Py_CC_DF_model': Model Details ***
ModelName                                       ak_TBv2_Py_CC_DF_model
ModelDescription       TechBytes (Python): DF for credit card analysis
ModelAlgorithm                                          DecisionForest
ModelPredictionType                                     CLASSIFICATION
ModelTargetColumn                                          cc_acct_ind
ModelProject                                                      None
ModelEntityTarget                                                 None
ModelGeneratingEngine                                        ML Engine
ModelGeneratingClient                                       teradataml
ModelAccess                                                    Private
ModelStatus                                             In-Development
ModelBuildTime                                                       0
ModelLocation                                      Advanced SQL Engine
CreatedBy                  

In [16]:
# Recreate a teradataml analytic function object from the information saved
# in the Model Catalog.
#
td_retrieved_DF_model = retrieve_model("ak_TBv2_Py_CC_DF_model")

In [17]:
# Assume that, on the basis of the earlier model comparison, we choose to keep
# the Decision Forest model and discard the XGBoost one.
#
# The publish_model() function enables sharing a selected model with
# other users, and specifying a status among the available options
# of "In-Development", "Candidate", "Active", "Production", and "Retired".
#
publish_model("ak_TBv2_Py_CC_DF_model", grantee = "public", status = "Active")

Model published successfully!
Please execute the following GRANT statements:
GRANT SELECT ON "ak186064"."ml__td_sqlmr_out__162080888675025" to public;
GRANT SELECT ON "ak186064"."ml__td_decisionforest1_162077867853955" to public;
GRANT SELECT ON "ak186064"."ml__td_decisionforest0_162078106630228" to public;
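
The output above asks for one GRANT statement per model table. When a model has several underlying tables, a short helper can produce those statements from a list of names. This is only a string-building sketch; `grant_select_statements` is a hypothetical helper, and the table names are copied from the output above.

```python
def grant_select_statements(tables, grantee="public"):
    """Return one GRANT SELECT statement per model table."""
    return [f"GRANT SELECT ON {table} to {grantee};" for table in tables]

# Table names copied from the publish_model() output above.
model_tables = [
    '"ak186064"."ml__td_sqlmr_out__162080888675025"',
    '"ak186064"."ml__td_decisionforest1_162077867853955"',
    '"ak186064"."ml__td_decisionforest0_162078106630228"',
]

statements = grant_select_statements(model_tables)
for statement in statements:
    print(statement)
```

The printed statements can then be submitted through whichever SQL client you normally use.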


In [18]:
# Delete models that are no longer needed.
#
delete_model("ak_TBv2_Py_CC_DF_model")
delete_model("ak_TBv2_Py_CC_XGB_model")

Deleted model 'ak_TBv2_Py_CC_DF_model' successfully.
Model Objects that can be dropped: ['"ak186064"."ml__td_sqlmr_out__162080888675025"', '"ak186064"."ml__td_decisionforest1_162077867853955"', '"ak186064"."ml__td_decisionforest0_162078106630228"'].
Deleted model 'ak_TBv2_Py_CC_XGB_model' successfully.
Model Objects that can be dropped: ['"ak186064"."ml__td_sqlmr_out__162076768057945"', '"ak186064"."ml__td_xgboost0_162077915941196"'].
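
Deleting a model from the catalog leaves its underlying tables in place, which is why the output lists the objects that can be dropped. The sketch below turns such a list into DROP TABLE statements and also splits a qualified name into its database and table parts. Both helpers are hypothetical, and they assume that the names contain no embedded dots or double quotes.

```python
def split_qualified_name(qualified):
    """Split '"database"."table"' into (database, table) without quotes."""
    database, table = qualified.split(".", 1)
    return database.strip('"'), table.strip('"')

def drop_statements(qualified_names):
    """Return one DROP TABLE statement per qualified table name."""
    return [f"DROP TABLE {name};" for name in qualified_names]

# Names copied from the delete_model() output for the XGBoost model.
droppable = [
    '"ak186064"."ml__td_sqlmr_out__162076768057945"',
    '"ak186064"."ml__td_xgboost0_162077915941196"',
]

parts = split_qualified_name(droppable[1])
drops = drop_statements(droppable)
print(parts)
for statement in drops:
    print(statement)
```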


### End of session

In [19]:
# Remove the context of the present teradataml session to close the connection
# to Vantage. Calling the remove_context() function is recommended for session
# cleanup, because temporary objects are removed at the end of the session.
#
remove_context()

True
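
If a cell fails midway, the cleanup call above may never run. One way to guard against that in a script is a small context manager that always calls the cleanup function on exit. The sketch below uses stand-in callables so that it runs without a Vantage system; `managed_session` is a hypothetical helper, not part of teradataml.

```python
from contextlib import contextmanager

@contextmanager
def managed_session(connect, disconnect):
    """Call connect() on entry and always call disconnect() on exit."""
    connection = connect()
    try:
        yield connection
    finally:
        disconnect()

# Stand-in callables that only record what happened.
events = []
with managed_session(lambda: events.append("connected") or "con",
                     lambda: events.append("removed")) as con:
    events.append(f"working with {con}")

print(events)  # ['connected', 'working with con', 'removed']
```

In a real session, `connect` could wrap the `create_context(...)` call from the start of this notebook, and `disconnect` could be `remove_context`.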