# TechBytes: Using Python with Teradata Vantage
## Part 3: Modeling with Vantage Analytic Functions - Model Cataloging

The contents of this file are Teradata Public Content and have been released to the Public Domain.
Please see _license.txt_ file in the package for more information.

Alexander Kolovos and Tim Miller - May 2021 - v.2.0 \
Copyright (c) 2021 by Teradata \
Licensed under BSD

This TechByte demonstrates how to
* invoke and use Vantage analytic functions through their teradataml Python wrapper functions.
* use options to display the actual SQL query submitted by teradataml to the Database.
* persist analytical results in teradataml DataFrames as Database tables.
* train and score models in-Database with Vantage analytic functions. A use case is shown with XGBoost and Decision Forest analyses, where we employ Vantage Machine Learning (ML) Engine analytic functions to predict the propensity of bank customers to open a new credit card account. The example further demonstrates a comparison of the 2 models via confusion matrix analysis.
* save, inspect, retrieve, and reuse models created with Vantage analytic functions by means of the teradataml Model Cataloging feature.

_Note_: To use Model Cataloging on your target Advanced SQL Engine, first visit the teradataml page on downloads.teradata.com, and then ask your Database administrator to install and enable this feature on your Vantage system.

Contributions by:
- Alexander Kolovos, Sr Staff Software Architect, Teradata Product Engineering / Vantage Cloud and Applications.
- Tim Miller, Principal Software Architect, Teradata Product Management / Advanced Analytics.

### Initial Steps: Load libraries and create a Vantage connection

In [None]:
# Load teradataml and dependency packages.
#
import os
import getpass as gp

from teradataml import create_context, remove_context, get_context
from teradataml import DataFrame, copy_to_sql, in_schema, db_drop_table
from teradataml.options.display import display

from teradataml import XGBoost, XGBoostPredict, ConfusionMatrix
from teradataml import DecisionForest, DecisionForestEvaluator, DecisionForestPredict

from teradataml import save_model, list_models, describe_model, retrieve_model
from teradataml import publish_model, delete_model

import pandas as pd
import numpy as np

In [None]:
# Specify a Teradata Vantage server to connect to. In the following statement,
# replace the placeholder argument values with strings as follows:
# <HOST>   : Specify your target Vantage system hostname (or IP address).
# <UID>    : Specify your Database username.
# <PWD>    : Specify your password. You can also use encrypted passwords via
#            the Stored Password Protection feature.
#con = create_context(host = <HOST>, username = <UID>, password = <PWD>,
#                     database = <DB_Name>, temp_database_name = <Temp_DB_Name>)
#
con = create_context(host = "<Host_Name>", username = "<Username>",
                            password = gp.getpass(prompt='Password:'), 
                            logmech = "LDAP", database = "TRNG_TECHBYTES",
                            temp_database_name = "<Database_Name>")

In [None]:
# Create a teradataml DataFrame from the ADS we need, and take a glimpse at it.
#
td_ADS_Py = DataFrame("ak_TBv2_ADS_Py")
td_ADS_Py.to_pandas().head(5)

In [None]:
# Split the ADS into 2 samples with 60% and 40% of the total rows, respectively.
# Use the 60% sample to train, and the 40% sample to test/score.
# Persist the samples as tables in the Database, and create DataFrames.
#
td_Train_Test_ADS = td_ADS_Py.sample(frac = [0.6, 0.4])

Train_ADS = td_Train_Test_ADS[td_Train_Test_ADS.sampleid == "1"]
copy_to_sql(Train_ADS, table_name="ak_TBv2_Train_ADS_Py", if_exists="replace")
td_Train_ADS = DataFrame("ak_TBv2_Train_ADS_Py")

Test_ADS = td_Train_Test_ADS[td_Train_Test_ADS.sampleid == "2"]
copy_to_sql(Test_ADS, table_name="ak_TBv2_Test_ADS_Py", if_exists="replace")
td_Test_ADS = DataFrame("ak_TBv2_Test_ADS_Py")
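
As a purely local illustration of what a 60/40 split with sample ids 1 and 2 produces, the following sketch labels a list of rows in plain Python. It is an analogy only and does not reflect how Vantage implements sampling in-Database. The helper `split_by_fraction` and the toy `cust_ids` list are hypothetical names introduced for this example.

```python
import random


def split_by_fraction(rows, fractions, seed=42):
    """Shuffle rows and label each with a 1-based sample id, by fraction.

    Hypothetical helper for illustration; not part of teradataml.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    labeled = []
    start = 0
    for sample_id, frac in enumerate(fractions, start=1):
        count = round(frac * len(shuffled))
        for row in shuffled[start:start + count]:
            labeled.append((row, sample_id))
        start += count
    return labeled


# Toy stand-in for the customer ids in the ADS (assumed data).
cust_ids = list(range(1, 101))
labeled = split_by_fraction(cust_ids, [0.6, 0.4])

# Mirror the notebook's filtering on the sample id.
train_ids = [row for row, sid in labeled if sid == 1]
test_ids = [row for row, sid in labeled if sid == 2]
print(len(train_ids), len(test_ids))
```

The two samples are disjoint, which is what makes the second one usable as a holdout for scoring.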

### 1. Using the ML Engine analytic functions

Assume the use case of predicting credit card account ownership based on independent variables of interest. We will train models, score the test data with them, compare the models, and store them for later retrieval.

In [None]:
# Use the teradataml option to print the SQL code of calls to Advanced SQL
# Engine or ML Engine analytic functions.
#
display.print_sqlmr_query = True

#### 1.1. Model training and scoring with XGBoost

In [None]:
# First, construct a formula to predict Credit Card account ownership based on
# the following independent variables of interest:
#
formula = "cc_acct_ind ~ income + age + tot_cust_years + tot_children + female_ind + single_ind " \
          "+ married_ind + separated_ind + ca_resident_ind + ny_resident_ind + tx_resident_ind " \
          "+ il_resident_ind + az_resident_ind + oh_resident_ind + ck_acct_ind + sv_acct_ind " \
          "+ ck_avg_bal + sv_avg_bal + ck_avg_tran_amt + sv_avg_tran_amt"

# Then, train an XGBoost model to predict Credit Card account ownership on the
# basis of the above formula.
#
td_xgboost_model = XGBoost(data = td_Train_ADS,
                           id_column = 'cust_id',
                           formula = formula,
                           num_boosted_trees = 4,
                           loss_function = 'binomial',
                           prediction_type = 'classification',
                           reg_lambda = 1.0,
                           shrinkage_factor = 0.1,
                           iter_num = 10,
                           min_node_size = 1,
                           max_depth = 6
                           )
#print(td_xgboost_model)
print("Training complete.")
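
The formula string passed to the training call can also be assembled from a list of column names, which may be easier to maintain than a hand-concatenated literal. The sketch below uses only plain Python. The names `target` and `predictors` are introduced here for illustration, and the column names are the ones from the formula above.

```python
# Dependent variable and independent variables, as used in the notebook formula.
target = "cc_acct_ind"
predictors = [
    "income", "age", "tot_cust_years", "tot_children", "female_ind",
    "single_ind", "married_ind", "separated_ind", "ca_resident_ind",
    "ny_resident_ind", "tx_resident_ind", "il_resident_ind",
    "az_resident_ind", "oh_resident_ind", "ck_acct_ind", "sv_acct_ind",
    "ck_avg_bal", "sv_avg_bal", "ck_avg_tran_amt", "sv_avg_tran_amt",
]

# Join the predictors with " + " and prefix the target and "~".
formula = f"{target} ~ " + " + ".join(predictors)
print(formula)
```

Adding or dropping a predictor then only requires editing the list.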

In [None]:
# Score the XGBoost model against the holdout and compare actuals to predicted.
#
td_xgboost_predict = XGBoostPredict(td_xgboost_model,
                                    newdata = td_Test_ADS,
                                    object_order_column = ['tree_id','iter','class_num'],
                                    id_column = 'cust_id',
                                    terms = 'cc_acct_ind',
                                    num_boosted_trees = 4
                                    )

# Persist the XGBoostPredict output
#
try:
    db_drop_table("ak_TBv2_Py_XGBoost_Scores")
except Exception:
    # Ignore the error raised when the table does not exist yet.
    pass

td_xgboost_predict.result.to_sql(if_exists = "replace", table_name = "ak_TBv2_Py_XGBoost_Scores")
td_XGBoost_Scores = DataFrame("ak_TBv2_Py_XGBoost_Scores")
td_XGBoost_Scores.head(5)

#### 1.2. Model training and scoring with Decision Forests

In [None]:
# In a different approach, train a Decision Forest model to predict the same
# target, so we can compare and see which algorithm fits the data best.
#
td_decisionforest_model = DecisionForest(formula = formula,
                                         data = td_Train_ADS,
                                         tree_type = "classification",
                                         ntree = 500,
                                         nodesize = 1,
                                         variance = 0.0,
                                         max_depth = 12,
                                         mtry = 5,
                                         mtry_seed = 100,
                                         seed = 100
                                         )
#print(td_decisionforest_model)
print("Training complete.")

In [None]:
# Call the DecisionForestEvaluator() function to determine the most important
# variables in the Decision Forest model.
#
td_decisionforest_model_evaluator = DecisionForestEvaluator(object = td_decisionforest_model,
                                                            num_levels = 5)

# In the following, the describe() method provides summary statistics across
# trees over grouping by each variable. One can consider the mean importance
# across all trees as the importance for each variable.
#
td_variable_importance = td_decisionforest_model_evaluator.result.select(["variable_col", "importance"]).groupby("variable_col").describe()

print(td_variable_importance)
#print("Variable importance analysis complete.")
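
To make the "mean importance across all trees" idea concrete, here is a small local sketch that groups (variable, importance) rows by variable and averages them. The rows are made-up toy values with one row per tree, not output from the evaluator, and the names `rows`, `by_variable`, and `ranked` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Made-up (variable, importance) rows, one per tree; illustrative values only.
rows = [
    ("income", 0.30), ("age", 0.10), ("ck_avg_bal", 0.25),
    ("income", 0.50), ("age", 0.30), ("income", 0.40),
]

# Group the importance values by variable.
by_variable = defaultdict(list)
for variable, importance in rows:
    by_variable[variable].append(importance)

# Average per variable, then rank from most to least important.
mean_importance = {var: mean(vals) for var, vals in by_variable.items()}
ranked = sorted(mean_importance.items(), key=lambda kv: kv[1], reverse=True)

for variable, importance in ranked:
    print(f"{variable}: {importance:.2f}")
```

In the notebook itself the same aggregation runs in-Database through `groupby("variable_col").describe()`, so nothing needs to be pulled to the client.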

In [None]:
# Score the Decision Forest model
#
td_decisionforest_predict = DecisionForestPredict(td_decisionforest_model,
                                                  newdata = td_Test_ADS,
                                                  id_column = "cust_id",
                                                  detailed = False,
                                                  terms = ["cc_acct_ind"]
                                                  )

# Persist the DecisionForestPredict output
try:
    db_drop_table("ak_TBv2_Py_DecisionForest_Scores")
except Exception:
    # Ignore the error raised when the table does not exist yet.
    pass

copy_to_sql(td_decisionforest_predict.result, if_exists = "replace", 
            table_name="ak_TBv2_Py_DecisionForest_Scores")
td_DecisionForest_Scores = DataFrame("ak_TBv2_Py_DecisionForest_Scores")
td_DecisionForest_Scores.head(5)

#### 1.3. Inspect the 2 modeling approaches through their Confusion Matrix

In [None]:
# Look at the confusion matrix for the XGBoost model.
#
confusion_matrix_XGB = ConfusionMatrix(data = td_XGBoost_Scores,
                                       reference = "cc_acct_ind",
                                       prediction = "prediction"
                                      )
print(confusion_matrix_XGB)

In [None]:
# Look at the confusion matrix for the Decision Forest model.
#
confusion_matrix_DF = ConfusionMatrix(data = td_DecisionForest_Scores,
                                      reference = "cc_acct_ind",
                                      prediction = "prediction"
                                     )
print(confusion_matrix_DF)
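
When comparing the two models, it can help to recall how the headline statistics follow from the four confusion matrix cells. The sketch below computes them locally for a binary target from toy actual and predicted labels. The data are made up, and the helpers `confusion_counts` and `summarize` are hypothetical and unrelated to the Vantage ConfusionMatrix function.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary target."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            counts["tp"] += 1
        elif a != positive and p == positive:
            counts["fp"] += 1
        elif a == positive and p != positive:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def summarize(counts):
    """Derive accuracy, precision, and recall from the four cell counts."""
    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels standing in for cc_acct_ind and prediction (assumed data).
actual = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

counts = confusion_counts(actual, predicted)
metrics = summarize(counts)
print(dict(counts))
print(metrics)
```

Running the same calculation on each model's scores gives a like-for-like basis for choosing between them.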

### 2. Model Cataloging
The Model Cataloging feature provides tools to save, inspect, retrieve, and reuse models created either in the Advanced SQL Engine or the ML Engine.

In [None]:
# Save the XGBoost and Decision Forest models.
#
save_model(model = td_xgboost_model, name = "ak_TBv2_Py_CC_XGB_model", 
           description = "TechBytes (Python): XGBoost for credit card analysis")
save_model(model = td_decisionforest_model, name = "ak_TBv2_Py_CC_DF_model", 
           description = "TechBytes (Python): DF for credit card analysis")

In [None]:
# Inspect presently saved models.
#
list_models()

In [None]:
# Print details about a specific model.
#
describe_model(name = "ak_TBv2_Py_CC_DF_model")

In [None]:
# Recreate a teradataml Analytic Function object from the information saved
# with the Model Catalog.
#
td_retrieved_DF_model = retrieve_model("ak_TBv2_Py_CC_DF_model")

In [None]:
# Assume that on the basis of the earlier model comparison, we choose to keep
# the Decision Forests model and discard the XGBoost one.
#
# The publish_model() function enables sharing the selected models with
# other users, and specifying a status among the available options
# of "In-Development", "Candidate", "Active", "Production", and "Retired".
#
publish_model("ak_TBv2_Py_CC_DF_model", grantee = "public", status = "Active")

In [None]:
# Delete models that are no longer needed. Both saved models are removed here
# to clean up after the demonstration.
#
delete_model("ak_TBv2_Py_CC_DF_model")
delete_model("ak_TBv2_Py_CC_XGB_model")

### End of session

In [None]:
# Remove the context of the present teradataml session to end the session.
# It is recommended to call the remove_context() function for session
# cleanup. Temporary objects are removed at the end of the session.
#
remove_context()