# TechBytes: Using Python with Teradata Vantage
## Part 4: In-Database scripting with the SCRIPT Table Operator - Map functions

The contents of this file are Teradata Public Content and have been released to the Public Domain.
Please see _license.txt_ file in the package for more information.

Alexander Kolovos and Tim Miller - May 2021 - v.2.0 \
Copyright (c) 2021 by Teradata \
Licensed under BSD

This TechByte demonstrates a few different ways in which teradataml helps you use the SCRIPT Table Operator (STO) Database object for native execution of Python scripts inside the Database. Specifically, this TechByte shows how to
* test a Python script on the client for correct in-Database execution with the teradataml STO Sandbox feature.
* bring into the Database a Python model you have previously trained, and use it with a scoring script for in-Database scoring.
* scale in-Database tasks of training and scoring with multiple models when using partitioned data.
* use the teradataml DataFrame Map functions; namely, use the map_row() method for row-based operations and the map_partition() method for partition-based operations.

_Note_: To perform in-nodes script execution, you need to coordinate with your Database Administrator (DBA), and ensure that (a) the STO Database object is activated in the Advanced SQL Engine, (b) the Teradata In-Nodes Python Interpreter and Add-ons packages are installed in the target server, and (c) your Database user account has the necessary STO permissions enabled by the DBA.

Contributions by:
- Alexander Kolovos, Sr Staff Software Architect, Teradata Product Engineering / Vantage Cloud and Applications.
- Tim Miller, Principal Software Architect, Teradata Product Management / Advanced Analytics.

### Initial Steps: Load libraries and create a Vantage connection

In [None]:
# Load teradataml and dependency packages.
#
import os
import getpass as gp

from teradataml import create_context, remove_context, get_context
from teradataml import DataFrame, copy_to_sql, in_schema, configure
from teradataml.options.display import display

from teradataml.table_operators.Script import Script
from teradataml.table_operators.sandbox_container_util import *
from teradatasqlalchemy import (VARCHAR, INTEGER, FLOAT, CLOB)

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import pickle
import base64

In [None]:
# Specify a Teradata Vantage server to connect to. In the following statement,
# replace the argument values with strings as follows:
# <HOST>   : Specify your target Vantage system hostname (or IP address).
# <UID>    : Specify your Database username.
# <PWD>    : Specify your password. You can also use encrypted passwords via
#            the Stored Password Protection feature.
#con = create_context(host = <HOST>, username = <UID>, password = <PWD>,
#                     database = <DB_Name>, temp_database_name = <Temp_DB_Name>)
#
con = create_context(host = "<Host_Name>", username = "<Username>",
                            password = gp.getpass(prompt='Password:'), 
                            logmech = "LDAP", database = "TRNG_TECHBYTES",
                            temp_database_name = "<Database_Name>")
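
The connection arguments above can also be collected in one place before the call. The following is a minimal sketch, assuming hypothetical environment variable names `TD_HOST` and `TD_USERNAME` (they are not defined by teradataml); the password is still requested interactively.

```python
import getpass
import os


def connection_settings(env=None, ask_password=getpass.getpass):
    """Collect connection arguments; TD_HOST and TD_USERNAME are hypothetical names."""
    env = os.environ if env is None else env
    return {
        "host": env.get("TD_HOST", "<Host_Name>"),
        "username": env.get("TD_USERNAME", "<Username>"),
        # The password is never read from the environment in this sketch.
        "password": ask_password("Password:"),
    }


# Demonstration with made-up values and a stand-in password prompt.
demo = connection_settings(
    env={"TD_HOST": "vantage.example.com", "TD_USERNAME": "demo_user"},
    ask_password=lambda prompt: "not-a-real-password",
)
print({key: value for key, value in demo.items() if key != "password"})
```

With such a helper, the call could take the form `create_context(**connection_settings(), logmech = "LDAP", database = "TRNG_TECHBYTES", temp_database_name = "<Database_Name>")`.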

In [None]:
# Create a teradataml DataFrame from the ADS we need, and take a glimpse at it.
#
td_ADS_Py = DataFrame("ak_TBv2_ADS_Py")
td_ADS_Py.to_pandas().head(5)

In [None]:
# Split the ADS into 2 samples with 60% and 40% of total rows, respectively.
# Use the 60% sample to train, and the 40% sample to test/score.
# Persist the samples as tables in the Database.
#
td_Train_Test_ADS = td_ADS_Py.sample(frac = [0.6, 0.4])

Train_ADS = td_Train_Test_ADS[td_Train_Test_ADS.sampleid == "1"]
copy_to_sql(Train_ADS, table_name="ak_TBv2_Train_ADS_Py", if_exists="replace")

Test_ADS = td_Train_Test_ADS[td_Train_Test_ADS.sampleid == "2"]
copy_to_sql(Test_ADS, table_name="ak_TBv2_Test_ADS_Py", if_exists="replace")
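
To make the role of the `sampleid` column concrete, here is a conceptual sketch in plain Python of tagging rows with sample IDs 1 and 2 in a 60/40 proportion. It is only an illustration under stated assumptions: the function name and the shuffling approach are hypothetical, and the actual sampling is performed by the Database, not by code like this.

```python
import random


def assign_sample_ids(row_ids, fractions=(0.6, 0.4), seed=0):
    """Tag each row ID with a sample ID (1, 2, ...) according to the fractions."""
    rng = random.Random(seed)
    shuffled = list(row_ids)
    rng.shuffle(shuffled)
    assignment = {}
    start = 0
    for sample_id, frac in enumerate(fractions, start=1):
        stop = start + round(frac * len(shuffled))
        for row_id in shuffled[start:stop]:
            assignment[row_id] = sample_id
        start = stop
    return assignment


tags = assign_sample_ids(range(10))
train_ids = sorted(rid for rid, sid in tags.items() if sid == 1)
test_ids = sorted(rid for rid, sid in tags.items() if sid == 2)
print(len(train_ids), len(test_ids))
```

Filtering on the tag, as the cell above does with `sampleid`, then yields disjoint training and test subsets.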

### 1. BYOM: In-Database scoring with external Python model
In this segment, we illustrate training a Random Forests classification model on the client, and saving it as a binary file. Eventually, we send the model together with a scoring script to the Database, and use them in the Database environment to scale a scoring task of a test dataset. In between, we demonstrate testing of the scoring script outside the Database in the teradataml STO Sandbox.

#### 1.1. Random Forests classification model training on the client

In [None]:
# Start with the training subset of the ADS. First, read it from the table.
# Then, convert to pandas DataFrame to enable subsequent modeling operations.
#
td_Train_ADS = DataFrame("ak_TBv2_Train_ADS_Py")

df_Train_ADS = td_Train_ADS.to_pandas()
df_Train_ADS.head()

In [None]:
# Determine the columns that the predictor accounts for.
#
predictor_columns = ["income", "age", "tot_cust_years", "tot_children",
                     "female_ind", "single_ind", "married_ind", "separated_ind",
                     "ck_acct_ind", "sv_acct_ind", "ck_avg_bal", "sv_avg_bal",
                     "ck_avg_tran_amt", "sv_avg_tran_amt", "q1_trans_cnt",
                     "q2_trans_cnt", "q3_trans_cnt", "q4_trans_cnt"]

# Note: At the time of creation of this TechByte, in-nodes Python has the RF
#       classifier from the scikit-learn add-on library v.0.22.2.post1. Watch
#       for potential incompatibilities if a package version on your client
#       differs from the in-nodes add-on version.
#       In the present TechByte, the client carries scikit-learn add-on library
#       v.0.23.2. No issues were observed when using a model built with this
#       later version for in-nodes scoring.
#       If errors occur due to different scikit-learn versions on the client
#       and in-nodes, you can try switching your client's scikit-learn version
#       to match the in-nodes one by using the explicit command:
#       "pip install scikit-learn==<version>"

# For the classifier, specify the following parameters:
# ntree: n_estimators=500, mtry: max_features=5, nodesize: min_samples_leaf=1 (default; skipped)
#
classifier = RandomForestClassifier(n_estimators=500, max_features=5, random_state=0)
X = df_Train_ADS[predictor_columns]
y = df_Train_ADS["cc_acct_ind"]

# Train the Random Forest model to predict Credit Card account ownership based upon specified independent variables.
#
classifier = classifier.fit(X, y)

print("Model training complete.")
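
The version caveat in the note above can be checked with a few lines of code. This is a minimal sketch with hypothetical helper names; it compares only the major and minor components, which is an assumption about what level of mismatch matters, not a rule stated by scikit-learn or Teradata.

```python
def major_minor(version):
    """Return the (major, minor) components of a version string such as '0.22.2.post1'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def same_major_minor(client_version, server_version):
    """True when client and in-nodes versions share the same major.minor release."""
    return major_minor(client_version) == major_minor(server_version)


# Versions quoted in the note above: client 0.23.2, in-nodes 0.22.2.post1.
print(same_major_minor("0.23.2", "0.22.2.post1"))  # False
```

On the client, the installed version string can be obtained with `importlib.metadata.version("scikit-learn")` (Python 3.8 or later); the in-nodes version has to come from your DBA or from the add-ons package documentation.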

In [None]:
# Save the model into a file in the client folder specified by filePath.
# Note: In the following, we use both pickle (to serialize) and base64 (to
#       encode) on the model prior to saving it into a file. If the model is
#       only pickled, then unpickling in the Database might produce a pickle
#       AttributeError that claims an "X object has no attribute Y". This is
#       related to namespaces in client and target systems. See more info at:
#       https://docs.python.org/3/library/pickle.html#pickling-class-instances
#
filePath = "<your/path/to/folder/to/store/model/>"
modelFileName = "RFmodel_py.out"
classifierPkl = pickle.dumps(classifier)
classifierPklB64 = base64.b64encode(classifierPkl)
with open(filePath + modelFileName, 'wb') as fOut:            # Write in binary format
    fOut.write(classifierPklB64)

print("Model saved in file '" + filePath + modelFileName + "'.")
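
The scoring script has to reverse the two steps above before it can use the model. The following round-trip sketch shows the idea with a plain dictionary standing in for the trained classifier; the helper names are hypothetical and the actual scoring script is not shown in this TechByte.

```python
import base64
import pickle


def encode_model(model):
    """Serialize with pickle, then encode with base64 (same order as the cell above)."""
    return base64.b64encode(pickle.dumps(model))


def decode_model(encoded):
    """Reverse the steps: base64-decode first, then unpickle."""
    return pickle.loads(base64.b64decode(encoded))


# A dictionary stands in for the trained RandomForestClassifier.
stand_in_model = {"n_estimators": 500, "max_features": 5}
encoded = encode_model(stand_in_model)
restored = decode_model(encoded)
print(restored == stand_in_model)
```

Because the base64 output is plain ASCII, the same encoding can also be carried inside a delimited text column, which the micromodeling segment relies on.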

#### 1.2. Setting up the teradataml Script object and STO Sandbox

In [None]:
# Here is the path where we keep necessary files for this demo.
#
path_to_files = "<your/path/to/folder/with/input/files/>"
#
# Request to print the SQL submitted to the Advanced SQL Engine.
#
display.print_sqlmr_query = True

# Set SQL SEARCHUIFDBPATH to database where script-related files are installed.
#
con.execute("SET SESSION SEARCHUIFDBPATH = TRNG_TECHBYTES;") 

In [None]:
# Specify the teradataml DataFrame to use with the teradataml Script object.
#
td_Test_ADS = DataFrame("ak_TBv2_Test_ADS_Py")

# Recall that the testing subset table has an additional column with the sample
# ID. We drop it because our script does not account for the extra column.
td_Test_ADS = td_Test_ADS.drop(columns = "sampleid")

# To test the script in the STO Sandbox, we would like to have a sample of
# the test dataset handy for use in the STO Sandbox. To this end, we bring 500
# rows of the test dataset from the Database, and save them into a csv file.
# Note: Exclude the DataFrame index from the csv file; this prevents the index
#       from being assumed to be an additional data column in the csv file. 
#
df_Test_ADS_Sample = td_Test_ADS.to_pandas(num_rows = 500)
df_Test_ADS_Sample.to_csv(path_to_files + "stoSandboxTestData.csv", index = False)
df_Test_ADS_Sample.head(5)

In [None]:
# Define a teradataml Script object. We will use this object with calls
# to the STO Sandbox for testing and validation of the Python scoring script
# before the script is sent to the Database.
# Note: In the present use case, the Python script imports a model file.
#       Remember to adjust the relative model file location in the Python code.
#       In the STO Sandbox, the model is expected to be found in the current
#       Sandbox directory, which is also where the script is placed.
#
stoSB = Script(data = td_Test_ADS,
               script_name = "stoRFScoreSB.py",
               files_local_path = path_to_files, 
               script_command = "python3 ./TRNG_TECHBYTES/stoRFScoreSB.py",
               delimiter = ',',
               returns = { "ID": INTEGER(), "Prob_0": FLOAT(), "Prob_1": FLOAT(), "Actual": INTEGER() }
              )
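
The scoring script itself is not listed in this TechByte. As a rough orientation, a script run by the STO reads delimited rows from standard input and writes delimited rows to standard output. The sketch below illustrates that pattern under loud assumptions: the column layout (ID first, actual value last, predictors in between), the function name, and the stand-in model are all hypothetical.

```python
class StandInModel:
    """Hypothetical placeholder for the unpickled classifier."""

    def predict_proba(self, feature_rows):
        # Fixed probabilities, only to keep the sketch self-contained.
        return [[0.25, 0.75] for _ in feature_rows]


def score_lines(lines, model, delimiter=","):
    """Turn delimited input rows into 'ID,Prob_0,Prob_1,Actual' output rows."""
    scored = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip empty lines, e.g. when no data arrives.
        fields = line.split(delimiter)
        row_id, actual = fields[0], fields[-1]  # Assumed column positions.
        features = [float(value) for value in fields[1:-1]]
        prob_0, prob_1 = model.predict_proba([features])[0]
        scored.append(delimiter.join([row_id, str(prob_0), str(prob_1), actual]))
    return scored


# In an actual script, the lines would come from sys.stdin and each result
# would be printed to standard output.
result = score_lines(["101,1.0,2.0,1\n", "\n"], StandInModel())
print(result)
```

The output columns mirror the `returns` argument of the Script object above, which is why the order and types of the printed fields matter.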

In [None]:
# Set up the STO Sandbox with the Python STO Sandbox Docker image available
# at downloads.teradata.com. Currently, when you specify the sandbox by image
# location, you must also specify "sandbox_image_name", and its value must be
# "stosandbox:1.0".
#
sb_path = "<your/path/to/folder/where/sandbox/image/resides/>"
setup_sandbox_env(sandbox_image_location = sb_path + "sto_sandbox_Python3.7.7_sles12sp3.0.5.4_docker_image.1.0.0.tar.gz",
                  sandbox_image_name = "stosandbox:1.0")

# Show the ID of the Sandbox container that was set up.
#
configure.sandbox_container_id

#### 1.3. Script code testing in the STO Sandbox

In [None]:
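# Test the scoring script in the STO Sandbox, using the csv data sample as
# input and the saved model file as a supporting file.
#
testOut = stoSB.test_script(input_data_file = "stoSandboxTestData.csv",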
                            supporting_files = "RFmodel_py.out"
                           )
testOut.head(n = 5)

In [None]:
# Clean up the STO sandbox
#
cleanup_sandbox_env()

#### 1.4. Script code execution in the Database with the STO

In [None]:
# Define a different teradataml Script object. We will use this object with
# calls to the STO for execution of the Python scoring script in-Database.
# Note: In the present use case, the Python script imports a model file.
#       Remember to adjust the relative model file location in the Python code.
#       In the Database, the model file is expected to be found inside the
#       node directory that is named after the SEARCHUIFDBPATH.
# In general, there is no need to define different teradataml Script objects
# when the script code is identical for Sandbox and in-Database use.
#
sto = Script(data = td_Test_ADS,
             script_name = "stoRFScore.py",
             files_local_path = path_to_files, 
             script_command = "python3 ./TRNG_TECHBYTES/stoRFScore.py",
             delimiter = ',',
             returns = { "ID": INTEGER(), "Prob_0": FLOAT(), "Prob_1": FLOAT(), "Actual": INTEGER() }
            )
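
The difference in model file location between the Sandbox and the Database, described in the notes of both Script cells, can be isolated in a small helper inside the script. This is a sketch only; the function name is hypothetical, and it simply encodes the two locations stated in the notes.

```python
import posixpath


def model_file_path(file_name, db_name=None):
    """Relative path of the model file as seen by the running script.

    db_name=None  : Sandbox case, model sits in the current directory.
    db_name given : Database case, model sits in a directory named after
                    the SEARCHUIFDBPATH database.
    """
    if db_name is None:
        return posixpath.join(".", file_name)
    return posixpath.join(".", db_name, file_name)


print(model_file_path("RFmodel_py.out"))
print(model_file_path("RFmodel_py.out", db_name="TRNG_TECHBYTES"))
```

With the location kept in one place, the same script body could serve both environments, in line with the remark that separate Script objects are not needed when the code is identical.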

In [None]:
# If previous versions of files exist in the Database, remove them prior to
# installing the current versions you wish to use.
#
sto.remove_file(file_identifier='RFmodel_py', force_remove=True)
sto.remove_file(file_identifier='stoRFScore', force_remove=True)

# Install the script and the accompanying model file in the target Advanced SQL
# Engine Database. Remember to specify in your script code the correct path to
# the model file in the Database: Your script will be looking for the model
# file in a node directory named after the SEARCHUIFDBPATH.
#
sto.install_file(file_identifier='RFmodel_py', file_name='RFmodel_py.out', is_binary=True)
sto.install_file(file_identifier='stoRFScore', file_name='stoRFScore.py', is_binary=False)

In [None]:
# Execute the script in-Database with the SCRIPT Table Operator.
#
sto.execute_script()

### 2. Micromodeling: Scaled, In-Database training and scoring of multiple models
When you need to train a different model for each value of a feature and then score corresponding data, Vantage and teradataml can help you scale the entire operation by training and scoring multiple models in parallel in the Advanced SQL Engine.

In [None]:
# Here is the path where we keep necessary files for this demo.
#
path_to_files = "<your/path/to/folder/with/input/files/>"
#
# Request to print the SQL submitted to the Advanced SQL Engine.
#
display.print_sqlmr_query = True

# Set SQL SEARCHUIFDBPATH to database where script-related files are installed.
#
con.execute("SET SESSION SEARCHUIFDBPATH = TRNG_TECHBYTES;") 

#### 2.1. Models training with the STO

In [None]:
# Step 1: Model training
# 
# Start with the training subset of the ADS and read it from the table.
# The conversion to a pandas DataFrame below only gives a glimpse at the data;
# the training itself runs in the Database.
#
td_Train_ADS = DataFrame("ak_TBv2_Train_ADS_Py")

# Recall that the training subset table has an additional column of the sample
# ID. We drop it because our script does not account for the extra column.
td_Train_ADS = td_Train_ADS.drop(columns = "sampleid")
td_Train_ADS.to_pandas().head(5)

In [None]:
# Determine the teradataml Script object for the training segment. This object
# will be used to call the SCRIPT Table Operator in the Database. Specify that
# we want to run the training script on data that should be partitioned by the
# state code variable.
# Note: In the present implementation, the output column names are used by the
#       scoring script; handle naming carefully to maintain consistency with
#       code and avoid errors during execution.
#
stoTr = Script(data = td_Train_ADS,
               script_name = "stoRFFitMM.py",
               files_local_path = path_to_files, 
               script_command = "python3 ./TRNG_TECHBYTES/stoRFFitMM.py",
               data_partition_column = "state_code",
               delimiter = ',',
               returns = { "State_Code": VARCHAR(10), "Model": CLOB() }
              )
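
The training script is not listed here either. Conceptually, each invocation receives the rows of one state code partition, fits one model, and emits a single output row with the state code and the encoded model. The sketch below shows that shape with a trivial stand-in "model" (a mean value); the function name and the two-column input layout are hypothetical.

```python
import base64
import pickle


def fit_partition(lines, delimiter=","):
    """Fit a stand-in model on one partition and return 'State_Code,Model'."""
    rows = [line.strip().split(delimiter) for line in lines if line.strip()]
    if not rows:
        return None  # Nothing to train on when no rows arrive.
    state_code = rows[0][0]  # Assumed: the partition column comes first.
    values = [float(row[1]) for row in rows]
    model = {"mean": sum(values) / len(values)}  # Stand-in for a classifier.
    # The base64 alphabet contains no commas, so the encoded model is safe
    # to emit inside a comma-delimited output row.
    model_b64 = base64.b64encode(pickle.dumps(model)).decode("ascii")
    return delimiter.join([state_code, model_b64])


output_row = fit_partition(["NY,1.0\n", "NY,3.0\n"])
state, encoded_model = output_row.split(",")
print(state, pickle.loads(base64.b64decode(encoded_model)))
```

The two output fields correspond to the `State_Code` and `Model` entries of the `returns` argument above.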

In [None]:
# If previous versions of files exist in the Database, remove them prior to
# installing the current versions you wish to use.
#
stoTr.remove_file(file_identifier='stoRFFitMM', force_remove=True)

# Install the training script in the target Advanced SQL Engine Database.
# No model file accompanies the script at this stage.
#
stoTr.install_file(file_identifier='stoRFFitMM', file_name='stoRFFitMM.py', is_binary=False)

In [None]:
# Execute the script in-Database with the SCRIPT Table Operator.
#
trainOutObj = stoTr.execute_script()
print("STO call complete.")

In [None]:
# The trained models are now pointed to by the trainOutObj object.
#
# In the present illustration, we show how to use multiple trained models by
# bringing them locally to the client and then uploading them as a file to
# the Database. A different approach is to store the trained models directly
# into a table in the Database. The latter approach is exhibited in the
# section "Using DataFrame.map_partition() Function for GLM Model Fitting and
# Scoring Functions" of the teradataml User Guide at docs.teradata.com.
#
# Save the trained models in a local csv file. The scoring script will need
# this file to select and use the appropriate model with the corresponding 
# state code partition of test data on each Database AMP.
#
# To save the models into a file on the client, we first bring them locally
# by converting the teradataml DataFrame to a pandas DataFrame.
#
multipleModels = trainOutObj.to_pandas()
multipleModels.head()

In [None]:
# Then we simply export the pandas DataFrame to a csv file.
# Note: Exclude the DataFrame index from the csv file; this prevents the index
#       from being assumed by the Database to be an additional data column in
#       the csv file. 
#
multipleModels.to_csv(path_to_files + 'multipleModels_py.csv', index = False)
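
On the scoring side, the script has to pick the model that matches the state code of its partition. A minimal sketch of reading such a csv file into a lookup table follows. It assumes the file carries a header row with the column names `State_Code` and `Model` taken from the `returns` argument of the training Script object, and that each model was pickled and base64-encoded; the function name is hypothetical.

```python
import base64
import csv
import io
import pickle

# Encoded models can exceed the csv module's default field size limit.
csv.field_size_limit(2**31 - 1)


def load_models(csv_file):
    """Map each state code to its decoded model."""
    reader = csv.DictReader(csv_file)
    return {
        row["State_Code"]: pickle.loads(base64.b64decode(row["Model"]))
        for row in reader
    }


# Build a tiny in-memory file with two stand-in models for demonstration.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["State_Code", "Model"])
for code, mean in [("NY", 2.0), ("CA", 5.0)]:
    payload = base64.b64encode(pickle.dumps({"mean": mean})).decode("ascii")
    writer.writerow([code, payload])
buffer.seek(0)

models = load_models(buffer)
print(models["CA"])
```

In the scoring script, the lookup key would be the state code found in the rows of the partition being processed.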

#### 2.2. Models scoring with the STO

In [None]:
# Step 2: Scoring with Models
# 
# Specify the teradataml DataFrame to use with the teradataml Script object.
#
td_Test_ADS = DataFrame("ak_TBv2_Test_ADS_Py")

# Recall that the testing subset table has an additional column with the sample
# ID. We drop it because our script does not account for the extra column.
#
td_Test_ADS = td_Test_ADS.drop(columns = "sampleid")
td_Test_ADS.to_pandas().head(5)

In [None]:
# Determine the teradataml Script object. This object will be used with calls
# to the SCRIPT Table Operator in the Database, as well as with STO script 
# testing in the STO Sandbox.
# Note: In the present use case, the Python script imports a model file.
#       Remember to adjust the relative model file location in the Python code.
#       In the Database, the model file is expected to be found inside the
#       node directory that is named after the SEARCHUIFDBPATH.
#
stoSc = Script(data = td_Test_ADS,
               script_name = "stoRFScoreMM.py",
               files_local_path = path_to_files, 
               script_command = "python3 ./TRNG_TECHBYTES/stoRFScoreMM.py",
               data_partition_column = "state_code",
               delimiter = ',',
               returns = { "State_Code": VARCHAR(10), "Cust_ID": INTEGER(), 
                           "Prob_0": FLOAT(), "Prob_1": FLOAT(), "Actual": INTEGER() }
              )

In [None]:
# If previous versions of files exist in the Database, remove them prior to
# installing the current versions you wish to use.
#
#stoSc.remove_file(file_identifier='multipleModels_py', force_remove=True)
stoSc.remove_file(file_identifier='stoRFScoreMM', force_remove=True)

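# Install the script and the accompanying models file in the target Advanced
# SQL Engine Database. Remember to specify in your script the correct path to
# the models file in the Database: Your script needs to look for the models
# file in a directory named after the SEARCHUIFDBPATH.
# Note: The install_file() statement for the models file is commented out
#       below; uncomment it if the file is not yet installed in the Database.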
#
#stoSc.install_file(file_identifier='multipleModels_py', file_name='multipleModels_py.csv', is_binary=False)
stoSc.install_file(file_identifier='stoRFScoreMM', file_name='stoRFScoreMM.py', is_binary=False)

In [None]:
# Execute the script in-Database with the SCRIPT Table Operator.
#
stoSc.execute_script()

### 3. Map functions
A different way to execute row-based or partition-based operations in the Advanced SQL Engine is via the teradataml DataFrame map_row() and map_partition() methods, respectively. In essence, these methods streamline the set-up of and calls to the STO in the background, thus making interaction with the STO transparent to the Python user.

**Caution:** For map_row() and map_partition() to work, teradataml requires the Python _dill_ add-on library version to be the same on both the client and the target Advanced SQL Engine.

In the following segment, we present brief examples of the map_row() and map_partition() methods. An additional micromodeling use case example with map_partition() can be found in the section "Using DataFrame.map_partition() Function for GLM Model Fitting and Scoring Functions" of the teradataml User Guide at docs.teradata.com. 

In [None]:
from teradataml import load_example_data
from collections import OrderedDict

# This example uses the 'admissions_train' dataset, and calculates the average
# 'gpa' per partition based on the value in 'admitted' column.
#
# Load the example data. Observe that the load_example_data() function creates
# internal/temp tables, and for this reason it places them in the teradataml 
# context "temp_database_name" database, which presently is "<Database_Name>".
# However, the teradataml DataFrame() function looks by default in the
# teradataml context "database". Therefore, to create a teradataml DataFrame
# in this case, the DataFrame() function must be pointed explicitly to look
# into the "<Database_Name>" database, as shown in the following.
#  
load_example_data("dataframe", "admissions_train")
dfMap = DataFrame(in_schema('<Database_Name>', 'admissions_train'))
dfMap.to_pandas().head(5)

In [None]:
# A user defined function to increase the 'gpa' by a specified percentage.
# Both the function input and output are pandas Series objects.
#
def increase_gpa(row, p = 20):
    row['gpa'] = row['gpa'] + row['gpa'] * p/100
    return row

# Apply the user defined function to the teradataml DataFrame. The user
# defined function outputs the same columns with the same types as the
# input, hence the 'returns' argument of map_row() can be skipped.
#
increase_gpa_10 = dfMap.map_row(lambda row: increase_gpa(row, p = 10))
#
# Note: map_row() can also be called with only the user-defined function as an
#       argument, in which case the default p = 20 applies. Alternatively, it
#       can be invoked with partial notation, too:
#       increase_gpa_20 = dfMap.map_row(increase_gpa)
#       from functools import partial
#       increase_gpa_50 = dfMap.map_row(partial(increase_gpa, p = 50))
#
increase_gpa_10.to_pandas().head(5)
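
The calling styles mentioned in the note above are interchangeable ways to fix the percentage argument. The sketch below checks this locally, with a plain dictionary standing in for the pandas Series row; it runs on the client only and does not involve the Database.

```python
from functools import partial


def increase_gpa(row, p=20):
    """Same logic as in the cell above; works on any mapping with a 'gpa' key."""
    row['gpa'] = row['gpa'] + row['gpa'] * p / 100
    return row


# Each call gets a fresh stand-in row because the function modifies its input.
by_default = increase_gpa({"id": 1, "gpa": 2.5})
by_lambda = (lambda row: increase_gpa(row, p=50))({"id": 1, "gpa": 2.5})
by_partial = partial(increase_gpa, p=50)({"id": 1, "gpa": 2.5})

print(by_default["gpa"], by_lambda["gpa"], by_partial["gpa"])  # 3.0 3.75 3.75
```

Checking the function on a few hand-made rows like this is a quick way to catch logic errors before the function is sent to the Database.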

In [None]:
# A user defined function to calculate the average 'gpa', by reading data at
# once into a pandas DataFrame. The function accepts a TextFileReader object
# for data iteration in chunks. The function returns a pandas Series.
#
def grouped_gpa_avg(rows):
    pdf = rows.read()
    if pdf.shape[0] > 0:
        return pdf[['admitted', 'gpa']].mean()

# Apply the user defined function to the DataFrame.
#
avg_gpa_pdf = dfMap.map_partition(
                        grouped_gpa_avg,
                        returns = OrderedDict([('admitted', INTEGER()), ('avg_gpa', FLOAT())]),
                        data_partition_column = 'admitted'
                                 )
avg_gpa_pdf.to_pandas()
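
To picture what a partition-based operation does, the following conceptual sketch groups rows by a key and calls a function once per group. It is an emulation in plain Python under stated assumptions: the helper names are hypothetical, rows are dictionaries rather than a TextFileReader, and the actual map_partition() call runs in parallel in the Database.

```python
from collections import defaultdict


def emulate_map_partition(rows, func, partition_key):
    """Group rows by partition_key and apply func once per group."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)
    return [func(part) for _, part in sorted(partitions.items())]


def avg_gpa(partition_rows):
    """Average 'gpa' over the rows of one partition."""
    gpas = [row["gpa"] for row in partition_rows]
    return {"admitted": partition_rows[0]["admitted"],
            "avg_gpa": sum(gpas) / len(gpas)}


sample_rows = [
    {"admitted": 0, "gpa": 2.0},
    {"admitted": 1, "gpa": 3.5},
    {"admitted": 0, "gpa": 3.0},
    {"admitted": 1, "gpa": 4.0},
]
averages = emulate_map_partition(sample_rows, avg_gpa, "admitted")
print(averages)
```

Each returned record has one value per entry of the `returns` argument, which matches the one-row-per-partition result of the cell above.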

### End of session

In [None]:
# Remove the context of the present teradataml session and close the Database
# connection. It is recommended to call the remove_context() function for
# session cleanup. Temporary objects are removed at the end of the session.
#
remove_context()