<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Prerequisites for Demonstration
    </p>
</header>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Python Packages</b>. Depending on the environment, additional packages may be needed.  At a minimum, this demo requires the following.  Install them manually or execute the code cells that follow.</p>
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Install teradataml >= 17.20.0.2</li>
    <li>Install jdk4py and xgboost</li>
    </ol>

In [None]:
# Install other required packages
!pip install xgboost jdk4py

In [None]:
# Force the install of the proper teradataml library in case the standard library does not work
# Note: This should not be necessary but just in case
!pip install -U teradataml
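
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To confirm that the minimum version listed above is met, the installed versions can be inspected from Python.  The cell below is a minimal sketch using only the standard library; the helper names (<code>version_tuple</code>, <code>meets_minimum</code>, <code>describe_package</code>) are illustrative and not part of teradataml, and the version comparison is deliberately simple (numeric parts only).</p>

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '17.20.0.2' into (17, 20, 0, 2); non-numeric suffixes are ignored."""
    parts = []
    for piece in version.split('.'):
        match = re.match(r'\d+', piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two dotted version strings numerically."""
    return version_tuple(installed) >= version_tuple(minimum)


def describe_package(name, minimum=None):
    """Report the installed version of a package, or that it is missing."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return f'{name}: not installed'
    if minimum and not meets_minimum(installed, minimum):
        return f'{name}: {installed} (older than {minimum})'
    return f'{name}: {installed}'


# Packages and minimum version taken from the list above
for name, minimum in [('teradataml', '17.20.0.2'), ('xgboost', None), ('jdk4py', None)]:
    print(describe_package(name, minimum))
```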

<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Leveraging ClearScape Analytics to Predict Customer Behavior
  <br>
       <img id="teradata-logo" src="../../images/TeradataLogo.png" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>
<hr>

<br>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>How to leverage VantageCloud Lake to generate powerful predictive models by combining disparate data sets, analyzing the customer journey, and utilizing open source tools and technologies.</b></p>

<br>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Practitioners can combine data from a myriad of sources, and apply both scalable native functions and open-source tools and techniques to create powerful predictive models.  This process can be done at any scale with minimal data movement and maximum efficiency.  These models can then be deployed operationally to run with enterprise performance/SLA conformance and concurrency.</p>

<hr>

<b style = 'font-size:28px;font-family:Arial;color:#00233C'>Environment and Demo Overview</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This demonstration utilizes a VantageCloud Lake <b>Analytic Cluster</b> architecture, using several data sets sourced from various locations:</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><b>Block Filesystem</b> Tables which contain Customer Dimension data</li>
    <li><b>Object Filesystem</b> Tables which contain user comment history</li>
    <li><b>Object Filesystem</b> Tables which contain user activity history</li>
    </ol>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The high level process is as follows:</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr><td style = 'vertical-align:top' width = '40%'>
            <ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
                <li>The Data Scientist connects to VantageCloud Lake through the teradataml client library and conducts analytics activities using familiar tools and syntax to:
                 <ul><li>Explore Data</li>
                     <li>Engineer Features and connect them together</li>
                     <li>Train and evaluate a model with tools of choice</li>
                    </ul></li>
                <br>
                <li>Connect to the Custom Runtime Environment, or create a new one if needed</li>
                <br>
                <li>Execute the production pipeline that will:
                    <ul>
                        <li>Deploy the model to a Custom Runtime Container</li>
                        <li>Pass prepared data to the Python container running in parallel on cluster nodes</li>
                        <li>Return results (inference/predictions) as "virtual" DataFrames, where the data resides in Vantage</li>
                        <li>Allow data to be persisted in the Object Filesystem, written to open object storage, or copied to the client</li>
                    </ul></li>
            </ol>
        </td><td><img src = 'images/OAF_Overview.png' width = '600'></td></tr>
</table>



<hr>

<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Python Package Imports</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>It is standard practice to import required packages and libraries up front.  Execute this cell to import packages for Teradata automation as well as machine learning, analytics, utility, and data management.</p> 

In [None]:
import warnings
warnings.filterwarnings('ignore')


import pickle as pkl
import datetime as dt
import json, os, jdk4py

from teradataml import *
from teradatasqlalchemy import types

from IPython.display import display as ipydisplay
from IPython.display import clear_output 
from time import sleep
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay

import matplotlib.pyplot as plt
%matplotlib inline

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.offline as offline
from matplotlib import cm
import plotly.express as px
import plotly.graph_objects as go

#
# Account for the proper version of urllib for parsing
#
try:
    # Python 3.x: this should be the standard path
    from urllib.parse import urlparse
except ImportError:
    # Python 2.x fallback: should not be needed
    from urlparse import urlparse

offline.init_notebook_mode()
from teradataml import display
display.suppress_vantage_runtime_warnings = True

configure.val_install_location = 'TD_VAL'
configure.byom_install_location = 'TD_MLDB'

In [None]:
# Function definition #
# This can be hidden during the demo by clicking
# on the vertical bar on the left of the editor window

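# Convert Teradata nPath output to a plotly Sankey figure
# Node names carry a three-digit position prefix that is stripped for display;
# maxPath = 3 matches the three-event paths produced by the nPath call below

from collections import defaultdict

def display_sankey(npath_pandas):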
    
    dataDict = defaultdict(int)
    eventDict = defaultdict(int)
    maxPath = 3


    for index, row in npath_pandas.iterrows():
        rowList = row['path'].replace('[','').replace(']','').split(',')
        pathCnt = 3
        pathLen = len(rowList)
        for i in range(len(rowList)-1):
            leftValue = str(100 + i + maxPath - pathLen) + rowList[i].strip()
            rightValue = str(100 + i + 1 + maxPath - pathLen) + rowList[i+1].strip()
            valuePair = leftValue + '+' + rightValue
            dataDict[valuePair] += pathCnt
            eventDict[leftValue] += 1
            eventDict[rightValue] += 1

    eventList = []
    for key,val in eventDict.items():
        eventList.append(key)

    sortedEventList = sorted(eventList)
    sankeyLabel = []
    for event in sortedEventList:
        sankeyLabel.append(event[3:])

    sankeySource = []
    sankeyTarget = []
    sankeyValue = []

    for key,val in dataDict.items():
        sankeySource.append(sortedEventList.index(key.split('+')[0]))
        sankeyTarget.append(sortedEventList.index(key.split('+')[1]))
        sankeyValue.append(val)

    sankeyColor = []
    for i in sankeyLabel:
        sankeyColor.append('blue')

    sankeyChart = dict(
        type='sankey',
        node = dict(
          pad = 15,
          thickness = 20,
          line = dict(
            color = 'black',
            width = 0.5
          ),
          label = sankeyLabel,
          color = sankeyColor
        ),
        link = dict(
            source = sankeySource,
            target = sankeyTarget,
            value = sankeyValue
        )
      )
    layout =  dict(
        title = "Paths to Cancel",
        font = dict(
          size = 10
        )
    )


    fig = dict(data=[sankeyChart], layout=layout)
    return fig
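
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The core of the function above is the step that turns each path string into position-tagged links and counts them.  The sketch below isolates that idea in plain Python so it can be tried without a database connection.  It is a simplified illustration, not a drop-in replacement: the name <code>count_links</code> is hypothetical, each path adds 1 to a link (the function above adds a fixed weight of 3), and every event name other than "Mem Cancel" is made up.</p>

```python
from collections import defaultdict


def count_links(paths, max_path=3):
    """Count (position, event) -> (position, event) links across path strings.

    Paths shorter than max_path are right-aligned, so the final event of
    every path lands at the same position.
    """
    links = defaultdict(int)
    for path in paths:
        events = [event.strip() for event in path.strip('[]').split(',')]
        offset = max_path - len(events)
        for i in range(len(events) - 1):
            source = (i + offset, events[i])
            target = (i + 1 + offset, events[i + 1])
            links[(source, target)] += 1
    return dict(links)


# Path strings in the bracketed, comma-separated form parsed by display_sankey
example_paths = [
    '[Browse, Call, Mem Cancel]',
    '[Browse, Call, Mem Cancel]',
    '[Call, Mem Cancel]',
]

for (source, target), weight in sorted(count_links(example_paths).items()):
    print(source, '->', target, 'weight', weight)
```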

In [None]:
# load vars json
with open('../../vars.json', 'r') as f:
    session_vars = json.load(f)

# Create the SQLAlchemy Context
host = session_vars['environment']['host']
username = session_vars['hierarchy']['users']['business_users'][1]['username']
password = session_vars['hierarchy']['users']['business_users'][1]['password']

# This URL comes from the environment panel in console; provided here in the environment
# object in vars.json
ues_url = session_vars['environment']['UES_URI']

eng = create_context(host=host, username=username, password=password)

# confirm connection
print(eng)
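
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The cell above assumes a particular layout for <code>vars.json</code>.  The sketch below shows only the keys that this notebook reads; the real file may contain more, and every value here is a placeholder.  The helper name <code>read_connection_settings</code> is illustrative.</p>

```python
import json

# Placeholder content: only the keys read by this notebook are shown
example_vars = json.loads('''
{
  "environment": {"host": "HOST_PLACEHOLDER", "UES_URI": "UES_URI_PLACEHOLDER"},
  "hierarchy": {"users": {"business_users": [
      {"username": "USER_0", "password": "PASSWORD_0"},
      {"username": "USER_1", "password": "PASSWORD_1"}
  ]}}
}
''')


def read_connection_settings(session_vars, user_index=1):
    """Collect the values the notebook passes to create_context and set_auth_token."""
    user = session_vars['hierarchy']['users']['business_users'][user_index]
    return {
        'host': session_vars['environment']['host'],
        'username': user['username'],
        'password': user['password'],
        'ues_url': session_vars['environment']['UES_URI'],
    }


settings = read_connection_settings(example_vars)
print(settings['host'], settings['username'])
```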

<hr>
<p style = 'font-size:24px;font-family:Arial;color:#00233C'><b>Demo 1 - Feature Engineering at Scale</b></p>

<table style = 'width:100%;table-layout:fixed;'>
<tr>
    <td style = 'vertical-align:top' width = '50%'>
        <p style = 'font-size:16px;font-family:Arial;color:#00233C'>Leverage Vantage for fast, efficient data transformations to create reusable features.</p>
        <ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
            <li>Connect to data through Vantage - Local/Remote/Data Lake</li>
            <br>
            <li>Feature Engineering - Customer Journey</li>
            <br>
            <li>Feature Engineering - Sentiment Analysis</li>
            <br>
            <li>Create Training and Testing data sets</li>
        </ol>
    </td><td width = '20%'></td>
    <td><img src = 'images/thumbsupdown.png' width = '200'></td>
</tr>
</table>

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>1.1 - Inspect the Data</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Create Three Virtual DataFrames:</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Customer Activity from OFS Storage</li>
    <li>Customer Comments from OFS Storage</li>
    <li>Customer PII from BFS Storage</li>
    </ol>

In [None]:
# Activity History:
tdf_customer_history_OFS = DataFrame('"demo_ofs"."retail_compnew"')

print('Activity History:')
ipydisplay(tdf_customer_history_OFS.head(3))

# Comment History
tdf_customer_comments_OFS = DataFrame('"demo_ofs"."web_comment"')
print('Comment History:')
ipydisplay(tdf_customer_comments_OFS.head(3))


# Customer Dimension - use VALIDTIME
tdf_customer_info_BFS = DataFrame.from_query('current validtime select * from demo.retail_customer')
print('Customer Information:')
ipydisplay(tdf_customer_info_BFS.head(10))

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>1.2 - Customer Journey</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Create predictive features based on user behavior over time:</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Sessionize based on time (daily)</li>
    <li>Create "Prior Three" events table</li>
    <li>Encode the events as numeric columns</li>
    </ol>
    
<hr>

<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Sessionize and NPath</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Unique Analytic capabilities to perform advanced windowing and pattern matching.</p> 

In [None]:
# Sessionize will take time-ordered rows and assign user sessions.

td_sessionize_out = Sessionize(data = tdf_customer_history_OFS,
                               data_partition_column = ["CUSTOMER_ID"],
                               data_order_column = ["DATESTAMP"],
                               time_column = "DATESTAMP",
                               time_out = 86400.0 # 24 HOURS IS THE SESSIONIZATION TIME 
                              )
            
tdf_sessionized = td_sessionize_out.result.assign(drop_columns=True, 
                             customer_id = td_sessionize_out.result['CUSTOMER_ID'], 
                             datestamp = td_sessionize_out.result['DATESTAMP'], 
                             event = td_sessionize_out.result['EVENT'],
                             session_id = td_sessionize_out.result['SESSIONID'])

print('Sessionized Data:')
ipydisplay(tdf_sessionized.head(5))

# Use NPath to create a "path" of user events
# Then look 'backwards' to see events that led up to the final one 
# Say - cancellation

from teradataml import NPath

tdf_path = NPath(data1 = tdf_sessionized, 
           mode = 'NONOVERLAPPING', 
           data1_partition_column = ['customer_id', 'session_id'], 
           data1_order_column = ['datestamp'], 
           symbols = ['True as A'], 
           pattern = '(A){3}$', 
           result = ['FIRST (customer_id OF A) AS customer_id',
                     'FIRST (session_id OF A) AS session_id',
                     'NTH (event, 1 OF A) AS prior_2_event',
                     'NTH (event, 2 OF A) as prior_1_event',
                     'LAST (event OF A) as final_event', 
                     'ACCUMULATE(event OF ANY(A)) AS PATH']).result

print('Path Data:')
ipydisplay(tdf_path.head(5))
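
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For intuition, the two steps above can be mimicked on a handful of rows in plain Python.  This is a rough sketch of the logic only, under stated assumptions: a new session starts when the gap between consecutive events for a customer exceeds the timeout, session numbering starts at 0 here by choice, and only sessions with at least three events yield a "prior three" path.  The function names and all event names other than "Mem Cancel" are illustrative; the in-database functions do this work in parallel at scale.</p>

```python
from datetime import datetime, timedelta


def sessionize(events, timeout_seconds=86400):
    """Assign a per-customer session number to (customer_id, timestamp, event) rows."""
    last_seen = {}
    session = {}
    out = []
    for customer_id, ts, event in sorted(events, key=lambda row: (row[0], row[1])):
        previous = last_seen.get(customer_id)
        if previous is None:
            session[customer_id] = 0
        elif (ts - previous).total_seconds() > timeout_seconds:
            session[customer_id] += 1
        last_seen[customer_id] = ts
        out.append((customer_id, session[customer_id], ts, event))
    return out


def last_three_events(sessionized):
    """Keep the final three events of each session that has at least three."""
    groups = {}
    for customer_id, session_id, _, event in sessionized:
        groups.setdefault((customer_id, session_id), []).append(event)
    return {key: evts[-3:] for key, evts in groups.items() if len(evts) >= 3}


start = datetime(2023, 1, 1, 9, 0, 0)
events = [
    (1, start, 'Browse'),
    (1, start + timedelta(hours=1), 'Call'),
    (1, start + timedelta(hours=2), 'Complaint'),
    (1, start + timedelta(hours=3), 'Mem Cancel'),
    (1, start + timedelta(days=3), 'Browse'),
    (2, start, 'Browse'),
    (2, start + timedelta(minutes=30), 'Purchase'),
]

sessionized = sessionize(events)
paths = last_three_events(sessionized)
for key, path in paths.items():
    print(key, path)
```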

In [None]:
# Visualize the paths leading to Cancel

iplot(display_sankey(tdf_path[tdf_path['final_event'] == 'Mem Cancel'].to_pandas(all_rows = True)), validate = False)

<hr>

<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Vantage Analytic Library</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Whole-data set statistical analysis and data transformation functions.  One-hot Encode the event columns, and binary encode the final event "Cancel" vs. "Not Cancel"</p> 

In [None]:
# Create Transformation and column retention object(s)

import teradataml.analytics.Transformations as tdtf


event_1_list = tdf_path.groupby('prior_1_event').count().to_pandas()['prior_1_event'].to_list()
event_2_list = tdf_path.groupby('prior_2_event').count().to_pandas()['prior_2_event'].to_list()


event_1_tf = tdtf.OneHotEncoder(values = [x for x in event_1_list if x], columns = 'prior_1_event')
event_2_tf = tdtf.OneHotEncoder(values = [x for x in event_2_list if x], columns = 'prior_2_event')

rt = Retain(columns = ['final_event'])

# Execute the Transformation on the whole dataset

tdf_path_encoded = valib.Transform(data = tdf_path,
                                   one_hot_encode = [event_1_tf, event_2_tf],
                                   retain = [rt],
                                   index_columns = ['customer_id', 'session_id']).result

tdf_path_encoded = tdf_path_encoded.assign(b_cancel = tdf_path_encoded['final_event'].str.contains('Mem Cancel')).drop('final_event', axis = 1)
tdf_path_encoded.head(5)
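
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The encoding performed above can be illustrated on two rows in plain Python.  This is a sketch of the idea only: the output column names (<code>column=value</code>) are a convention chosen for this example and are not the names produced by the Vantage Analytic Library, and the event names other than "Mem Cancel" are made up.</p>

```python
def distinct_values(rows, column):
    """Distinct non-empty values of a column, mirroring the groupby used above."""
    return sorted({row[column] for row in rows if row[column]})


def encode_row(row, categories_by_column):
    """One-hot encode the prior events and flag whether the final event is a cancel."""
    encoded = {}
    for column, categories in categories_by_column.items():
        for category in categories:
            encoded[f'{column}={category}'] = int(row[column] == category)
    encoded['b_cancel'] = int('Mem Cancel' in row['final_event'])
    return encoded


rows = [
    {'prior_2_event': 'Browse', 'prior_1_event': 'Call', 'final_event': 'Mem Cancel'},
    {'prior_2_event': 'Browse', 'prior_1_event': 'Purchase', 'final_event': 'Browse'},
]

categories = {
    'prior_1_event': distinct_values(rows, 'prior_1_event'),
    'prior_2_event': distinct_values(rows, 'prior_2_event'),
}

encoded_rows = [encode_row(row, categories) for row in rows]
for encoded in encoded_rows:
    print(encoded)
```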

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>1.3 - Derive Sentiment Features</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Analyze Customer Comments for sentiment, and create features based on polarity:</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Extract Sentiment</li>
    <li>Encode Polarity values (NEG, NEU, POS)</li>
    <li>Aggregate and rescale polarity values</li>
    </ol>

In [None]:
from teradataml import SentimentExtractor, ConvertTo, OneHotEncodingFit, OneHotEncodingTransform, ScaleFit, ScaleTransform

print('1. Original Data Set:')
ipydisplay(tdf_customer_comments_OFS.head(2))

sentiment_res = SentimentExtractor(data = tdf_customer_comments_OFS, 
                                   text_column = 'comment_text', 
                                   accumulate = ['customer_id','comment_summary']).result

print('2. Sentiment:')
ipydisplay(sentiment_res.head(2))

# Perform native one-hot encoding on the data
# These functions use a "fit-and-transform" pattern
# that supports reuse and easier operationalization of the transformation process

res_ohe = OneHotEncodingFit(data = sentiment_res, 
                            target_column = 'polarity', 
                            categorical_values = ['POS', 'NEG', 'NEU'], 
                            other_column = 'other',
                            is_input_dense = True)

tdf_sentiment_transformed = OneHotEncodingTransform(data = sentiment_res, object = res_ohe.result, is_input_dense = True).result

print('3. One-Hot Encoding:')
ipydisplay(tdf_sentiment_transformed.head(2))

#Aggregate the polarity scores:

tdf_gb = tdf_sentiment_transformed.groupby('customer_id').sum().drop(['sum_polarity_other', 'sum_sentiment_score'], axis = 1)

print('4. Aggregate Sentiment Polarity:')
ipydisplay(tdf_gb.head(2))


# Convert data types and rescale
tdf_gb_cv = ConvertTo(data = tdf_gb, 
                      target_columns = ['0:3'], 
                      target_datatype = 'integer').result


sf_fit = ScaleFit(data = tdf_gb_cv, 
                  scale_method = 'RESCALE (lb=0, ub=1)',
                  target_columns = ['1:3'])

tdf_rescaled_sentiment = ScaleTransform(data = tdf_gb_cv,
                                      object = sf_fit.output, 
                                      accumulate = ['customer_id']).result


tdf_rescaled_sentiment = tdf_rescaled_sentiment.assign(customer_id = tdf_rescaled_sentiment['customer_id'] + 1526)
res = copy_to_sql(tdf_rescaled_sentiment, table_name = 'agg_sentiment', schema_name = 'demo_ofs', if_exists = 'replace')

print('5. Rescaled:')
ipydisplay(tdf_rescaled_sentiment.head(5))
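
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The aggregate-and-rescale steps can be sketched in plain Python.  The example assumes that rescaling to a lower bound of 0 and an upper bound of 1 is ordinary min-max scaling; the function names are illustrative, the data is made up, and the handling of a constant column (everything mapped to the lower bound) is a choice made for this sketch rather than documented behavior of the in-database function.</p>

```python
from collections import Counter, defaultdict


def aggregate_polarity(rows):
    """Count POS / NEG / NEU comments per customer from (customer_id, polarity) rows."""
    totals = defaultdict(Counter)
    for customer_id, polarity in rows:
        totals[customer_id][polarity] += 1
    return {cid: {p: counts[p] for p in ('POS', 'NEG', 'NEU')}
            for cid, counts in totals.items()}


def rescale(values, lb=0.0, ub=1.0):
    """Min-max scale a list of numbers into the range [lb, ub]."""
    low, high = min(values), max(values)
    if high == low:
        return [lb for _ in values]
    return [lb + (value - low) * (ub - lb) / (high - low) for value in values]


comments = [
    (1, 'POS'), (1, 'POS'), (1, 'NEG'),
    (2, 'NEU'),
    (3, 'POS'), (3, 'NEG'), (3, 'NEG'),
]

totals = aggregate_polarity(comments)
customer_ids = sorted(totals)
scaled_pos = rescale([totals[cid]['POS'] for cid in customer_ids])
for cid, value in zip(customer_ids, scaled_pos):
    print(cid, totals[cid], 'scaled POS:', value)
```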

<hr>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>1.4 - Create Final Data Set</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Join features into a single analytic set:</p>

<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>Join Gender info from Customer Table </li>
    <li>Join Sentiment Polarity DataFrame</li>
    <li>Remove NULLs</li>
    <li>Train/Test Split</li>
    </ol>

In [None]:
tdf_joined = tdf_path_encoded.join(tdf_customer_info_BFS[['CUSTOMER_ID','GENDER']], on=['customer_id'], lsuffix='l', rsuffix='cust', how='inner')

In [None]:
tdf_joined

In [None]:
# Join the Gender category from the Customer table (BFS)
tdf_joined = tdf_path_encoded.join(tdf_customer_info_BFS[['CUSTOMER_ID','GENDER']], on=['customer_id'], lsuffix='l', rsuffix='cust', how='inner')
tdf_joined = tdf_joined.assign(b_gender = tdf_joined['GENDER'].str.contains('M'), 
                               customer_id = tdf_joined['customer_id_cust']).drop(['GENDER', 'CUSTOMER_ID_l', 'customer_id_cust'], axis = 1)
print('1. Gender Flag Joined to Events:')
ipydisplay(tdf_joined.head(5))

# Join the Sentiment Polarity
# Not all customers left comments, so the resulting NULLs need to be filled

tdf_analytic_set = tdf_joined.join(tdf_rescaled_sentiment, on = ['customer_id'], lsuffix = 'l', rsuffix = 'r')
tdf_analytic_set = tdf_analytic_set.assign(customer_id = tdf_analytic_set['customer_id_l']).drop(['customer_id_l', 'customer_id_r'], axis = 1)

fillna_fit = tdtf.FillNa(style = 'literal', value=0, columns=['sum_polarity_POS', 'sum_polarity_NEG', 'sum_polarity_NEU'])
rt = Retain(columns = [x for x in tdf_analytic_set.columns if x not in ['sum_polarity_POS', 'sum_polarity_NEG', 'sum_polarity_NEU', 'customer_id']])

tdf_analytic_set = valib.Transform(data = tdf_analytic_set,
                                   fillna = fillna_fit,
                                   key_columns = 'customer_id', 
                                   index_columns = 'customer_id',
                                   retain = [rt]).result

print('2. Polarity Added, NULLs filled:')
ipydisplay(tdf_analytic_set.head(5))

print('Check Distribution of churn vs. non-churn:')
ipydisplay(tdf_analytic_set.groupby('b_cancel').count().to_pandas()[['b_cancel','count_customer_id']])
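
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The effect of joining the sentiment features and filling the gaps can be shown on a small example.  This sketch assumes the join keeps every row from the event features and that customers without comments receive 0 for each polarity column, as in the cell above; the function name and data are illustrative.</p>

```python
def left_join_fill(left_rows, right_by_key, key, fill_columns, fill_value=0):
    """Keep every left row; copy the fill columns from the right side or use a default."""
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key], {})
        merged = dict(row)
        for column in fill_columns:
            value = match.get(column)
            merged[column] = fill_value if value is None else value
        joined.append(merged)
    return joined


polarity_columns = ['sum_polarity_POS', 'sum_polarity_NEG', 'sum_polarity_NEU']

event_features = [
    {'customer_id': 1, 'b_cancel': 1},
    {'customer_id': 2, 'b_cancel': 0},
]

# Only customer 1 left comments
sentiment_by_customer = {
    1: {'sum_polarity_POS': 0.5, 'sum_polarity_NEG': 1.0, 'sum_polarity_NEU': 0.0},
}

analytic_rows = left_join_fill(event_features, sentiment_by_customer,
                               'customer_id', polarity_columns)
for row in analytic_rows:
    print(row)
```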

<hr>

<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Test/Train Split</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The fast, in-database "sample" function can split the data into multiple data sets in seconds.  Here 20% of the rows (sampleid 1) become the test set and 80% (sampleid 2) the training set.</p>

In [None]:
tdf_samples = tdf_analytic_set.sample(frac = [0.2, 0.8])
copy_to_sql(tdf_samples[tdf_samples['sampleid'] == 2], table_name = 'churn_train', schema_name = 'demo', if_exists = 'replace')
copy_to_sql(tdf_samples[tdf_samples['sampleid'] == 1], table_name = 'churn_test', schema_name = 'demo', if_exists = 'replace')
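
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Conceptually, the split assigns each row to one of two numbered samples according to the requested fractions.  The plain-Python sketch below illustrates that idea with a seeded shuffle; it is not how the in-database sampling is implemented, and the function name and seed are assumptions made for the example.</p>

```python
import random


def split_rows(rows, fractions=(0.2, 0.8), seed=42):
    """Shuffle rows and cut them into numbered samples (1, 2, ...) by fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    samples = {}
    start = 0
    for sample_id, fraction in enumerate(fractions, start=1):
        size = round(len(shuffled) * fraction)
        samples[sample_id] = shuffled[start:start + size]
        start += size
    return samples


rows = list(range(100))
samples = split_rows(rows)

test_rows = samples[1]    # 20% sample, as with sampleid == 1 above
train_rows = samples[2]   # 80% sample, as with sampleid == 2 above
print(len(test_rows), len(train_rows))
```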

<hr>

<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Train the Model</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Use open-source XGBoost Classifier to train the model using the "training" data split above.</p>

In [None]:
# Create a Pandas DataFrame
df_train = DataFrame('"demo"."churn_train"').to_pandas(all_rows = True)

# define the input columns and target variable:
X_train = df_train[df_train.columns.drop(['session_id', 'customer_id', 'b_cancel', 'sampleid'])]
y_train = df_train[['b_cancel']]

# Fit the Model
from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(X_train, y_train)

<hr>

<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Test the Model</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>It is common practice to measure the efficacy of a model.  For this demonstration, a Confusion Matrix is generated that shows the quantity of true vs. false positives and negatives for the model.</p> 

In [None]:
# Return a Pandas DataFrame from the split data above

df_test = DataFrame('"demo"."churn_test"').to_pandas(all_rows = True)

# Define the input columns and target
X_test = df_test[df_test.columns.drop(['session_id', 'customer_id', 'b_cancel', 'sampleid'])]
y_test = df_test[['b_cancel']]


# Predict the class and the probability of churn (cancellation)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)


# Generate the Confusion Matrix
df_test[['prob_0', 'prob_1']] = y_prob
df_test['prediction'] = y_pred

cm = confusion_matrix(df_test['b_cancel'], df_test['prediction'])
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = ['0', '1'])
fig, ax = plt.subplots(figsize=(8,8))
disp.plot(ax=ax)

plt.show()

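# Get AUC score - anything over .75 is decent
# Note: this scores the hard class labels; scoring df_test['prob_1'] instead
# would give the usual probability-based AUC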
AUC = roc_auc_score(df_test['b_cancel'], df_test['prediction'])
print(f'AUC: {AUC}')
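
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the two metrics concrete, the sketch below computes confusion-matrix counts and AUC by hand on five made-up rows.  AUC is calculated here as the share of (positive, negative) pairs in which the positive row receives the higher score, with ties counting as half; the function names and the 0.5 threshold are choices made for this illustration.  Computing AUC from probabilities and from thresholded labels generally gives different values, which the example makes visible.</p>

```python
def confusion_counts(y_true, y_pred):
    """Count true/false negatives and positives for binary labels."""
    counts = {'tn': 0, 'fp': 0, 'fn': 0, 'tp': 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1:
            counts['tp' if predicted == 1 else 'fn'] += 1
        else:
            counts['fp' if predicted == 1 else 'tn'] += 1
    return counts


def pairwise_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1, 1]
y_prob = [0.1, 0.6, 0.4, 0.7, 0.9]
y_pred = [int(p >= 0.5) for p in y_prob]

counts = confusion_counts(y_true, y_pred)
auc_from_probabilities = pairwise_auc(y_true, y_prob)
auc_from_labels = pairwise_auc(y_true, y_pred)

print(counts)
print('AUC from probabilities:', auc_from_probabilities)
print('AUC from hard labels:  ', auc_from_labels)
```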

<hr>

<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Save the Model</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Save the model file in native xgboost format AND in PMML format to use in BYOM.</p> 

In [None]:
model.save_model('xgb_churn_model')

In [None]:
# Save the model as JSON, then convert it to PMML for BYOM
model.save_model('model.json')

# Writing features map to .fmap file

# Create an fmap file using the training data column types
i = 0
with open('model.fmap', 'w') as fmap_file:
    for key, value in X_train.dtypes.items():
        
        value = str(value)
        if 'int' in value:
            value = 'int'
        else:
            value = 'q'
        
        fmap_file.write('%d\t%s\t%s\n'%(i, key, value))
        i = i + 1

if not os.path.isfile('model.pmml'):
    os.system(str(jdk4py.JAVA) + ' -jar jpmml-xgboost-executable-1.5.5.jar --model-input model.json --fmap-input model.fmap --pmml-output model.pmml --target-name mm_fraud')
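
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The feature map written above is a plain tab-separated file with one line per feature: index, name, and type.  The sketch below reproduces the same mapping rule (any integer dtype becomes <code>int</code>, everything else becomes <code>q</code>) without touching the file system; the function name is illustrative, and apart from <code>sum_polarity_POS</code> and <code>b_gender</code> the column names and dtypes are assumed for the example.</p>

```python
def fmap_lines(column_types):
    """Build feature-map lines: '<index>\\t<name>\\t<type>' for each column."""
    lines = []
    for index, (name, dtype) in enumerate(column_types.items()):
        kind = 'int' if 'int' in str(dtype) else 'q'
        lines.append(f'{index}\t{name}\t{kind}')
    return lines


# Column name -> dtype string, in the order the model was trained on
example_types = {
    'b_gender': 'int64',
    'sum_polarity_POS': 'float64',
    'some_flag': 'bool',
}

lines = fmap_lines(example_types)
print('\n'.join(lines))
```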

<hr>
<p style = 'font-size:24px;font-family:Arial;color:#00233C'><b>Demo 1 Summary</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>The preceding demonstration has illustrated a machine learning model development and testing pipeline that leverages both native and open source techniques to develop a unique "Churn" model by:</b></p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><b>Connecting disparate data sets in native storage and open formats</b></li>
    <li><b>Executing powerful, unique, and scalable behavioral and sentiment analysis functions</b></li>
    <li><b>Cleansing and combining these insights into a feature-rich training set by using familiar techniques, expressed to run at the source of the data and at scale</b></li>
    <li><b>Leveraging the innovation of Open-Source tools and techniques for model building and analysis</b></li>
    </ul>

<hr>
<p style = 'font-size:24px;font-family:Arial;color:#00233C'><b>Demo 2a - Deploy to Native Scoring Function</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>VantageCloud Lake <b>Analytic Clusters</b> combine the power and scale of native <b>ClearScape Analytics</b> Functions to prepare data that can be scored with a model using common serialization formats such as PMML, ONNX, or MOJO.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Enterprise Class customers report the ability to reduce data prep and model scoring times from several hours per run to seconds; effectively allowing model scoring in near-real-time.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This demonstration will illustrate these key concepts:</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '40%'>
            <ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
                <li>Load the PMML model to the Vantage system - capture model metadata and accuracy scores</li>
                <br>
                <li>Execute the Prediction against the prepared data.  <b>Note this runs on the Primary cluster today</b></li>
                <br>
                <li>Analyze the results of the process to determine ongoing model accuracy and efficacy</li>
            </ol>
        </td>
        <td></td>
    </tr>
</table>

<hr>

<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Load the model to the Analytic Cluster</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Users can load model files, and add additional metadata if desired.</p> 

In [None]:
table_name = 'byom_models'
schema_name = 'demo'
model_id = 'xgb_churn_1'

try:
    delete_byom(model_id, table_name = table_name, schema_name = schema_name)
except Exception as e:
    if str(e.args).find('Failed to delete the model') >= 1:
        pass
    elif str(e.args).find('not found') >= 1:
        pass
    else:
        raise
        
model_metadata = {'Description': 'XGBoost Churn model',
                  'AUC' : float(AUC),
                  'ModelSavedDate': dt.date.today(),
                  'ModelSavedTime': dt.datetime.now().time()}

model_datatypes = {'Description':VARCHAR(100),
                   'AUC':FLOAT,
                   'ModelSavedDate':DATE(),
                   'ModelSavedTime':TIME()}

res = save_byom(model_id = model_id, 
                model_file = 'model.pmml', 
                schema_name = 'demo',
                table_name = table_name,
                additional_columns = model_metadata, 
                additional_columns_types = model_datatypes)

list_byom(schema_name = schema_name, table_name = table_name)

<hr>

<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Execute the Scoring function</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The model scoring is executed via the <b>PMMLPredict</b> function, to which we pass:</p>

<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>The data set to be scored.  This is a "Virtual" DataFrame that represents the state of the data <b>in Vantage</b>.  In the case below, we pass the transformed data.  This could also be a set of methods or SQL queries that prep and transform the data in-line</li>
    <li>The version of the model we wish to use</li>
    <li>Function parameters such as additional columns to return, and the model output fields</li>
    </ul>
    

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Finally, the scoring is only executed when needed - in this case to satisfy the head() method call.</p>

In [None]:
# Run the PMMLPredict function in Vantage

from teradataml import PMMLPredict

configure.byom_install_location = 'TD_MLDB'
tdf_predict = PMMLPredict(
            modeldata = retrieve_byom(model_id, table_name = table_name, schema_name = schema_name),
            model_output_fields = ['probability(1)', 'probability(0)'],
            newdata = DataFrame('"demo"."churn_test"'),
            accumulate = ['b_cancel','customer_id', 'session_id']
            ).result

tdf_predict.head()
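
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The "executed only when needed" behavior is a form of lazy evaluation.  The toy class below illustrates the general idea in plain Python: building the object does no work, and the computation runs the first time results are requested.  It is purely illustrative; the class, its caching of results, and the stand-in scoring function are assumptions for this sketch and do not describe how teradataml is implemented.</p>

```python
class LazyResult:
    """Hold a computation and run it only when rows are first requested."""

    def __init__(self, compute):
        self._compute = compute
        self._rows = None
        self.executions = 0

    def head(self, n=10):
        # The deferred work happens here, on first access
        if self._rows is None:
            self.executions += 1
            self._rows = self._compute()
        return self._rows[:n]


def score_all_rows():
    """Stand-in for an expensive scoring step."""
    return [{'row': i, 'probability': i / 10} for i in range(10)]


result = LazyResult(score_all_rows)
print('Executions after construction:', result.executions)

first_rows = result.head(3)
print('Executions after head():', result.executions)
print(first_rows)
```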

<hr>

<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Analyze the Results</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>It is common practice to measure the efficacy of a model.  Since the number of records in this data set is small, it can be copied to the client for visualization.</p> 

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Analyzing results at scale</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Under real-world conditions, it may be that the number of records that are being scored is too large to copy to the client - either due to network or client-side resource limitations.  To address these challenges, Teradata provides several model evaluation functions.  For Classification tasks, the <a href = 'https://docs.teradata.com/r/Teradata-VantageCloud-Lake/Analyzing-Your-Data/Analytics-Database-Analytic-Functions/Model-Evaluation-Functions/TD_ClassificationEvaluator'>TD_ClassificationEvaluator</a> function will create a confusion matrix in-database, and can be executed against data at any scale.</p>

In [None]:
# Copy the predictions to the client
# to generate the simple Confusion Matrix
# and print the AUC (Area Under Curve)

df_test = tdf_predict.to_pandas(all_rows = True)
df_test['prediction'] = df_test['probability(1)'].round(0).astype(int)
cm = confusion_matrix(df_test['b_cancel'].astype(int), df_test['prediction'])
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = ['0', '1'])
fig, ax = plt.subplots(figsize=(8,8))
disp.plot(ax=ax)

plt.show()

#Get AUC score - anything over .75 is decent
AUC = roc_auc_score(df_test['b_cancel'].astype(int), df_test['prediction'].astype(int))
print(f'AUC: {AUC}')

<hr>
<p style = 'font-size:24px;font-family:Arial;color:#00233C'><b>Demo 2b - Deploy to Custom Container</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>VantageCloud Lake Edition <b>Analytic Clusters</b> combine the power and scale of native <b>ClearScape Analytics</b> Functions with the open and flexible runtime environments; offering users the flexibility to balance built-in data prep, transformation and feature engineering functions with custom code and models at massive scale.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Enterprise Class customers report the ability to reduce data prep and model scoring times from several hours per run to seconds; effectively allowing model scoring in near-real-time.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This demonstration will illustrate these key concepts:</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '40%'>
            <ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
                <li>Load the model and custom code to the Analytic Cluster Python Runtime</li>
                <br>
                <li>Execute the combined native query and the python scoring functions together, in parallel</li>
                <br>
                <li>Analyze the results of the process to determine ongoing model accuracy and efficacy</li>
            </ol>
        </td>
        <td><img src = 'images/OAF_Scoring.png' width = '600'></td>
    </tr>
</table>

<hr>

<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Connect to the Environment Service</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To better support integration with Cloud Services and common automation tools, the <b>User Environment Service</b> is accessed via RESTful APIs.  These APIs can be called directly, or through the Python Package for Teradata (teradataml) methods used in the examples below.</p> 

In [None]:
# Configure base URL to the specific account service
# This URL comes from the environment panel in console; it is normally loaded
# earlier from the environment object in vars.json.
# Prompt for it only if it has not been defined yet.

try:
    ues_url
except NameError:
    ues_url = input('Please enter UES URL from Console: ')

# This python snippet parses the url to place values into appropriate spots for pftoken to obtain JWT
parsed_ues_url = urlparse(ues_url)
client_id = parsed_ues_url.netloc.split('.')[0]

if set_auth_token(ues_url = ues_url, username = username, pat_token = 'EXAMPLE', pem_file = 'path/to/pem.pem'):
    print('UES Authentication successful')
else:
    print('UES Authentication failed, check URL and account info')

# List available Base Python environments
print('Available Base Environments:')
ipydisplay(list_base_envs())

# List any custom environments
# Code here will catch any errors if there are no
# existing custom environments

print('Available User Environments:')
try:
    ipydisplay(list_user_envs())
except Exception as e:
    # Match on the error message text
    if 'No user environments found' in str(e):
        print('No user environments found')
    else:
        raise

In [None]:
# Create a new environment, or connect to an existing one.
# This demo code creates a fixed environment called "My_Scoring_Env".
# No base_env is passed to create_env, so the default base environment is used.

try:
    demo_env = create_env(env_name = 'My_Scoring_Env',
                          desc = 'Demonstration dedicated python environment')
except Exception as e:
    # Match on the error message text
    if 'same name already exists' in str(e):
        print('Environment already exists, obtaining a reference to it')
        demo_env = get_env('My_Scoring_Env')
    else:
        raise

sleep(5)

try:
    ipydisplay(list_user_envs())
except Exception as e:
    # Match on the error message text
    if 'No user environments found' in str(e):
        print('No user environments found')
    else:
        raise

In [None]:
# Install any Python add-ons needed by the script in the user environment.
# The option asynchronous=True runs the installation asynchronously.
# Note: Avoid asynchronous installation when batch-executing all notebook statements,
#       as execution will continue even if the installation is not complete.
#
claim_id = demo_env.install_lib(['numpy','pandas','scikit-learn', 'xgboost'], asynchronous=True)

# Check the status of the installation using the status() API.
# Poll in a loop here for demo purposes, fetching the status once per pass.

status = demo_env.status(claim_id)
ipydisplay(status)
while status['Stage'].iloc[-1] == 'Started':
    sleep(5)
    status = demo_env.status(claim_id)
    clear_output()
    ipydisplay(status)
    
# Verify the Python libraries have been installed correctly.
ipydisplay(demo_env.libs)
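
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A polling loop of this kind runs until the stage changes, so for unattended runs it can help to add an upper bound on the wait.  The following is a minimal sketch of one way to do that.  The helper name wait_for_stage, its parameters, and the 'Finished' stage value are assumptions made for this illustration and are not part of teradataml; a stand-in function replaces the real status call so the example runs on its own.</p>

```python
import time


def wait_for_stage(get_stage, pending='Started', interval=5.0, timeout=600.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_stage() until it returns something other than `pending`.

    get_stage is any zero-argument callable returning the current stage string.
    Raises TimeoutError if the stage is still pending once `timeout` seconds pass.
    """
    deadline = clock() + timeout
    stage = get_stage()
    while stage == pending:
        if clock() >= deadline:
            raise TimeoutError(f'still {pending!r} after {timeout} seconds')
        sleep(interval)
        stage = get_stage()
    return stage


# Stand-in for the real status call: reports 'Started' twice, then 'Finished'.
# 'Finished' is a placeholder value chosen for this sketch.
fake_stages = iter(['Started', 'Started', 'Finished'])
final_stage = wait_for_stage(lambda: next(fake_stages), interval=0.01, timeout=5.0)
print(final_stage)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this notebook, the callable could be a lambda that returns demo_env.status(claim_id)['Stage'].iloc[-1], which is the same expression the loop above already evaluates.</p>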

<hr>

<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Install User Files in the Cluster Container</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Users can load any asset to the environment using the install_file method.  This ensures that only authenticated users can install specific files into a dedicated filesystem, and helps prevent malicious code injection.  Users pass the file name, and whether to replace an existing file.</p> 

In [None]:
# Install xgboost model file.
#
demo_env.install_file('xgb_churn_model', replace = True)

# Install the desired Python script into the environment.
demo_env.install_file('XGB_Churn_Scoring.py', replace = True)

# Verify the files have been installed correctly.
demo_env.files

<hr>

<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Execute the Scoring function</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The script is executed via the <b>Apply</b> method, where we pass:</p>

<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li>The data set to be scored.  This is a "Virtual" DataFrame that represents the state of the data <b>In Vantage</b>.  In the case below, we pass the transformed data.  This could also be the methods or SQL queries shown above, applied to prep and transform the data in-line</li>
    <li>The command to run; in this case, a call to the Python runtime</li>
    <li>The format of the data being returned from the functions</li>
    <li>The custom container to execute the queries and code</li>
    </ul>
    

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Finally, the script is executed by calling the "execute_script" method.  Because constructing the Apply object does not run anything by itself, this "lazy" evaluation allows for a more modular and performant architecture.</p>
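
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The contents of XGB_Churn_Scoring.py are not shown in this section.  As a rough sketch of the general pattern such a script tends to follow, the example below reads delimited rows from an input stream and writes one delimited row per input row to an output stream, in the column order declared in the returns argument.  The input column layout, the comma delimiter, and the stub_predict_proba function (a stand-in for loading and calling the real XGBoost model) are all assumptions made for illustration.</p>

```python
import io

DELIMITER = ','  # assumed delimiter; it must match the one used by the calling query


def stub_predict_proba(features):
    """Hypothetical stand-in for the real model: returns (prob_0, prob_1)."""
    prob_1 = min(max(sum(features) / (len(features) or 1), 0.0), 1.0)
    return 1.0 - prob_1, prob_1


def score_stream(in_stream, out_stream):
    """Write customer_id, session_id, Prob_0, Prob_1, Prediction, Actual per input row."""
    for line in in_stream:
        line = line.rstrip('\n')
        if not line:
            continue  # skip blank lines
        # Assumed input layout: customer_id, session_id, actual, feature_1..feature_n
        customer_id, session_id, actual, *raw = line.split(DELIMITER)
        prob_0, prob_1 = stub_predict_proba([float(v) for v in raw])
        prediction = 1 if prob_1 >= 0.5 else 0
        out_stream.write(DELIMITER.join(
            [customer_id, session_id, f'{prob_0:.4f}', f'{prob_1:.4f}',
             str(prediction), actual]
        ) + '\n')


# Demonstrate with in-memory streams; a deployed script would instead
# pass sys.stdin and sys.stdout.
demo_in = io.StringIO('c1,s1,1,0.9,0.7\nc2,s2,0,0.1,0.3\n')
demo_out = io.StringIO()
score_stream(demo_in, demo_out)
print(demo_out.getvalue(), end='')
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Keeping the row logic in a function that takes streams, rather than reading standard input directly, makes the script easy to test on the client before it is installed in the remote environment.</p>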

In [None]:

apply_obj = Apply(data = DataFrame('"demo"."churn_test"'),
                  apply_command = 'python3 XGB_Churn_Scoring.py',
                  returns = {'customer_id': VARCHAR(20), 'session_id': VARCHAR(20),
                             'Prob_0': VARCHAR(30), 'Prob_1': VARCHAR(30), 
                             'Prediction':VARCHAR(2),'Actual': VARCHAR(2)},
                  env_name = demo_env,
                 )

In [None]:
# Execute the Python script inside the remote user environment.
# The result is a teradataml DataFrame. 
#
scored_data = apply_obj.execute_script()

# Only return five rows - minimize network overhead
scored_data.head(5)

<hr>

<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Analyze the Results</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>It is common practice to measure the efficacy of a model.  For this demonstration, a "Confusion Matrix" is generated that shows the quantity of true vs. false positives and negatives for the model.</p> 
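
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>For reference, the four cells of a binary confusion matrix and the usual metrics derived from them can be computed by hand.  The labels below are made-up values for illustration, not results from this demonstration.</p>

```python
# Made-up labels for illustration only (1 = positive class, 0 = negative class).
actual     = [1, 0, 1, 1, 0, 0, 1, 0]
prediction = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(actual, prediction))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives

# Rows are actual classes, columns are predicted classes.
matrix = [[tn, fp],
          [fn, tp]]

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(matrix)
print(f'accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}')
```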

In [None]:
# Copy the predictions to the client
# to generate the simple Confusion Matrix
# and print the AUC (Area Under Curve)

df_test = scored_data.to_pandas(all_rows = True)
cm = confusion_matrix(df_test['Actual'].astype(int), df_test['Prediction'].astype(int))
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = ['0', '1'])
fig, ax = plt.subplots(figsize=(8,8))
disp.plot(ax=ax)

plt.show()

# Get the AUC score; as a rule of thumb, anything over 0.75 is decent
AUC = roc_auc_score(df_test['Actual'].astype(int), df_test['Prediction'].astype(int))
print(f'AUC: {AUC}')
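
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The cell above passes the hard 0/1 predictions to roc_auc_score.  AUC is usually computed from a continuous score such as Prob_1, since thresholded labels collapse the ROC curve to a single operating point.  The sketch below uses a small hand-written pairwise AUC on made-up values to show how the two can differ; pairwise_auc is a helper written for this illustration and is not part of scikit-learn or teradataml.</p>

```python
def pairwise_auc(actual, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    if not pos or not neg:
        raise ValueError('AUC needs at least one positive and one negative example')
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up values for illustration only.
actual = [1, 1, 1, 0, 0, 0]
prob_1 = [0.9, 0.6, 0.4, 0.55, 0.3, 0.1]
hard = [1 if p >= 0.5 else 0 for p in prob_1]  # thresholded at 0.5

auc_from_probabilities = pairwise_auc(actual, prob_1)
auc_from_hard_labels = pairwise_auc(actual, hard)

print(f'AUC from probabilities: {auc_from_probabilities:.3f}')
print(f'AUC from hard labels:   {auc_from_hard_labels:.3f}')
```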

<hr>
<p style = 'font-size:24px;font-family:Arial;color:#00233C'><b>Demo 2 Summary</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>The preceding demonstration has shown one method for executing open-source models in a uniquely scalable manner by leveraging the VantageCloud Lake Open Analytics Framework to:</b></p>
<ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
    <li><b>Create a secure, isolated, and custom runtime container</b></li>
    <li><b>Deploy analytic artifacts including user code and models</b></li>
    <li><b>Execute the Operational pipeline including data preparation and data cleansing in-line with model execution and in parallel</b></li>
    <li><b>Expose the products of Machine Learning to the broadest possible audience</b></li>
    </ul>

<hr>
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>Clean up</b></p>

In [None]:
# Uninstall the libraries from the environment before removing it
demo_env.uninstall_lib(libs = demo_env.libs['name'].to_list())
remove_env('My_Scoring_Env')

In [None]:
remove_context()