

<header>
   <p  style='font-size:36px;font-family:Arial; color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Leveraging Open Source Machine Learning with ClearScape Analytics and Open Analytics Framework
  <br>
       <img id="teradata-logo" src="images/TeradataLogo.png" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

    
    
<div style="background-color: #f57c00; color: white; padding: 10px; border-radius: 5px; display: flex; align-items: center; justify-content: space-between; width: 99%; min-height: 64px;">
    <div style="display: flex; align-items: center;">
        <span style="font-size: 24px; margin: 22px;">⚠️</span>
        <div>
            <strong>This demo requires Open Analytics Framework</strong><br>
            You need to have Open Analytics Framework enabled for this environment. If you have not done it already, go back to <a href="https://clearscape.teradata.com/dashboard" style="color: white; text-decoration: underline;">ClearScape Analytics Experience dashboard</a> to request access.
        </div>
    </div>
    <a href="https://docs.teradata.com/r/Lake-Analyze-Your-Data-with-ClearScape-AnalyticsTM/Build-Scalable-Analytics-with-Open-Analytics-Framework/Introduction-to-Vantage-Open-Analytics/What-is-Open-Analytics-Framework" style="color: white; display: flex; align-items: center; white-space: nowrap; margin: 20px">
        Learn more
        <img src="images/new-tab-icon.png" alt="New Window Icon" style="width: 24px; height: 24px; margin-left: 5px;">
    </a>
</div>

<p>Open-source Machine Learning, AI, and Advanced Analytics tools, techniques, and resources offer enterprises limitless opportunities to drive new insights and business value from their internal and external data landscape.  Unfortunately, with these opportunities come significant challenges to realizing success.  Some of these challenges include:</p>
<ul>
    <li><b>Performance and Scale.</b>  Many popular tools and techniques are designed to run in a single user's environment, which drastically limits the ability to deploy against enterprise-scale data sets and support operational SLAs.  Special-purpose distributed computing architectures only support specialized libraries, limiting capabilities while increasing complexity.</li>
    <li><b>Stability and Security.</b>  Most organizations limit the use of user-generated code or models in production, and for good reason.  Poorly written code or inefficient libraries can over-consume production resources or, worse, create a major security risk.</li>
    <li><b>Consistency.</b>  Model performance, accuracy, and predictive stability are all very sensitive to environmental dependencies and package versioning.  Maintaining consistent, repeatable, and operationally stable environments in production is a heavily manual and fragile process.</li></ul>
        
        
        
<p>VantageCloud Lake Edition <b>Open Analytics Framework</b> is the only enterprise-class platform that addresses these challenges with a simple, powerful architecture.  The following demonstration illustrates how users can take <b>any</b> open-source tool or package of their choice, deploy it to a custom, isolated environment, and then execute it in parallel and at massive scale.</p>

<hr>

<b style = 'font-size:28px;font-family:Arial'>Environment Overview</b>

<p>This demonstration utilizes a VantageCloud Lake <b>Analytic Cluster</b> architecture and the shared data sets created in the previous demonstration, specifically the "Txn_History" data, which represents "CashApp"-style transaction history stored in the Vantage Object File System (OFS).</p>

<p>The high level process is as follows:</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr><td style = 'vertical-align:top' width = '40%'>
            <ol>
                <li>The Data Scientist conducts analytics activities using their own Python tools and packages of choice, then connects to VantageCloud Lake through the teradataml client library and the teradatasql Python driver.</li>
                <br>
                <li>Teradataml provides APIs to create and manage artifacts in the User Environment Service, including custom libraries, dependencies, model artifacts, and scoring scripts.  The user can leverage these APIs to create one or many custom, dedicated environments to host their code.</li>
                <br>
                <li>The Data Scientist then executes their pipeline, which will:
                    <ul><li>Call ClearScape Analytics functions on Compute Clusters (data prep, transformation, etc.)</li>
                        <li>Pass the prepared data to the Python container running in parallel on the cluster nodes</li>
                        <li>Return results (inference/predictions) as "virtual" DataFrames, where the data remains in Vantage</li>
                        <li>Persist data in the Object File System, write it to open object storage, or copy it to the client</li>
                    </ul></li>
            </ol>
        </td><td><img src = 'images/OAF_Overview.png' width = '600'></td></tr>
</table>

<b style = 'font-size:28px;font-family:Arial;'>Demonstration Overview</b>

<p>This notebook consists of three primary demonstrations and an appendix:</p>
<ol>
    <li><b>Custom Environment Management</b> - Create a server-side, custom Python container with explicit packages and versions installed</li>
    <li><b>File Management</b> - Upload model files, scoring scripts, and any other asset type</li>
    <li><b>Analytics</b> - Execute powerful feature engineering and statistical functions and pass the results directly to the Python container running in parallel</li>
    <li><b>Appendix - Model Training and Testing</b> - The process for creating and testing the model using open-source tools is provided in the Appendix</li>
    </ol>



<hr>

<p style = 'font-size:18px;font-family:Arial'><b>Python Package Imports</b></p>

<p>As is standard practice, import the required packages and libraries first.  Execute this cell to import packages for Teradata automation as well as machine learning, analytics, utility, and data management packages.</p> 

In [None]:
# install other required packages
%pip install xgboost

In [None]:
# Import the Python library teradataml and the specific environment setup modules.
#
import warnings
from teradataml import *
from db_utils import *
warnings.filterwarnings('ignore')
display.suppress_vantage_runtime_warnings = True

from IPython.display import display as ipydisplay
from IPython.display import clear_output 

from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
#
# Account for the data types to be used with the script.
#
from teradatasqlalchemy.types import BIGINT, VARCHAR, FLOAT, INTEGER
from collections import OrderedDict
#
# Other case-specific imports.
#
import json, os, sys, getpass
import pandas as pd
from time import sleep

# container name - set here for easier notebook navigation
### User will also be asked to change it ###
oaf_name = 'OAF_demo_env'
###########################
print(f'using "{oaf_name}" for the OAF environment')

# get the current Python version so the custom container can be deployed with a matching version
python_version = f'{sys.version_info.major}.{sys.version_info.minor}'
print(f'Using Python version {python_version} for user environment')

<hr>

<p style = 'font-size:18px;font-family:Arial'><b>Connect to Vantage</b></p>

<p>Before performing any operations in Vantage, we need to connect to the system.  The below code will read in a variables file (vars.json - this has been used in prior environment setup and data engineering examples) and will connect to Vantage with this information.  The Vantage connection is referred to as a "Context" - a common python-rdbms connection architecture.</p> 
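<p>The connection cell reads several nested keys from the variables file.  As a rough illustration of that layout, the sketch below flattens the same keys into one dictionary.  The sample values and the <code>connection_settings</code> helper are hypothetical and exist only for this example; only the key names come from the cell that follows.</p>

```python
import json

# Hypothetical sample that mirrors the key layout read by the notebook cell;
# every value is a placeholder, not a real host or credential.
SAMPLE_VARS = json.loads("""
{
  "environment": {"host": "example-host", "UES_URI": "https://example-ues"},
  "hierarchy": {"users": {"business_users": [
    {"username": "user0", "password": "pw0", "pat_token": "t0",
     "key_file": "k0.pem", "compute_group": "cg0"},
    {"username": "user1", "password": "pw1", "pat_token": "t1",
     "key_file": "k1.pem", "compute_group": "cg1"}
  ]}}
}
""")


def connection_settings(session_vars, user_index=1):
    """Flatten the nested vars structure into one dict of connection settings."""
    env = session_vars["environment"]
    user = session_vars["hierarchy"]["users"]["business_users"][user_index]
    return {
        "host": env["host"],
        "ues_url": env["UES_URI"],
        "username": user["username"],
        "password": user["password"],
        "pat_token": user["pat_token"],
        "pem_file": user["key_file"],
        "compute_group": user["compute_group"],
    }


settings = connection_settings(SAMPLE_VARS)
print(settings["username"], settings["compute_group"])
```

<p>Collecting the lookups in one place makes it easier to spot a missing key before any connection attempt is made.</p>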

In [None]:
# load vars json
with open('../../vars.json', 'r') as f:
    session_vars = json.load(f)

# Create the SQLAlchemy Context
host = session_vars['environment']['host']
username = session_vars['hierarchy']['users']['business_users'][1]['username']
password = session_vars['hierarchy']['users']['business_users'][1]['password']

# UES Authentication information
ues_url = session_vars['environment']['UES_URI']
configure.ues_url = ues_url
pat_token = session_vars['hierarchy']['users']['business_users'][1]['pat_token']
pem_file = session_vars['hierarchy']['users']['business_users'][1]['key_file']

compute_group = session_vars['hierarchy']['users']['business_users'][1]['compute_group']

# check for existing connection
eng = check_and_connect(host=host, username=username, password=password, compute_group = compute_group)
print(eng)

In [None]:
# check cluster status
res = check_cluster_start(compute_group = compute_group)

<hr>
<p style = 'font-size:28px;font-family:Arial'><b>Demo 1 - Custom Container Management</b></p>



<p>The Teradata Vantage Python Client Library provides simple, powerful methods for the creation and maintenance of custom Python runtime environments <b>in the VantageCloud environment</b>.  This gives practitioners complete control over model behavior, performance, and analytic accuracy on the Analytic Cluster.  The following demonstration shows how to create a custom xgboost-based scoring environment.</p>

<img src = 'images/Container_Layout.png' width = '70%'>

<p><b>Custom environments are persistent.</b> Users only need to create them once; after that, they can be saved, updated, or modified as needed.</p>

<hr>
<p style = 'font-size:18px;font-family:Arial'><b>Container Management Process</b></p>
<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '40%'>
            <ul>
                <li>Set up a connection to the Environment Service</li>
                <br>
                <li>Create a unique User Environment based on available base images</li>
                <br>
                <li>Install custom libraries and specific versions if required</li>
                <br>
                <li>Monitor package installation and view installed packages</li>
            </ul>
        </td>
        <td><img src = 'images/OAF_Env.png' width = '600'></td>
    </tr>
</table>

<hr>

<p style = 'font-size:18px;font-family:Arial'><b>Connect to the Environment Service</b></p>

<p>To better support integration with Cloud Services and common automation tools, the <b style = 'color:#00b2b1'> User Environment Service</b> is accessed via RESTful APIs.  These APIs can be called directly or, as in the examples below, through the methods of the Python Package for Teradata (teradataml).</p> 

<p><b>In order to properly authenticate to the UES infrastructure, the user must log in with the same credentials that are used to connect to the database.  When the following cell executes, follow the instructions to open a browser window, and log in with that user.</b></p>

In [None]:
# check to see if there is a valid UES auth
# if not, authenticate
try:
    demo_env = get_env(oaf_name)
    print('Existing valid UES token')

except Exception as e:
    if '''NoneType' object has no attribute 'value''' in str(e) or '''Failed to execute get_env''' in str(e):
        if set_auth_token(ues_url = ues_url, username = username, pat_token = pat_token, pem_file = pem_file):
            print('UES Authentication successful')
        else:
            print('UES Authentication failed, check URL and account info')
        pass
    else:
        raise
    

<hr>

<p style = 'font-size:18px;font-family:Arial'><b>Create a Custom Container in Vantage</b></p>

<p>If desired, the user can create a <b>new</b> custom environment by starting with a "base" image and customizing it.  The steps are:</p> 
<ul>
    <li>List the available "base" images the system supports</li>
    <li>List any existing "custom" environments the user has created</li>
    <li>If there are no custom environments, then create a new one from a base image</li>
    </ul>
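<p>The "create a new environment, or reuse the existing one" step can be sketched without a Vantage system.  In the sketch below, <code>fake_create</code>, <code>fake_get</code>, and the in-memory registry are hypothetical stand-ins for the real environment calls, and the error text is assumed to contain "same name already exists".  The sketch tests the message with <code>in</code>, which also matches text at the very start of the message.</p>

```python
def create_or_get(name, create, get):
    """Return a new environment, or the existing one if the name is taken."""
    try:
        return create(name)
    except Exception as exc:
        # 'in' matches the text anywhere in the message, including position 0;
        # str.find(...) > 0 would miss a message that starts with the text.
        if "same name already exists" in str(exc):
            return get(name)
        raise


# Stand-in registry and functions so the sketch runs without a Vantage system.
_registry = {}


def fake_create(name):
    if name in _registry:
        raise RuntimeError("same name already exists: " + name)
    _registry[name] = {"env_name": name}
    return _registry[name]


def fake_get(name):
    return _registry[name]


first = create_or_get("OAF_demo_env", fake_create, fake_get)
second = create_or_get("OAF_demo_env", fake_create, fake_get)
print(first is second)  # the second call reuses the first environment
```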

In [None]:
# List available Base Python environments

ipydisplay(list_base_envs())

In [None]:
# Create a new environment, or connect to an existing one

try:
    ipydisplay(list_user_envs())
except Exception as e:
    
    if 'No user environments found' in str(e):
        print('No user environments found')
    else:
        raise

print('Use an existing environment, or create a new one:')
print(f'OAF Environment is set to {oaf_name}.')
print('Enter to accept, or input a new value.')
print('If the environment is not in the list, a new one will be created')
i = input()
if len(i) != 0:
    oaf_name = i
    print(f'OAF Environment is now {oaf_name}')

try:
    demo_env = create_env(env_name = oaf_name,
                      base_env = f'python_{python_version}',
                      desc = 'OAF Demo environment')
except Exception as e:
    if 'same name already exists' in str(e):
        print('Environment already exists, obtaining a reference to it')
        demo_env = get_env(oaf_name)
    elif 'Invalid value for base environment name' in str(e):
        print('Unsupported base environment version, using defaults')
        demo_env = create_env(env_name = oaf_name,
                      desc = 'OAF Demo environment')
    else:
        raise

# Note create_env seems to be asynchronous - sleep a bit for it to register
sleep(5)

try:
    ipydisplay(list_user_envs())
except Exception as e:
    if 'No user environments found' in str(e):
        print('No user environments found')
    else:
        raise

<hr>

<p style = 'font-size:18px;font-family:Arial'><b>Install Dependencies</b></p>

<p>The second step in the customization process is to install Python package dependencies.  This set of code:
</p> 

<ul>
    <li>Lists any installed packages.  If this is a new environment, there will be few packages.</li>
    <li>Calls the "install_lib" method, passing a list of packages and (optionally) versions to install.</li>
    <li>Runs a short loop to monitor installation status.  Since this is a remote operation, it's important to surface any problems or warnings.</li>
    </ul>
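<p>A monitoring loop for a remote operation usually benefits from a timeout, so that a stalled installation does not block the notebook forever.  The sketch below is one possible shape for such a loop; <code>wait_while</code> is a hypothetical helper, the stage name 'Finished' is an assumed example value, and the iterator stands in for repeated status lookups.</p>

```python
import time


def wait_while(get_stage, pending="Started", interval=5.0, timeout=600.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll get_stage() until it stops returning `pending` or the timeout passes."""
    deadline = clock() + timeout
    stage = get_stage()
    while stage == pending:
        if clock() >= deadline:
            raise TimeoutError(f"still '{pending}' after {timeout} seconds")
        sleep(interval)
        stage = get_stage()
    return stage


# Stand-in for repeated status lookups; 'Finished' is an assumed stage name.
stages = iter(["Started", "Started", "Finished"])
final_stage = wait_while(lambda: next(stages), interval=0.0)
print(final_stage)
```

<p>Passing the stage lookup as a function keeps the loop independent of any particular client library.</p>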

In [None]:
# View existing libraries in the user environment.
demo_env.libs

In [None]:
# Install any Python add-ons needed by the script in the user environment
# Using option asynchronous=True for an asynchronous execution of the statement.
# Note: Avoid asynchronous installation when batch-executing all notebook statements,
#       as execution will continue even without installation being complete.
#
claim_id = demo_env.install_lib(['numpy','pandas','scikit-learn', 'xgboost==1.6.2'], asynchronous=True)

In [None]:
# Check the status of installation using status() API.
# Create a loop here for demo purposes

status = demo_env.status(claim_id)
ipydisplay(status)
# Poll every 5 seconds until the installation leaves the 'Started' stage
while status['Stage'].iloc[-1] == 'Started':
    sleep(5)
    status = demo_env.status(claim_id)
    clear_output()
    ipydisplay(status)
    
# Verify the Python libraries have been installed correctly.
ipydisplay(demo_env.libs)

<hr>
<p style = 'font-size:28px;font-family:Arial'><b>Demo 2 - Install Custom Models and Scripts</b></p>

<p>Once the custom runtime environment has been created, the user can load custom user-created assets.  For the purposes of this demonstration, we will load two files:</p>

<ol>
    <li><b>'xgb_model'</b> - This is a simple XGBoost Classifier model that was trained on the "Financial Fraud" data in the OFS table.  It has an accuracy score of approximately 97.4%.  The Appendix provides the code used to train, test, and save this model file.</li>
    <br>
    <li><b>'Demo_XGB_Scoring.py'</b> - This file is a simple Python program that acts as the bridge between EDW processing on the Analytic Cluster and the XGBoost model scoring.  It simply formats the incoming data, calls the model, and outputs the model predictions.  When executed on the individual parallel Analytic Cluster nodes, it will use the XGBoost model file to score its portion of the data.</li>
    </ol>
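<p>The scoring script itself is not reproduced in this notebook.  The sketch below only illustrates the general "format the input, call the model, output predictions" shape described above.  It assumes rows arrive as comma-delimited text lines, uses a hypothetical column layout (identifier first, actual label last, features in between), and replaces the XGBoost model with a stub so that it runs anywhere.</p>

```python
DELIMITER = ","  # assumption: fields arrive comma-separated


def stub_predict_proba(features):
    """Stand-in for a real model: flags any row whose first feature exceeds 1000."""
    return (0.1, 0.9) if features[0] > 1000 else (0.9, 0.1)


def score_lines(lines, predict_proba=stub_predict_proba):
    """Yield one delimited output row per delimited input row."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(DELIMITER)
        # Hypothetical layout: id first, actual label last, features between.
        txn_id, actual = fields[0], fields[-1]
        features = [float(value) for value in fields[1:-1]]
        p0, p1 = predict_proba(features)
        prediction = int(p1 >= 0.5)
        yield DELIMITER.join([txn_id, str(p0), str(p1), str(prediction), actual])


# In a deployed script the lines would typically come from standard input;
# a small in-memory sample keeps this sketch self-contained.
sample = ["t1,5000.0,0.0,1", "t2,10.0,20.0,0"]
for row in score_lines(sample):
    print(row)
```

<p>The output fields follow the order txn_id, Prob_0, Prob_1, Prediction, Actual, matching the column names declared later in the scoring call.</p>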
    
<p>Once again, the Vantage Python Library makes this process straightforward by calling two simple methods:</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '40%'>
            <ul>
                <li>"install_file" for each of the two assets</li>
                <br>
                <li>Verification using the "files" property</li>
            </ul>
        </td>
        <td><img src = 'images/Model.png' width = '600'></td>
    </tr>
</table>

<hr>

<p style = 'font-size:18px;font-family:Arial'><b>Install User Files in the Cluster Container</b></p>

<p>Users can load any asset to the environment using the install_file method.  This ensures that only authenticated users can install specific files into a dedicated filesystem, and helps prevent malicious code injection.  Users pass the file name, and whether to replace an existing file.</p> 

In [None]:
# Install xgboost model file.
#
demo_env.install_file('xgb_model', replace = True)

# Install the desired Python script into the environment.
demo_env.install_file('Demo_XGB_Scoring.py', replace = True)

<hr>

<p style = 'font-size:18px;font-family:Arial'><b>List all installed files</b></p>

<p>The "files" property lists each asset with its size and last-updated timestamp.  As above, these methods are available to manage the container remotely, since these containers live in the Vantage environment.</p> 

In [None]:
# Verify the files have been installed correctly.
demo_env.files

<hr>
<p style = 'font-size:28px;font-family:Arial'><b>Demo 3 - Model Scoring at Scale</b></p>

<p>VantageCloud Lake Edition <b>Analytic Clusters</b> combine the power and scale of native <b>ClearScape Analytics</b> functions with open and flexible runtime environments, offering users the flexibility to balance built-in data prep, transformation, and feature engineering functions with custom code and models at massive scale.</p>

<p>Enterprise Class customers report the ability to reduce data prep and model scoring times from several hours per run to seconds; effectively allowing model scoring in near-real-time.</p>

<p>This demonstration will illustrate these key concepts:</p>

<table style = 'width:100%;table-layout:fixed;'>
    <tr>
        <td style = 'vertical-align:top' width = '40%'>
            <ul>
                <li>Leverage native data preparation functions to process incoming data for the model scoring</li>
                <br>
                <li>Execute the combined native query and the python scoring functions together, in parallel</li>
                <br>
                <li>Analyze the results of the process to determine ongoing model accuracy and efficacy</li>
            </ul>
        </td>
        <td><img src = 'images/OAF_Scoring.png' width = '600'></td>
    </tr>
</table>

<hr>

<p style = 'font-size:18px;font-family:Arial'><b>Data Transformation/Feature Engineering</b></p>

<p style = 'font-size:16px;font-family:Arial'>Create a reference to the data set in Vantage, and apply powerful transformation functions directly on the data. <b style = 'color:#00b2b1'>ClearScape Analytics</b> is a suite of in-database massively-parallel-processing functions for statistical analysis, data cleaning and transformation, machine learning, text analytics, and model scoring.  Practitioners can leverage these functions together with open-source modeling as illustrated here, or create powerful, native end-to-end pipelines using just these functions.</p>

<img src = 'images/In_DB_Functions.png'>

In [None]:
# Create a reference to the data set in-Vantage
# by creating a "Teradata DataFrame"
# which is a reference to the data.


tdf_test = DataFrame('"demo_ofs"."txn_history"')

# Only retrieve a small subset of rows to verify the connection
tdf_test.head(5)

<hr>

<p style = 'font-size:18px;font-family:Arial'><b>Engineer Features</b></p>

<p>Call the ClearScape <b>One Hot Encoding</b> function to transform the categorical column into numeric features.</p>
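<p>To make the transformation concrete, the plain-Python sketch below shows what one-hot encoding with an "other" bucket produces for a single value.  It is an illustration of the concept only, not the in-database implementation; the <code>one_hot</code> helper and the value "REFUND" are hypothetical, while the category list and the output column naming follow the cells in this notebook.</p>

```python
CATEGORIES = ["CASH_OUT", "CASH_IN", "TRANSFER", "DEBIT", "PAYMENT"]


def one_hot(value, column="txn_type", categories=CATEGORIES, other="other"):
    """Return one 0/1 indicator per known category plus an 'other' indicator."""
    encoded = {f"{column}_{category}": int(value == category)
               for category in categories}
    # Any value outside the known categories lands in the 'other' column.
    encoded[f"{column}_{other}"] = int(value not in categories)
    return encoded


print(one_hot("TRANSFER"))
print(one_hot("REFUND"))  # hypothetical value that is not in the category list
```

<p>Exactly one indicator is set for every input value, so the encoded columns always sum to 1.</p>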

In [None]:
# Perform native one-hot encoding on the data
# These functions use a "fit-and-transform" pattern
# that supports reuse and easier operationalization of the transformation process

from teradataml import OneHotEncodingFit, OneHotEncodingTransform

res_ohe = OneHotEncodingFit(data = tdf_test, 
                            target_column = 'txn_type', 
                            categorical_values = ['CASH_OUT', 'CASH_IN', 'TRANSFER', 'DEBIT', 'PAYMENT'], 
                            other_column = 'other',
                            is_input_dense = True)

res_transformed = OneHotEncodingTransform(data = tdf_test, object = res_ohe.result, is_input_dense = True)
res_transformed.result.head(5)

<hr>

<p style = 'font-size:18px;font-family:Arial'><b>Execute the Scoring function</b></p>

<p>Now that the categorical column has been encoded, the XGBoost model can be called.  This is executed via the <b>Apply</b> method, where we pass:</p>

<ul>
    <li>The data set to be scored.  This is a "Virtual" DataFrame, and represents the state of the data <b>In Vantage</b>.  In the case below, we pass the transformed data <i>less</i> the columns we don't need for scoring, put together using method chaining</li>
    <li>The command to run - in this case, calling the python runtime</li>
    <li>The format of the data being returned from the functions</li>
    <li>The custom container to execute the queries and code</li>
    </ul>
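<p>Because the scoring call in this notebook declares every returned column as a character type, the values need to be converted before any numeric analysis on the client.  The sketch below shows that conversion for a single row; <code>cast_scored_row</code> and the sample values are hypothetical, and only the column names come from this notebook.</p>

```python
def cast_scored_row(row):
    """Convert one scored row of strings into typed Python values."""
    return {
        "txn_id": row["txn_id"],
        "Prob_0": float(row["Prob_0"]),
        "Prob_1": float(row["Prob_1"]),
        "Prediction": int(row["Prediction"]),
        "Actual": int(row["Actual"]),
    }


# Hypothetical row as it might look when every column is returned as text.
raw = {"txn_id": "t1", "Prob_0": "0.25", "Prob_1": "0.75",
       "Prediction": "1", "Actual": "1"}
typed = cast_scored_row(raw)
print(typed)
```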
    

<p>Finally, the script is executed by calling the "execute_script" method; this "lazy" evaluation allows for more modular and performant architecture.</p>


In [None]:

apply_obj = Apply(data = res_transformed.result.drop(['step', 'nameOrig', 'nameDest', 'isFlaggedFraud'], axis = 1),
                  apply_command = 'python3 Demo_XGB_Scoring.py',
                  returns = {'txn_id': VARCHAR(20), 'Prob_0': VARCHAR(30), 
                             'Prob_1': VARCHAR(30), 'Prediction':VARCHAR(2),
                             'Actual': VARCHAR(2)},
                  env_name = demo_env,
                 )

In [None]:
# Execute the Python script inside the remote user environment.
# The result is a teradataml DataFrame. 
#


scored_data = apply_obj.execute_script()

# Only return five rows - minimize network overhead
scored_data.head(5)

<hr>

<p style = 'font-size:18px;font-family:Arial'><b>Analyze the Results</b></p>

<p>It is common practice to measure the efficacy of a model.  For this demonstration, a "Confusion Matrix" is generated that shows the quantity of true vs. false positives and negatives for the model.</p> 
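<p>For readers new to the confusion matrix, the plain-Python sketch below counts the four cells by hand and derives accuracy, precision, and recall from them.  The label lists are made-up illustration data, not results from this demonstration.</p>

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary 0/1 labels."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for a, p in zip(actual, predicted):
        if a == 1:
            counts["tp" if p == 1 else "fn"] += 1
        else:
            counts["fp" if p == 1 else "tn"] += 1
    return counts


# Made-up labels for illustration only.
actual = [0, 0, 1, 1, 0, 1]
predicted = [0, 1, 1, 0, 0, 1]

c = confusion_counts(actual, predicted)
accuracy = (c["tp"] + c["tn"]) / len(actual)
precision = c["tp"] / (c["tp"] + c["fp"])
recall = c["tp"] / (c["tp"] + c["fn"])
print(c, accuracy, precision, recall)
```

<p>In scikit-learn's layout for binary labels, these counts correspond to the matrix [[tn, fp], [fn, tp]].</p>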

In [None]:
# Copy the predictions to the client
# to generate the simple Confusion Matrix
# and print the AUC (Area Under Curve)

df_test = scored_data.to_pandas(all_rows = True)
cm = confusion_matrix(df_test['Actual'].astype(int), df_test['Prediction'].astype(int))
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = ['0', '1'])
fig, ax = plt.subplots(figsize=(10,10))
disp.plot(ax=ax)

plt.show()

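# Get the AUC score from the predicted fraud probability rather than the
# hard 0/1 prediction, so the score reflects ranking across all thresholds.
# As a rough rule of thumb, anything over .75 is decent.
AUC = roc_auc_score(df_test['Actual'].astype(int), df_test['Prob_1'].astype(float))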
print(f'AUC: {AUC}')

<hr>

<p style = 'font-size:18px;font-family:Arial'><b>Disconnect from Vantage</b></p>

<p>Once complete, one can remove the custom environment (if desired) and close the "context" to the Vantage system.</p> 

In [None]:
# check cluster status
res = check_cluster_stop(compute_group = compute_group)

In [None]:
# uninstall the libraries from the environment first before removing it
demo_env.uninstall_lib(libs = demo_env.libs['name'].to_list())
remove_env(demo_env.env_name)

In [None]:
remove_context()

<hr>
<p style = 'font-size:28px;font-family:Arial'><b>Appendix - Model Training and Evaluation</b></p>

<p>VantageCloud Lake Edition <b>Analytic Clusters</b> and <b>ClearScape Analytics</b> functions can also be leveraged for model training.  This brief addendum shows an abbreviated process for developing and testing an open-source fraud detection model with Vantage and XGBoost.</p>

<hr>

<p style = 'font-size:18px;font-family:Arial'><b>Connect to Vantage</b></p>

<p>If necessary, connect to Vantage. If the context is still valid from above this doesn't need to be run.  The below code will read in a variables file (vars.json - this has been used in prior environment setup and data engineering examples) and will connect to Vantage with this information.  The Vantage connection is referred to as a "Context" - a common python-rdbms connection architecture.</p> 

In [None]:
# load vars json
# same path as in the main demonstration above
with open('../../vars.json', 'r') as f:
    session_vars = json.load(f)

# Create the SQLAlchemy Context
host = session_vars['environment']['host']
username = session_vars['hierarchy']['users']['business_users'][1]['username']
password = session_vars['hierarchy']['users']['business_users'][1]['password']

# UES Authentication information
ues_url = session_vars['environment']['UES_URI']
configure.ues_url = ues_url
pat_token = session_vars['hierarchy']['users']['business_users'][1]['pat_token']
pem_file = session_vars['hierarchy']['users']['business_users'][1]['key_file']

compute_group = session_vars['hierarchy']['users']['business_users'][1]['compute_group']

# check for existing connection
eng = check_and_connect(host=host, username=username, password=password, compute_group = compute_group)
print(eng)

<hr>

<p style = 'font-size:18px;font-family:Arial'><b>Get a reference to the data</b></p>

<p>Create a <b>Teradataml DataFrame</b> which references the data set in Vantage.  This could be a table stored in direct-attach block storage, Performance-Optimized Object Storage (<b>OFS</b>), or stored in an open format in any Object Store.</p> 

<p>Teradataml DataFrames do not copy data into local memory, so complex analytic and transformation operations can run against data at any scale, while leveraging the parallel processing and workload isolation of Vantage Analytic Clusters.</p> 

In [None]:
# Update variables to ensure they are the same
tdf_test = DataFrame('"demo_ofs"."txn_history"')
tdf_test.head(5)

<hr>

<p style = 'font-size:18px;font-family:Arial'><b>Engineer Features</b></p>

<p>Call the ClearScape <b>One Hot Encoding</b> function to transform the categorical column into numeric features.</p>

In [None]:
from teradataml import OneHotEncodingFit, OneHotEncodingTransform

res_ohe = OneHotEncodingFit(data = tdf_test, 
                            target_column = 'txn_type', 
                            categorical_values = ['CASH_OUT', 'CASH_IN', 'TRANSFER', 'DEBIT', 'PAYMENT'], 
                            other_column = 'other',
                            is_input_dense = True)

res_transformed = OneHotEncodingTransform(data = tdf_test, object = res_ohe.result, is_input_dense = True)
res_transformed.result.head(5)

<hr>

<p style = 'font-size:18px;font-family:Arial'><b>Design for Operations</b></p>

<p>Persist the "Fit" table so that it can be reused to transform new data in operation.</p>

In [None]:
# copy the fit table to a permanent table for use later
res = copy_to_sql(res_ohe.result, table_name = 'OHE_FIT_TABLE', schema_name = 'demo_ofs', if_exists = 'replace')

<hr>

<p style = 'font-size:18px;font-family:Arial'><b>Test/Train Split</b></p>

<p>The "Sample" function can split the data into multiple data sets in seconds.  Each row is tagged with a "sampleid" that identifies the split it belongs to.</p>
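<p>The cell below keeps the rows with sampleid 1 (the 0.2 fraction) as test data and the rows with sampleid 2 (the 0.8 fraction) as training data.  The plain-Python sketch here mimics that numbering on ten row indices.  It is not how Vantage samples internally; <code>assign_sample_ids</code> is a hypothetical helper for illustration.</p>

```python
import random


def assign_sample_ids(n_rows, fractions=(0.2, 0.8), seed=42):
    """Randomly assign each row index a 1-based sample id by fraction."""
    rng = random.Random(seed)
    order = list(range(n_rows))
    rng.shuffle(order)
    sample_ids = [0] * n_rows  # 0 means the row fell in no sample
    start = 0
    for sample_id, fraction in enumerate(fractions, start=1):
        count = round(n_rows * fraction)
        for row in order[start:start + count]:
            sample_ids[row] = sample_id
        start += count
    return sample_ids


ids = assign_sample_ids(10)
test_rows = [i for i, s in enumerate(ids) if s == 1]
train_rows = [i for i, s in enumerate(ids) if s == 2]
print(len(test_rows), len(train_rows))
```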

In [None]:
tdf_samples = res_transformed.result.sample(frac = [0.2, 0.8])
copy_to_sql(tdf_samples[tdf_samples['sampleid'] == 2], table_name = 'txns_train', schema_name = 'demo_ofs', if_exists = 'replace')
copy_to_sql(tdf_samples[tdf_samples['sampleid'] == 1], table_name = 'txns_test', schema_name = 'demo_ofs', if_exists = 'replace')

<hr>

<p style = 'font-size:18px;font-family:Arial'><b>Train the Model</b></p>

<p>Use the open-source XGBoost classifier to train the model on the "training" data split above.</p>

In [None]:
# Create a Pandas DataFrame
df_train = DataFrame('"demo_ofs"."txns_train"').to_pandas(all_rows = True)

# define the input columns and target variable:
X_train = df_train[['txn_type_CASH_OUT', 'txn_type_CASH_IN', 'txn_type_TRANSFER',
       'txn_type_DEBIT', 'txn_type_PAYMENT', 'txn_type_other', 'amount','oldbalanceOrig', 'newbalanceOrig',
       'oldbalanceDest', 'newbalanceDest']]
y_train = df_train[['isFraud']]

In [None]:
# Fit the Model
warnings.filterwarnings('ignore')
from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(X_train, y_train)

<hr>

<p style = 'font-size:18px;font-family:Arial'><b>Test the Model</b></p>

<p>It is common practice to measure the efficacy of a model.  For this demonstration, a "Confusion Matrix" is generated that shows the quantity of true vs. false positives and negatives for the model.</p> 

In [None]:
# Return a Pandas DataFrame from the split data above

df_test = DataFrame('"demo_ofs"."txns_test"').to_pandas(all_rows = True)

# Define the input columns and target
X_test = df_test[['txn_type_CASH_OUT', 'txn_type_CASH_IN', 'txn_type_TRANSFER',
       'txn_type_DEBIT', 'txn_type_PAYMENT', 'txn_type_other', 'amount','oldbalanceOrig', 'newbalanceOrig',
       'oldbalanceDest', 'newbalanceDest']]
y_test = df_test[['isFraud']]


# Predict the class and the probability of Fraud
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)


# Generate the Confusion Matrix
df_test[['prob_0', 'prob_1']] = y_prob
df_test['prediction'] = y_pred

cm = confusion_matrix(df_test['isFraud'], df_test['prediction'])
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = ['0', '1'])
fig, ax = plt.subplots(figsize=(10,10))
disp.plot(ax=ax)

plt.show()

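# Get the AUC score from the fraud probability (prob_1) rather than the hard
# 0/1 prediction; as a rough rule of thumb, anything over .75 is decent
AUC = roc_auc_score(df_test['isFraud'], df_test['prob_1'])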
print(f'AUC: {AUC}')

<hr>

<p style = 'font-size:18px;font-family:Arial'><b>Save the Model</b></p>

<p>Save the model file in native xgboost format.  This is used above in the main demonstration.</p> 

In [None]:
model.save_model('xgb_model')

In [None]:
remove_context()