<header>
   <p  style='font-size:36px;font-family:Arial;color:#F0F0F0; background-color: #00233c; padding-left: 20pt; padding-top: 20pt;padding-bottom: 10pt; padding-right: 20pt;'>
       Cancer Prediction using Sagemaker XGBoost estimator with tdapiclient
  <br>
       <img id="teradata-logo" src="https://storage.googleapis.com/clearscape_analytics_demo_data/DEMO_Logo/teradata.svg" alt="Teradata" style="width: 125px; height: auto; margin-top: 20pt;">
    </p>
</header>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Worldwide, breast cancer is the most common type of cancer in women and the second highest in terms of mortality rates. Breast cancer is diagnosed when an abnormal lump is found (through self-examination or x-ray) or a tiny speck of calcium is seen (on an x-ray). After a suspicious lump is found, the doctor conducts a diagnosis to determine whether it is cancerous and, if so, whether it has spread to other parts of the body. ClearScape Analytics in Vantage provides various machine learning techniques for developing predictive models for cancer diagnosis. In this dataset, features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe the characteristics of the cell nuclei.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Amazon SageMaker is a platform for data scientists to train and deploy machine learning models as a service on the Amazon cloud, accessible through addresses called 'endpoints'. With the Teradata Vantage API_Request function, we can connect to these AWS endpoints directly from Vantage to score data in real time. Here, Amazon SageMaker is used to orchestrate training of an Extreme Gradient Boosting (XGBoost) model and to deploy it to an endpoint. XGBoost is an open-source library that provides a highly efficient implementation of gradient boosting, an ensemble learning method that combines many decision trees and uses regularization to help prevent overfitting.</p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Business Values</b></p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Comprehensive health predictions and a reduced number of false positive and false negative results.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Reduced cost to patients and hospitals caused by cancer.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Identify patterns and symptoms leading to breast cancer to ensure early intervention.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'>Advanced research and development stemming from the results of the data and models produced.</li></p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Why Vantage?</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Machine Learning and AI have a proven track record of improving patient outcomes and well-being across the entire healthcare industry. Traditional approaches to data preparation, model development, and deployment rely on manual, error-prone processes that prevent enterprises from realizing the true value of these tools and techniques.</p>
 
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>However, Vantage provides these same proven data preparation and machine learning capabilities, integrated as native ClearScape Analytic functions.  This allows organizations to drastically reduce data preparation, model development, and testing time, while allowing for much more frequent and iterative testing and tuning to ensure maximum life-critical accuracy.</p>
 
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Furthermore, the exact same development pipeline can be deployed seamlessly to production, eliminating the traditional development-to-deployment gap in the ML and AI industry.</p>

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>1. Initial setup</b></p>
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Install packages</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this section, we will install the sagemaker and tdapiclient packages along with the packages that tdapiclient depends on.</p>

In [None]:
%%capture
# '%%capture' suppresses the output of the installation steps for the following packages
!pip install sagemaker
!pip install --upgrade numpy pyopenssl
!pip install tdapiclient 
!pip install google-cloud-aiplatform

<div class="alert alert-block alert-info">
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: </b><i>The install statements above are only needed if you run the notebook on a platform other than ClearScape Analytics Experience that does not already have these libraries installed. If you run them, be sure to restart the kernel afterwards to bring the installed libraries into memory. The simplest way to restart the kernel is by typing zero zero: <b> 0 0</b></i></p>
</div>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Setting up AWS Sagemaker credentials</b></p>

<div class="alert alert-block alert-warning">
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: </b><i>This notebook cannot be executed unless you have an AWS account with the necessary permissions for Amazon SageMaker. Information about the AWS account and the required permissions is given below.</i></p>
</div>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Required AWS Credentials:</b>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><b>AWS_ACCESS_KEY_ID:</b> Specifies AWS Access Key ID.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><b>AWS_SECRET_ACCESS_KEY:</b> Specifies AWS Secret Access Key.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><b>AWS_REGION:</b> Specifies the AWS region.
If this is defined, it overrides the values in the environment variable AWS_DEFAULT_REGION.</li></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>How to Get These Inputs:</b>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION: These credentials are related to your AWS account and subscription. If you already have an AWS account and an active subscription, you can find these credentials in the AWS portal. Here's how:
    
<ul style="font-size: 16px; font-family: Arial;;color:#00233C">
            <li><a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html">Find your AWS Access Key ID</a></li>
            <li><a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html">Find your AWS Secret Access Key</a></li>
            <li><a href="https://docs.aws.amazon.com/awsconsolehelpdocs/latest/gsg/select-region.html">Select an AWS Region</a></li>
        </ul>
</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will also need an S3 bucket to which our code and model artifacts are uploaded when using the XGBoost functions. If you already have a bucket, you can reuse it; otherwise, see this <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html">link</a> to create a new one.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Since we use the AWS account to access SageMaker, the user needs specific SageMaker permissions. For details on granting them, see <a href= "https://docs.aws.amazon.com/sagemaker/latest/dg/api-permissions-reference.html">Amazon SageMaker API Permissions</a>.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We can also create a new role using the <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/role-manager.html">Amazon SageMaker Role Manager</a>.</p>
   
</p>


<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In case you do not have an AWS account you can create one by following the steps mentioned <a href="https://aws.amazon.com/free/?trk=2738afd4-9401-4d18-8e3e-1b1c194dea07&sc_channel=ps&ef_id=EAIaIQobChMIkbKO_a7NhAMV1aJmAh28rQdyEAAYASAAEgL26PD_BwE:G:s&s_kwcid=AL!4422!3!509606977827!p!!g!!aws%20amazon%20com!12618685604!120373367976&gclid=EAIaIQobChMIkbKO_a7NhAMV1aJmAh28rQdyEAAYASAAEgL26PD_BwE&all-free-tier.sort-by=item.additionalFields.SortRank&all-free-tier.sort-order=asc&awsf.Free%20Tier%20Types=*all&awsf.Free%20Tier%20Categories=*all">here</a>.</p> 

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>2. Connect to Vantage</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In this section, we import the required libraries and set environment variables and environment paths (if required).</p>

In [None]:
# Import necessary libraries.
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

import getpass
import sagemaker
import tdapiclient 

from tdapiclient import create_tdapi_context, remove_tdapi_context,TDApiClient
from teradataml import *
import pandas as pd
import numpy as np
from teradatasqlalchemy.types import *
import os
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
display.max_rows = 5
configure.val_install_location = "val"

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will be prompted to provide the password. We will enter the password, press the Enter key, and then use the down arrow to go to the next cell. Begin running steps with Shift + Enter keys.</p>

In [None]:
%run -i ../startup.ipynb
eng = create_context(host = 'host.docker.internal', username='demo_user', password = password)
print(eng)

In [None]:
%%capture
execute_sql('''SET query_band='DEMO=CancerPrediction_tdapiClient_Sagemaker.ipynb;' UPDATE FOR SESSION; ''')

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>3. Getting Data for This Demo</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We have provided data for this demo on cloud storage. We have the option of either running the demo using foreign tables to access the data without using any storage on our environment or downloading the data to local storage, which may yield somewhat faster execution. However, we need to consider available storage. There are two statements in the following cell, and one is commented out. We may switch which mode we choose by changing the comment string.</p>   


In [None]:
# %run -i ../run_procedure.py "call get_data('DEMO_CancerPrediction_cloud');"
 # Takes about 50 seconds
%run -i ../run_procedure.py "call get_data('DEMO_CancerPrediction_local');"
 # Takes about 2 minutes 30 seconds

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Optional step – We should execute the below step only if we want to see the status of databases/tables created and space used.</p>

In [None]:
%run -i ../run_procedure.py "call space_report();"

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>4. Analyze the raw data set</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let us start by creating a teradataml DataFrame, a "virtual DataFrame" that points directly to the dataset in Vantage.</p>



In [None]:
# Creating a teradataml dataframe using the table.
df = DataFrame(in_schema("DEMO_CancerPrediction","Patient_Data"))
df

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>5. Data Preparation</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Label encoding re-expresses the existing values of a categorical column (variable) in a new coding scheme; it can also be used to correct data quality problems or to focus an analysis on a particular value. It allows individual values, NULL values, or any number of remaining values (ELSE option) to be mapped to a new value, a NULL value, or the same value. Label encoding supports character, numeric, and date type columns.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The output of this function is passed to the "label_encode" argument of the "Transform" function from the Vantage Analytic Library.</p>
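<p style = 'font-size:16px;font-family:Arial;color:#00233C'>To make the mapping concrete, here is a minimal plain-Python sketch of the same idea: one value is mapped to 1 and everything else falls back to a default of 0. The helper <code>label_encode</code> and the sample values are hypothetical and only illustrate the concept; the actual encoding in this notebook is performed in-database by the Vantage Analytic Library.</p>

```python
def label_encode(values, mapping, default=0):
    """Map each value through `mapping`; unmapped values fall back to `default`."""
    return [mapping.get(value, default) for value in values]


# Hypothetical diagnosis values; only "M" is mapped explicitly.
diagnosis = ["M", "B", "B", "M"]
encoded = label_encode(diagnosis, {"M": 1}, default=0)
print(encoded)
```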

In [None]:
# Encoding the target column using label encoder.
from teradataml import LabelEncoder 
rc = LabelEncoder(values=("M", 1), columns=["diagnosis"], default=0)

In [None]:
feature_columns_names= Retain(columns=["radius_mean","texture_mean","perimeter_mean","area_mean","smoothness_mean",         
                                       "compactness_mean","concavity_mean","concave_points_mean","symmetry_mean",    
                                       "fractal_dimension_mean","radius_se","texture_se","perimeter_se","area_se",
                                       "smoothness_se","compactness_se","concavity_se","concave_points_se",
                                       "symmetry_se","fractal_dimension_se","radius_worst","texture_worst",
                                       "perimeter_worst","area_worst","smoothness_worst","compactness_worst",
                                       "concavity_worst","concave_points_worst","symmetry_worst",
                                       "fractal_dimension_worst" ])

<p style = 'font-size:16px;font-family:Arial;color:#00233C'> The Variable Transformation analysis reads a teradataml DataFrame and produces an output containing transformed columns. This is useful when preparing data for input to an analytic algorithm. For example, a K-Means Clustering algorithm typically produces better results when the input columns are first converted to their Z-Score values to put all input variables on an equal footing, regardless of their magnitude.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Function supports following transformations:</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>Binning</code> - Binning replaces a continuous numeric column with a categorical one to produce ordinal values (for example, numeric categorical values where order is meaningful).</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>Derive</code> - The Derive transformation requires the free-form transformation be specified as a formula.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>One Hot Encoding</code> - One Hot Encoding is useful when a categorical data element must be re-expressed as one or more numeric data elements, creating a binary numeric field for each categorical data value.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>Missing Value</code> Treatment or Null Replacement.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>Label Encoding</code> - Re-expresses existing values of a categorical data column (variable) in a new coding scheme, either to correct data quality problems or to focus an analysis on a value.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>Min-Max Scaling</code> - Limits the upper and lower boundaries of the data in a continuous numeric column using a linear rescaling function based on maximum and minimum data values.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>Retain</code> - Allows copying of one or more columns into the final analytic data set.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>Sigmoid</code> - Provides rescaling of continuous numeric data using a type of sigmoid or s-shaped function.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>ZScore</code> - Provides rescaling of continuous numeric data using Z-Scores.</li></p>
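<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Two of the rescaling transformations listed above can be sketched in a few lines of plain Python. This is an illustration of the arithmetic only, with hypothetical helper names and sample values; it assumes the population standard deviation for the Z-Score, which may differ from what the in-database function uses.</p>

```python
from statistics import mean, pstdev


def zscore(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumes the values are not all identical
    return [(v - mu) / sigma for v in values]


def min_max_scale(values, lower=0.0, upper=1.0):
    """Linearly rescale values so the minimum maps to `lower` and the maximum to `upper`."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [lower + (v - lo) * (upper - lower) / (hi - lo) for v in values]


radius = [10.0, 12.0, 14.0, 20.0]  # hypothetical measurements
standardized = zscore(radius)
scaled = min_max_scale(radius)
print(standardized)
print(scaled)
```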

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Here, we will use the Label Encoding option for the diagnosis column.</p>

In [None]:
data = valib.Transform(data=df, label_encode=rc,index_columns="id",unique_index=True,retain=feature_columns_names)
df=data.result
df

In [None]:
# Re-arranging columns such that the target column is first and there's no header in the dataset.
df=df.drop("id",axis=1)

In [None]:
df= df.select(["diagnosis","radius_mean","texture_mean","perimeter_mean","area_mean","smoothness_mean",         
               "compactness_mean","concavity_mean","concave_points_mean","symmetry_mean","fractal_dimension_mean",   
               "radius_se","texture_se","perimeter_se","area_se","smoothness_se","compactness_se","concavity_se",
               "concave_points_se","symmetry_se","fractal_dimension_se","radius_worst","texture_worst",
               "perimeter_worst","area_worst","smoothness_worst","compactness_worst","concavity_worst",
               "concave_points_worst","symmetry_worst","fractal_dimension_worst" ])

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We use the sample function on the teradataml DataFrame. This function samples rows from a DataFrame, either directly or based on conditions. It adds a new column, 'sampleid', that identifies which sample each row belongs to.</p>
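<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Conceptually, a fractional sample assigns each row to one numbered sample so that the sample sizes follow the requested fractions. The sketch below shows that idea in plain Python; <code>sample_ids</code> is a hypothetical helper, and the in-database sampling performed by teradataml may select rows differently.</p>

```python
import random


def sample_ids(n_rows, fractions, seed=42):
    """Return a sample number (1, 2, ...) for each row, sized by `fractions`."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)  # random row order, reproducible via seed
    ids = [0] * n_rows  # 0 means the row was not placed in any sample
    start = 0
    for sample_number, fraction in enumerate(fractions, start=1):
        size = round(n_rows * fraction)
        for row in order[start:start + size]:
            ids[row] = sample_number
        start += size
    return ids


ids = sample_ids(100, [0.6, 0.2, 0.2])
counts = {k: ids.count(k) for k in (1, 2, 3)}
print(counts)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Filtering on the sample number then yields disjoint train, validation, and test sets, which is what the cells that follow do with the 'sampleid' column.</p>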

In [None]:
# Create 3 samples of input data - sample 1 will have 60% of total rows and sample 2 and 3 will have 20% of total rows.
cancer_sample = df.sample(frac=[0.6, 0.2,0.2])
cancer_sample

In [None]:
# Create train dataset from sample 1 by filtering on "sampleid" and drop "sampleid" column as it is not required for training model.
train = cancer_sample[cancer_sample.sampleid == "1"].drop("sampleid", axis = 1)
train

In [None]:
# Create validate dataset from sample 2 by filtering on "sampleid" and drop "sampleid" column as it is not required for training model.
validate = cancer_sample[cancer_sample.sampleid == "2"].drop("sampleid", axis = 1)
validate

In [None]:
# Create test dataset from sample 3 by filtering on "sampleid" and drop "sampleid" column as it is not required for training model.
test = cancer_sample[cancer_sample.sampleid == "3"].drop("sampleid", axis = 1)
test

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>6. Amazon Sagemaker details and Vantage API settings</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The teradataml SageMaker extension library (tdapiclient) is intended for client development tools such as Jupyter Notebook, running either on a local desktop or in SageMaker notebooks. To use it, you need valid credentials (temporary or permanent) for AWS as well as for the database.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will have to create the following environment variables before using the module.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Required environment variables:</b></p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><b>AWS_ACCESS_KEY_ID:</b> Specifies AWS Access Key ID.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><b>AWS_SECRET_ACCESS_KEY:</b> Specifies AWS Secret Access Key.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><b>AWS_REGION:</b> Specifies the AWS region.
If this is defined, it overrides the values in the environment variable AWS_DEFAULT_REGION.</li></p>
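<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The precedence described above can be sketched as a small check to run before creating the context. The helpers <code>resolve_region</code> and <code>missing_credentials</code> are hypothetical and only mirror the rule stated here; they are not part of tdapiclient, and the values shown are placeholders.</p>

```python
def resolve_region(env):
    """Prefer AWS_REGION; fall back to AWS_DEFAULT_REGION; None if neither is set."""
    return env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION")


def missing_credentials(env):
    """List the required credential variables that are absent or empty."""
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    return [name for name in required if not env.get(name)]


# Placeholder environment for illustration only.
example_env = {
    "AWS_ACCESS_KEY_ID": "example-id",
    "AWS_DEFAULT_REGION": "us-west-2",
    "AWS_REGION": "us-east-1",
}
print(resolve_region(example_env))
print(missing_credentials(example_env))
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>In a notebook, the same functions could be called with <code>os.environ</code> after the variables have been set.</p>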

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The API_Request in-database function is used for model scoring and inference. Before using it, the Amazon SageMaker model must already have been trained and deployed to an endpoint on AWS.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We also need the endpoint address, region and AWS credentials (which have permissions to use this in-database function) to execute a function to score Vantage data with this AWS analytic service.</p>
    
    
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Please enter your valid AWS credentials.</b></p>



In [None]:
s3_bucket = input("S3 bucket path (bucket name followed by a sub-folder, if any, for example: bucket-name/sub-folder): ")
access_id = input("Access ID: ")
access_key = getpass.getpass("Access Key: ")
region = input("AWS Region: ")
exec_role = input("AWS SageMaker execution role: ")

# Example of the required values
# s3_bucket = "sage-demo"
# access_id = "AKIAQ5EGP9D********"
# access_key = "j4WIGWfbe6VWMpVFAzzz***************"
# region = "us-east-1"
# exec_role = "arn:aws:iam::************:role/sagemaker-demo-role"

<div class="alert alert-block alert-warning">
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: </b><i>If the credentials entered are not valid or do not have the necessary permissions, we may get an error in the estimator step.</i></p>
</div>
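<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The cell that follows interpolates the bucket path into S3 locations, so a stray trailing slash or an <code>s3://</code> prefix in the input would produce a malformed URI. A minimal sketch of a defensive join is shown below; <code>s3_uri</code> is a hypothetical helper and is not used elsewhere in this notebook.</p>

```python
def s3_uri(bucket_path, *parts):
    """Join a 'bucket-name/sub-folder' path and extra parts into an s3:// URI."""
    cleaned = bucket_path.strip()
    if cleaned.startswith("s3://"):
        cleaned = cleaned[len("s3://"):]
    segments = [cleaned.strip("/")] + [part.strip("/") for part in parts]
    return "s3://" + "/".join(segments)


print(s3_uri("sage-demo/sub-folder/", "xgboost", "code"))
print(s3_uri("s3://sage-demo", "xgboost", "artifacts"))
```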

In [None]:
# Assign entered credentials


# Bucket location where your custom code will be saved in the tar.gz format.
custom_code_upload_location = "s3://{}/xgboost/code".format(s3_bucket)

# Bucket location where results of model training are saved.
model_artifacts_location = "s3://{}/xgboost/artifacts".format(s3_bucket)

os.environ["AWS_ACCESS_KEY_ID"] = access_id
os.environ["AWS_SECRET_ACCESS_KEY"] = access_key
os.environ["AWS_REGION"] = region

tdapi_context = create_tdapi_context("aws", bucket_path=s3_bucket) 
td_apiclient = TDApiClient(tdapi_context)

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>7. SageMaker XGBoost with tdapiclient</b></p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Using XGBoost estimator with tdapiclient</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We create a SageMaker XGBoost estimator instance through TDApiClient. To run our training script on SageMaker, we construct a sagemaker.xgboost.estimator.XGBoost estimator, which accepts several constructor arguments:</p>

<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>entry_point</code>: The path to the Python script that SageMaker runs for training and prediction.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>role (str or PipelineVariable)</code> – An AWS IAM role name or ARN. Amazon SageMaker training jobs use this role to access AWS resources, such as data stored in Amazon S3.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>framework_version (str)</code> – The version of the framework. Value is ignored when image_uri is provided.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>instance_count (int or PipelineVariable)</code> – The number of instances to use for training.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>instance_type (str or PipelineVariable)</code> – The type of EC2 instance to use for training, for example, ‘ml.c4.xlarge’.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>code_location (str)</code> – The S3 prefix URI where custom code will be uploaded (default: None). The code file uploaded to S3 is ‘code_location/job-name/source/sourcedir.tar.gz’. If not specified, the default code location is ‘s3://{sagemaker-default-bucket}’.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>hyperparameters (optional)</code>: A dictionary passed to the train function as hyperparameters.</li></p>
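<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The contents of the entry-point script are not shown in this notebook. As a rough sketch, a script-mode entry point typically receives hyperparameters as command-line arguments and reads data and model locations from environment variables such as <code>SM_CHANNEL_TRAIN</code> and <code>SM_MODEL_DIR</code>. The function <code>parse_training_args</code> below is hypothetical and assumes that convention; it is not the script used by this demo.</p>

```python
import argparse
import os


def parse_training_args(argv=None, environ=None):
    """Parse hyperparameters and data locations the way a script-mode entry point might."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    # Hyperparameters set on the estimator arrive as command-line arguments.
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--eta", type=float, default=0.2)
    parser.add_argument("--num_round", type=int, default=30)
    # Locations fall back to the standard container paths when the variables are unset.
    parser.add_argument("--model_dir", default=environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train", default=environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    parser.add_argument("--validation",
                        default=environ.get("SM_CHANNEL_VALIDATION", "/opt/ml/input/data/validation"))
    # Ignore any hyperparameters this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_training_args(["--max_depth", "5", "--eta", "0.2", "--gamma", "4"], environ={})
print(args.max_depth, args.eta, args.num_round, args.train)
```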



In [None]:
exec_role_arn = exec_role
xgboost_estimator = td_apiclient.XGBoost(
    entry_point="script.py",
    role=exec_role_arn,
    output_path=model_artifacts_location,
    code_location=custom_code_upload_location,
    instance_count=1,
    instance_type="ml.m5.xlarge", 
    framework_version="1.3-1",
    trainingSparkDataFormat="csv",
    trainingContentType="csv")

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The hyperparameters that can be used in the XGBoost model include:</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>num_class:</code> The number of classes. Required if objective is set to multi:softmax or multi:softprob.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>num_round:</code> The number of rounds to run the training.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>eta:</code> Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter actually shrinks the feature weights to make the boosting process more conservative.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>eval_metric:</code> Evaluation metrics for validation data. </li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>max_depth:</code> Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit. 0 indicates no limit. A limit is required when grow_policy = depth-wise.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>max_leaves:</code> Maximum number of nodes to be added. Relevant only if grow_policy is set to lossguide.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>min_child_weight:</code> Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, the building process gives up further partitioning. In linear regression models, this simply corresponds to a minimum number of instances needed in each node. </li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>subsample:</code> Subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collects half of the data instances to grow trees. This prevents overfitting.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>csv_weights:</code> When this flag is enabled, XGBoost differentiates the importance of instances for csv input by taking the second column (the column after labels) in training data as the instance weights.</li>

</p>
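<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The interplay of <code>eta</code> and <code>num_round</code> can be seen in a deliberately tiny toy model. In the sketch below, each boosting round adds only a fraction <code>eta</code> of the mean residual, so a smaller step size needs more rounds to approach the target. This is an illustration of shrinkage under simplified assumptions (a constant "tree" and squared error), not XGBoost itself; <code>boost_to_mean</code> and the target values are hypothetical.</p>

```python
def boost_to_mean(targets, eta, num_round):
    """Toy boosting: each round adds eta times the mean residual to one shared prediction."""
    prediction = 0.0
    for _ in range(num_round):
        mean_residual = sum(t - prediction for t in targets) / len(targets)
        prediction += eta * mean_residual  # shrink the update by the step size
    return prediction


targets = [0.0, 1.0, 1.0, 0.0]  # hypothetical labels with mean 0.5
for rounds in (1, 5, 30):
    print(rounds, boost_to_mean(targets, eta=0.2, num_round=rounds))
```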

In [None]:
xgboost_estimator.set_hyperparameters(max_depth=5, 
                        eta=0.2, 
                        gamma=4,
                        min_child_weight=6, 
                        subsample=0.8, 
                        csv_weights=1,
                        num_round=30)
                        



<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Training xgboost estimator using teradataml dataframe objects</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This wrapper function executes the fit method of the Amazon SageMaker estimator, using teradataml DataFrames as the source for training. The fit method copies the data to S3 and then invokes the corresponding AWS Python API; the same pattern applies to the other AWS Python APIs callable through tdapiclient, as listed in the teradataml extension documentation.</p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Arguments:</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>inputs</code>: Required. Specifies a teradataml DataFrame or an S3 path as a string. It can be a single teradataml DataFrame, a string, or a dictionary mapping strings to teradataml DataFrames.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>content_type</code>: Specifies the content type for inputs. Default value is CSV.</li>
</p>
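<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Earlier in the notebook the columns were rearranged so that the target comes first and the data carries no header. The sketch below shows what such a CSV layout looks like for two rows; <code>to_training_csv</code> and the sample rows are hypothetical, and the actual export to S3 is handled by tdapiclient.</p>

```python
import csv
import io


def to_training_csv(rows, label_column, feature_columns):
    """Write rows as header-less CSV with the label in the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[label_column]] + [row[name] for name in feature_columns])
    return buffer.getvalue()


# Hypothetical rows with only two of the feature columns.
rows = [
    {"diagnosis": 1, "radius_mean": 17.99, "texture_mean": 10.38},
    {"diagnosis": 0, "radius_mean": 13.54, "texture_mean": 14.36},
]
training_csv = to_training_csv(rows, "diagnosis", ["radius_mean", "texture_mean"])
print(training_csv)
```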

In [None]:
# Start training using DataFrame objects.
xgboost_estimator.fit({'train': train, 'validation': validate }, content_type="csv",wait=True)  


<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Deploy XGBoost Estimator using Serializer/Deserializer </b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The TDApiClient.deploy method deploys an Amazon SageMaker model to Vantage or AWS. It executes the deploy method of the Amazon SageMaker estimator class, allowing integration with Teradata at scoring time. This function returns an instance of TDSagemakerPredictor.</p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'>Required Arguments:</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>platform:</code> Specifies the platform to which the given model will be deployed. Accepted values: "vantage", "aws-endpoint".</li>

<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>sagemaker_p_args:</code> Specifies all positional parameters required for the original SageMaker.deploy method.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>sagemaker_kw_args:</code> Specifies all kwarg parameters required for the original SageMaker.deploy method. We have used the below kwarg parameters:</li>
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>
<li style = 'font-size:14px;font-family:Arial;color:#00233C'><code>initial_instance_count (int)</code> – The initial number of instances to run in the Endpoint created from this Model. If not using serverless inference, it must be a number greater than or equal to 1 (default: None).</li>

<li style = 'font-size:14px;font-family:Arial;color:#00233C'><code>instance_type (str)</code> – The EC2 instance type to deploy this Model to. For example, ‘ml.p2.xlarge’, or ‘local’ for local mode. If not using serverless inference, then it is required to deploy a model. (default: None) </li>

<li style = 'font-size:14px;font-family:Arial;color:#00233C'><code>serializer (BaseSerializer)</code> – A serializer object, used to encode data for an inference endpoint (default: None). If serializer is not None, then serializer will override the default serializer. The default serializer is set by the predictor_cls. </li>

<li style = 'font-size:14px;font-family:Arial;color:#00233C'><code>deserializer (BaseDeserializer)</code> – A deserializer object, used to decode data from an inference endpoint (default: None). If deserializer is not None, then deserializer will override the default deserializer. The default deserializer is set by the predictor_cls.</li></ol></p>   
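<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a rough mental model, a CSV serializer turns one row of feature values into a comma-separated request body, and a CSV deserializer splits the endpoint's text response back into rows of strings. The two functions below are hypothetical stand-ins that illustrate this round trip; they are not the SageMaker classes, and the response text is made up.</p>

```python
def serialize_csv(row):
    """Encode one row of feature values as a comma-separated request body."""
    return ",".join(str(value) for value in row)


def deserialize_csv(body):
    """Decode a CSV response body into a list of rows, each a list of strings."""
    return [line.split(",") for line in body.splitlines() if line]


payload = serialize_csv([17.99, 10.38, 122.8])
scores = deserialize_csv("0.93\n0.07\n")  # made-up endpoint response
print(payload)
print(scores)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Because the decoded values are strings, they need to be cast to a numeric type before rounding or comparison.</p>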
    

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <code>xgboost_estimator</code> we are using is a TDApiClient object holding a model that has been trained on AWS. We will deploy it to an AWS endpoint.</p>

In [None]:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer
csv_ser = CSVSerializer()
csv_dser = CSVDeserializer()

In [None]:
sg_kw = {
    "instance_type": "ml.m5.large",
    "initial_instance_count": 1,
    "serializer": csv_ser,
    "deserializer": csv_dser
}
predictor = xgboost_estimator.deploy("aws-endpoint", sagemaker_kw_args=sg_kw)

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Predict using the tdapiclient predictor object </b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will use the TDPredictor.predict method to score a teradataml DataFrame against the SageMaker endpoint represented by this predictor object.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Required Arguments:</p>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>input:</code> Specifies the teradataml DataFrame used as input for scoring.</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><code>mode:</code> Specifies the mode for scoring.
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>Permitted values include:
<li style = 'font-size:14px;font-family:Arial;color:#00233C'><code>'UDF':</code> Score in database using a Teradata UDF. This is the default value. For this mode, the return is a teradataml DataFrame. This mode provides faster scoring with the data from Teradata.</li>

<li style = 'font-size:14px;font-family:Arial;color:#00233C'><code>'CLIENT':</code> Score at client side using a library. For this mode, the return is an array or JSON. When using this mode, data is pulled from Teradata and serialized for scoring at the client.</li></ol></p>



<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Optional Argument:
<ol style = 'font-size:16px;font-family:Arial;color:#00233C'>options: Specifies the predict method with the following key-value arguments:
<li style = 'font-size:14px;font-family:Arial;color:#00233C'><code>udf_name:</code> Specifies the name of the UDF used to invoke predict with UDF mode. Default value is 'tapidb.API_Request'.</li>
<li style = 'font-size:14px;font-family:Arial;color:#00233C'><code>content_type:</code> Specifies content type required for SageMaker endpoint present in the predictor. Default value is 'csv'.</li>
<li style = 'font-size:14px;font-family:Arial;color:#00233C'><code>key_start_index:</code> Specifies the index of the DataFrame column at which the key for scoring starts. Default value is 0.</li></ol>



In [None]:
# Score the test data in-database through the endpoint using UDF mode.
item = test
output = predictor.predict(item, mode="UDF", content_type='csv')
output

In [None]:
# Cast the raw endpoint output to FLOAT to get the predicted score.
pred_df = output.assign(pred=output.Output.cast(type_=FLOAT))
# Round the score to the nearest integer class (0 or 1).
pred_df = pred_df.assign(diagnosis_prediction=pred_df.pred.round(0).cast(type_=INTEGER))
# Store the predicted class as VARCHAR so it can be compared with the diagnosis labels.
pred_df = pred_df.assign(diagnosis_prediction=pred_df.diagnosis_prediction.cast(type_=VARCHAR(2)))
pred_df

In [None]:
from teradataml import ConvertTo
converted_data = ConvertTo(data = pred_df,
                           target_columns = ['diagnosis'],
                           target_datatype = ["VARCHAR(charlen=2,charset=LATIN,casespecific=NO)"])
pred_df=converted_data.result
pred_df

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The output column diagnosis_prediction will show the predicted values for occurrence of cancer in the patients.</p>
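<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The conversion from a raw score to a class label can be pictured with the plain-Python sketch below. It assumes a cutoff of 0.5, with a score of exactly 0.5 treated as positive. The function name <code>score_to_label</code> and the sample scores are hypothetical, and rounding in the database may treat exact ties differently.</p>

```python
def score_to_label(raw_score, threshold=0.5):
    """Convert a raw score (as text) into a '0'/'1' class label.

    The 0.5 threshold is an assumption made for this illustration.
    """
    probability = float(raw_score)
    return "1" if probability >= threshold else "0"


# Made-up scores in the text form a CSV response would carry.
raw_outputs = ["0.9731", "0.0124", "0.5512", "0.4987"]
labels = [score_to_label(score) for score in raw_outputs]
print(labels)  # ['1', '0', '1', '0']
```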

<hr style="height:2px;border:none;background-color:#00233C;">

<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>8. Evaluate the model</b></p>

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>8.1 Classification Evaluator</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The ClassificationEvaluator() function evaluates a classification model based on its predictions on the data and emits various metrics. Apart from accuracy, the secondary output data returns micro, macro, and weighted-averaged metrics of precision, recall, and F1-score values.</p>
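<p style = 'font-size:16px;font-family:Arial;color:#00233C'>As a rough illustration of how the three averaging schemes differ, the sketch below computes macro, micro, and weighted precision for a small made-up set of labels in plain Python. It is not how ClassificationEvaluator is implemented. The function names and the toy data are hypothetical. Recall and F1-score follow the same averaging pattern.</p>

```python
from collections import Counter


def class_precision_counts(y_true, y_pred, label):
    """Return (true positives, false positives) for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    return tp, fp


def precision_averages(y_true, y_pred, labels):
    """Compute macro, micro, and weighted precision over the given labels."""
    counts = {lab: class_precision_counts(y_true, y_pred, lab) for lab in labels}
    support = Counter(y_true)  # number of actual examples per class
    per_class = {
        lab: (tp / (tp + fp) if (tp + fp) else 0.0)
        for lab, (tp, fp) in counts.items()
    }
    # Macro: unweighted mean of the per-class precisions.
    macro = sum(per_class.values()) / len(labels)
    # Weighted: per-class precisions weighted by class support.
    weighted = sum(per_class[lab] * support[lab] for lab in labels) / len(y_true)
    # Micro: pool the counts across classes before dividing.
    total_tp = sum(tp for tp, _ in counts.values())
    total_fp = sum(fp for _, fp in counts.values())
    micro = total_tp / (total_tp + total_fp) if (total_tp + total_fp) else 0.0
    return {"macro": macro, "micro": micro, "weighted": weighted}


# Toy labels, not taken from the notebook's data.
toy_true = ["0", "0", "0", "0", "1", "1"]
toy_pred = ["0", "0", "0", "1", "1", "0"]
averages = precision_averages(toy_true, toy_pred, labels=["0", "1"])
print(averages)
```

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>On this toy data the macro average (0.625) is lower than the micro and weighted averages (both about 0.667) because the smaller class has the lower precision and macro averaging gives every class equal weight.</p>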
    


In [None]:
ClassificationEvaluator_obj = ClassificationEvaluator(data=pred_df,
                                                          observation_column='diagnosis',
                                                          prediction_column='diagnosis_prediction',
                                                          labels=['0','1'])


In [None]:
df_metrics = ClassificationEvaluator_obj.output_data
df_metrics

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>8.2 Show AUC-ROC Curve</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The <a href = 'https://docs.teradata.com/search/all?query=TD_ROC&content-lang=en-US'>ROC</a> curve shows the performance of a binary classification model as its discrimination threshold varies. For a range of thresholds, the curve plots the true-positive rate against the false-positive rate.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>This function accepts a set of prediction-actual pairs as input and calculates the following values for a range of discrimination thresholds.</p>
    <ul style = 'font-size:16px;font-family:Arial;color:#00233C'>
        <li style = 'font-size:16px;font-family:Arial;color:#00233C'>True-positive rate (TPR)</li>
        <li style = 'font-size:16px;font-family:Arial;color:#00233C'>False-positive rate (FPR)</li>
        <li style = 'font-size:16px;font-family:Arial;color:#00233C'>The area under the ROC curve (AUC)</li>
        <li style = 'font-size:16px;font-family:Arial;color:#00233C'>Gini coefficient</li>
        <li style = 'font-size:16px;font-family:Arial;color:#00233C'>Other details are mentioned in the documentation</li>
    </ul>
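<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The values listed above can be sketched in plain Python on toy data. This is an illustration under stated assumptions, not the ROC function's implementation: the AUC is approximated with the trapezoidal rule, the Gini coefficient uses the common relation Gini = 2 * AUC - 1 (assumed here, not taken from the function's documentation), and both classes must be present. All names and data are hypothetical.</p>

```python
def roc_points(y_true, scores, positive="1"):
    """Return (FPR, TPR) pairs, one per distinct threshold, starting at (0, 0).

    Assumes y_true contains at least one positive and one negative example.
    """
    positives = sum(1 for y in y_true if y == positive)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lowering the threshold step by step moves the point up and to the right.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == positive)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y != positive)
        points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Approximate the area under the curve with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy labels and scores, not taken from the notebook's data.
toy_true = ["1", "1", "0", "1", "0", "0"]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

toy_points = roc_points(toy_true, toy_scores)
toy_auc = auc_trapezoid(toy_points)
toy_gini = 2 * toy_auc - 1  # assumed relation between Gini and AUC
print(round(toy_auc, 4), round(toy_gini, 4))  # 0.8889 0.7778
```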



In [None]:
from teradataml import ROC 
roc_df = ROC(data = pred_df, 
                    probability_column = "pred",
                    observation_column = "diagnosis",
                    positive_class="1"
                    )
roc_df.output_data

In [None]:
auc = roc_df.result.get_values()[0][0]
auc

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Plot ROC Curves</b></p>

In [None]:
plot_roc_df = roc_df.output_data
plot =  plot_roc_df.plot(x=plot_roc_df.fpr, y=plot_roc_df.tpr,
                         title="Receiver Operating Characteristic (ROC) Curve",
                         xlabel='False Positive Rate', 
                         ylabel='True Positive Rate', 
                         color="blue",
                         legend=f'AUC = {round(auc, 4)}',
                         legend_style='lower right',
                         grid_linestyle='--',
                         grid_linewidth=0.5)
 
# Display the plot.
plot.show()

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The closer the ROC curve is to the upper left corner of the graph, the higher the accuracy of the test, because in the upper left corner the sensitivity = 1 and the false positive rate = 0 (specificity = 1). The ideal ROC curve thus has an AUC = 1.0. As seen in the graph above, the AUC for the model is close to 1, which indicates that the model separates the two classes very well.</p>

<hr style="height:1px;border:none;background-color:#00233C;">
<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>8.3 Show Confusion Matrix</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>A confusion matrix is a performance measurement for machine learning classification problems where the output can be two or more classes. For a binary problem such as this one, it is a table with 4 different combinations of predicted and actual values.</p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Confusion matrices represent counts from predicted and actual values. The output “TN” stands for True Negative which shows the number of negative examples classified accurately. Similarly, “TP” stands for True Positive which indicates the number of positive examples classified accurately. The term “FP” shows False Positive value, i.e., the number of actual negative examples classified as positive; and “FN” means a False Negative value which is the number of actual positive examples classified as negative.</p>
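<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The four counts described above can be tallied by hand, as in the plain-Python sketch below. The function name <code>binary_confusion_counts</code> and the toy labels are hypothetical. The sketch treats <code>'1'</code> as the positive class, matching the labels used in this notebook.</p>

```python
def binary_confusion_counts(y_true, y_pred, positive="1"):
    """Tally TN, FP, FN, and TP for a binary classification problem."""
    counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            # Actual positive: correct prediction is TP, otherwise FN.
            counts["TP" if predicted == positive else "FN"] += 1
        else:
            # Actual negative: predicted positive is FP, otherwise TN.
            counts["FP" if predicted == positive else "TN"] += 1
    return counts


# Toy labels, not taken from the notebook's data.
toy_true = ["0", "0", "0", "0", "1", "1"]
toy_pred = ["0", "0", "0", "1", "1", "0"]
toy_counts = binary_confusion_counts(toy_true, toy_pred)
print(toy_counts)  # {'TN': 3, 'FP': 1, 'FN': 1, 'TP': 1}
```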


In [None]:
confusion_matrix_df = pred_df.to_pandas(all_rows=True)
cm = confusion_matrix(confusion_matrix_df['diagnosis'], confusion_matrix_df['diagnosis_prediction'])
cmd = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['DoesNotHaveCancer', 'HasCancer'],)
cmd.plot(cmap='Blues', colorbar=False)
plt.show()

<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Conclusion</b></p>

<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Thus we have seen that with the Teradata Vantage API_Request feature, we can connect to AWS endpoints through a function to do real-time scoring on data. An Amazon SageMaker endpoint was used to orchestrate Extreme Gradient (XG) Boost model training and deploy the solution’s ML model. Vantage and ClearScape Analytics have helped drastically reduce data preparation, model development, and testing time, while allowing for much more frequent and iterative testing and tuning to ensure maximum life-critical accuracy.</p>


<hr style="height:2px;border:none;background-color:#00233C;">
<p style = 'font-size:20px;font-family:Arial;color:#00233C'><b>9. Cleanup</b></p>


<p style = 'font-size:18px;font-family:Arial;color:#00233C'><b>Databases and Tables</b></p>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>We will use the following code to clean up tables and databases created for this demonstration.</p>

In [None]:
%run -i ../run_procedure.py "call remove_data('DEMO_CancerPrediction');" 
#Takes 40 seconds

<div class="alert alert-block alert-danger">
<p style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Note: </b><i>Please make sure to delete the AWS SageMaker model and endpoint after use by running the code in the cell below. If they are not deleted, costs will keep accruing until they are removed.</i></p>
</div>

In [None]:
predictor.cloudObj.delete_model()
predictor.cloudObj.delete_endpoint()
remove_tdapi_context(tdapi_context)

In [None]:
remove_context()

<hr style="height:2px;border:none;background-color:#00233C;">
<b style = 'font-size:20px;font-family:Arial;color:#00233C'>Required Materials</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>Let’s look at the elements we have available for reference for this notebook:</p>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Dataset:</b>
<p style = 'font-size:16px;font-family:Arial;color:#00233C'>The dataset for this analysis has been taken from 
<a href = 'https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic'>Breast Cancer Wisconsin (Diagnostic) - UCI Machine Learning Repository.</a>

<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Filters:</b> 
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Industry:</b> Healthcare</li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Functionality:</b> Machine Learning</li> 
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><b>Use Case:</b> Prediction Analysis</li></p>
<b style = 'font-size:18px;font-family:Arial;color:#00233C'>Related Resources:</b>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><a href = 'https://www.teradata.com/Blogs/Predicting-Heart-Failure-with-Teradata'>Saving Lives, Saving Costs: Predicting Heart Failure with Teradata</a> </li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><a href = 'https://www.teradata.com/Blogs/Hyper-scale-time-series-forecasting-done-right'>Hyper-scale time series forecasting done right</a></li>
<li style = 'font-size:16px;font-family:Arial;color:#00233C'><a href = 'https://www.teradata.com/Blogs/Forecasting-COVID-19-Using-Teradata-Vantage'>Forecasting COVID-19 Using Teradata Vantage</a></li>



<footer style="padding-bottom:35px; background:#f9f9f9; border-bottom:3px solid #00233C">
    <div style="float:left;margin-top:14px">ClearScape Analytics™</div>
    <div style="float:right;">
        <div style="float:left; margin-top:14px">
            © 2023, 2024 Teradata. All rights reserved.
        </div>
    </div>
</footer>